• Stars
    star
    104
  • Rank 330,604 (Top 7 %)
  • Language
    Python
  • License
    Apache License 2.0
  • Created over 6 years ago
  • Updated about 1 month ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

locopy: Loading/Unloading to Redshift and Snowflake using Python.

locopy: Data Load and Copy using Python

A Python library to assist with ETL processing for:

  • Amazon Redshift (COPY, UNLOAD)
  • Snowflake (COPY INTO <table>, COPY INTO <location>)

In addition:

  • The library supports Python 3.8 to 3.10
  • DB Driver (Adapter) agnostic. Use your favourite driver that complies with DB-API 2.0
  • It provides functionality to download and upload data to S3 buckets, and internal stages (Snowflake)

Quick Installation

pip install locopy

or install from conda-forge

conda config --add channels conda-forge
conda install locopy

Installation instructions

A virtual or conda environment is highly recommended

$ virtualenv locopy
$ source locopy/bin/activate
$ pip install --upgrade setuptools pip
$ pip install locopy

Python Database API Specification 2.0

Rather than using a specific Python DB Driver / Adapter for Postgres (which should supports Amazon Redshift or Snowflake), locopy prefers to be agnostic. As an end user you can use any Python Database API Specification 2.0 package.

The following packages have been tested:

  • psycopg2
  • pg8000
  • snowflake-connector-python

You can use which ever one you prefer by importing the package and passing it into the constructor input dbapi.

Usage

You need to store your connection parameters in a YAML file (or pass them in directly). The YAML would consist of the following items:

# required to connect to redshift
host: my.redshift.cluster.com
port: 5439
database: db
user: userid
password: password
## optional extras for the dbapi connector
sslmode: require
another_option: 123

If you aren't loading data, you don't need to have AWS tokens set up. The Redshift connection (Redshift) can be used like this:

import pg8000
import locopy

with locopy.Redshift(dbapi=pg8000, config_yaml="config.yml") as redshift:
    redshift.execute("SELECT * FROM schema.table")
    df = redshift.to_dataframe()
print(df)

If you want to load data to Redshift via S3, the Redshift class inherits from S3:

import pg8000
import locopy

with locopy.Redshift(dbapi=pg8000, config_yaml="config.yml") as redshift:
    redshift.execute("SET query_group TO quick")
    redshift.execute("CREATE TABLE schema.table (variable VARCHAR(20)) DISTKEY(variable)")
    redshift.load_and_copy(
        local_file="example/example_data.csv",
        s3_bucket="my_s3_bucket",
        table_name="schema.table",
        delim=",")
    redshift.execute("SELECT * FROM schema.table")
    res = redshift.cursor.fetchall()

print(res)

If you want to download data from Redshift to a CSV, or read it into Python

my_profile = "some_profile_with_valid_tokens"
with locopy.Redshift(dbapi=pg8000, config_yaml="config.yml", profile=my_profile) as redshift:
    ##Optionally provide export if you ALSO want the exported data copied to a flat file
    redshift.unload_and_copy(
        query="SELECT * FROM schema.table",
        s3_bucket="my_s3_bucket",
        export_path="my_output_destination.csv")

Note on tokens

To load data to S3, you will need to be able to generate AWS tokens, or assume the IAM role on a EC2 instance. There are a few options for doing this, depending on where you're running your script and how you want to handle tokens. Once you have your tokens, they need to be accessible to the AWS command line interface. See http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html#config-settings-and-precedence for more information, but you can:

  • Populate environment variables AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, etc.
  • Leverage the AWS credentials file. If you have multiple profiles configured you can either call locopy.Redshift(profile="my-profile"), or set up an environment variable AWS_DEFAULT_PROFILE.
  • If you are on a EC2 instance you can assume the credentials associated with the IAM role attached.

Advanced Usage

See the docs for more detailed usage instructions and examples including Snowflake.

Contributors

We welcome and appreciate your contributions! Before we can accept any contributions, we ask that you please be sure to sign the Contributor License Agreement (CLA).

This project adheres to the Open Source Code of Conduct. By participating, you are expected to honor this code.

Roadmap

Roadmap details can be found here

More Repositories

1

DataProfiler

What's in your data? Extract schema, statistics and entities from datasets
Python
1,426
star
2

react-native-pathjs-charts

Android and iOS charts based on react-native-svg and paths-js
JavaScript
878
star
3

datacompy

Pandas, Polars, and Spark DataFrame comparison for humans and more!
Python
480
star
4

cqrs-manager-for-distributed-reactive-services

Experimental CQRS and Event Sourcing service
Java
304
star
5

SWHttpTrafficRecorder

A simple library empowering you to record/capture HTTP(s) traffic of an iOS app for mocking/stubbing later.
Objective-C
205
star
6

fpe

A format-preserving encryption implementation in Go
Go
202
star
7

rubicon-ml

Capture all information throughout your model's development in a reproducible way and tie results directly to the model code!
Jupyter Notebook
127
star
8

giraffez

User-friendly Teradata client for Python
Python
108
star
9

checks-out

Checks-Out pull request approval system
Go
76
star
10

dataCompareR

dataCompareR is an R package that allows users to compare two datasets and view a report on the similarities and differences.
R
75
star
11

stack-deployment-tool

Go
66
star
12

bash_shell_mock

A shell script mocking utility/framework for the BASH shell
Shell
66
star
13

architecture-viewer

Visualize your PlantUML sequence diagrams as interactive architecture diagrams!
JavaScript
60
star
14

go-future-context

A simple Future (Promise) library for Go.
Go
54
star
15

AI_Dictionary_English_Spanish

TeX
49
star
16

acronym-decoder

Acronym Decoder
TypeScript
43
star
17

synthetic-data

Generating complex, nonlinear datasets appropriate for use with deep learning/black box models which 'need' nonlinearityโ€ฉ
Python
43
star
18

Particle-Cloud-Framework

Python
36
star
19

slackbot-destroyer

๐Ÿ“ฃ โŒ Slack integration that can destroy all incoming messages from Slackbot.
Python
34
star
20

global-attribution-mapping

GAM (Global Attribution Mapping) explains the landscape of neural network predictions across subpopulations
Python
33
star
21

federated-model-aggregation

The Federated Model Aggregation (FMA) Service is a collection of installable python components that make up the generic workflow/infrastructure needed for federated learning.
Python
30
star
22

oas-nodegen

A library for generating completely customizable code from the Open API Specification (FKA Swagger) RESTful API documentation using the scripting power of Node.js.
JavaScript
28
star
23

easy-screenshots

Android Instrumentation Test Screenshots made Easy.
Java
21
star
24

edgetest

edgetest is a tox-inspired python library that will loop through your project's dependencies, and check if your project is compatible with the latest version of each dependency
Python
19
star
25

ablation

Evaluating XAI methods through ablation studies.
Python
15
star
26

serverless-shell

โšก๏ธ๐Ÿš Serverless Shell with environment variables plugin
JavaScript
14
star
27

OAuthClient

Awesome OAuth Client for Java.
Java
13
star
28

otvPlots

ovtPlots: An R Package for Variable Level Monitoring
R
13
star
29

json-syntax

Generates functions to convert Python classes to and from JSON friendly objects.
Python
12
star
30

screen-object

screen-object (ruby gem for mobile app automation)
Ruby
12
star
31

jwt-security

JavaScript
11
star
32

BankAccountStarter-API-reference-app

CSS
10
star
33

CreditOffers-API-reference-app

JavaScript
10
star
34

Rewards-API-reference-app

JavaScript
10
star
35

local-crontab

๐Ÿ—บ๏ธโฐ Convert local crontabs to UTC crontabs
JavaScript
8
star
36

modtracker

JSON unmarshaling in Go that includes detection of modified fields
Go
7
star
37

grpc-cucumber-js

JavaScript
7
star
38

edgetest-hub

hub plugin for edgetest
Python
2
star
39

oas-nodegen-example

Example project that shows how to customize generated code to fit a specific design pattern using oas-nodegen
Java
2
star
40

edgetest-conda

Conda plugin for edgetest
Python
1
star
41

edgetest-pip-tools

pip-tools plugin for edgetest
Python
1
star