• This repository has been archived on 02/May/2023
  • Language
    Clojure
  • License
    Eclipse Public Li...
  • Created over 10 years ago
  • Updated almost 3 years ago

Repository Details

Automate copying data from S3 into Amazon Redshift

Blueshift

Service to watch Amazon S3 and automate the load into Amazon Redshift.

Gravitational Blueshift (Image used under CC Attribution Share-Alike License).

Rationale

Amazon Redshift is "a fast, fully managed, petabyte-scale data warehouse service", but importing data into it can be a bit tricky: for example, if you want upsert behaviour you have to implement it yourself with temporary tables, and we've had problems importing across machines into the same tables. Redshift also performs best when bulk importing lots of large files from S3.

Blueshift is a little service(tm) that makes it easy to automate the loading of data from Amazon S3 and into Amazon Redshift. It will periodically check for data files within a designated bucket and, when new files are found, import them. It provides upsert behaviour by default.

Importing to Redshift now requires just the ability to write files to S3.

Using

Configuring

Blueshift requires minimal configuration. It currently monitors only a single S3 bucket, so the configuration file (ordinarily stored in ./etc/config.edn) looks like this:

{:s3 {:bucket        "blueshift-data"
      :key-pattern   ".*"
      :poll-interval {:seconds 30}}
 :telemetry {:reporters [uswitch.blueshift.telemetry/log-metrics-reporter]}}

The :key-pattern option is used to filter for specific keys (so you can have a single bucket with data from different environments, systems etc.).
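As a rough illustration of how a pattern like this narrows down keys, the Python sketch below filters some hypothetical key names with a regular expression. The key names and the `filter_keys` helper are invented for the example, and it assumes the pattern has to match the whole key. Blueshift itself is written in Clojure and its actual matching rules may differ.

```python
import re

def filter_keys(keys, key_pattern):
    """Return the keys matched in full by key_pattern (full match is an assumption)."""
    compiled = re.compile(key_pattern)
    return [key for key in keys if compiled.fullmatch(key)]

# Hypothetical keys from a bucket shared by two environments.
keys = [
    "production/orders/manifest.edn",
    "production/orders/0001.tsv",
    "staging/orders/manifest.edn",
]

# Only keys under production/ survive the filter.
production_keys = filter_keys(keys, "production/.*")
```

With the default pattern of ".*" every key is kept, which matches the configuration shown above.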

Building & Running

The application is written in Clojure; to build the project you'll need Leiningen.

If you want to run the application on your computer you can run it directly with Leiningen (providing the path to your configuration file):

$ lein run -- --config ./etc/config.edn

Alternatively, you can build an Uberjar that you can run:

$ lein uberjar
$ java -Dlogback.configurationFile=./etc/logback.xml -jar target/blueshift-0.1.0-standalone.jar --config ./etc/config.edn

The uberjar includes Logback for logging. ./etc/logback.xml.example provides a simple starter configuration file with a console appender.

Using

Once the service is running you can create any number of directories in the S3 bucket. These are checked periodically for files and, if any are found, an import is triggered. For a directory's contents to be imported, it must contain a file called manifest.edn, which tells Blueshift which Redshift cluster to import to and how to interpret the data files.

Your S3 structure could look like this:

  bucket
  ├── directory-a
  │   └── foo
  │       └── manifest.edn
  │       └── 0001.tsv
  │       └── 0002.tsv
  └── directory-b
      └── manifest.edn

and the manifest.edn could look like this:

{:table        "testing"
 :pk-columns   ["foo"]
 :columns      ["foo" "bar"]
 :jdbc-url     "jdbc:postgresql://foo.eu-west-1.redshift.amazonaws.com:5439/db?tcpKeepAlive=true&user=user&password=pwd"
 :options      ["DELIMITER '\\t'" "IGNOREHEADER 1" "ESCAPE" "TRIMBLANKS"]
 :data-pattern ".*tsv$"}

When a manifest and data files are found an import is triggered. Once the import has been successfully committed Blueshift will delete any data files that were imported; the manifest remains ready for new data files to be imported.

It's important that :columns lists all the columns (and only the columns) included within the data file and that they are in the same order. :pk-columns must contain a uniquely identifying primary key to ensure the correct upsert behaviour. :options can be used to override the Redshift copy options used during the load.
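To make the role of these fields more concrete, here is a Python sketch of the general staging-table upsert pattern that Redshift users commonly implement by hand. It only builds SQL strings from manifest-like values. The `upsert_statements` helper, the staging table name, the `copy_manifest_url` argument and the credentials placeholder are all hypothetical, and the statements are an assumption about the general shape of such a load, not Blueshift's actual SQL.

```python
def upsert_statements(table, pk_columns, columns, options, copy_manifest_url,
                      credentials="<aws-credentials>"):
    """Build SQL for a staging-table upsert; intended to run in one transaction."""
    staging = f"{table}_staging"  # hypothetical staging table name
    column_list = ", ".join(columns)
    # Rows in the target that share a primary key with staged rows get replaced.
    pk_match = " AND ".join(f"{table}.{c} = {staging}.{c}" for c in pk_columns)
    return [
        f"CREATE TEMPORARY TABLE {staging} (LIKE {table})",
        f"COPY {staging} ({column_list}) FROM '{copy_manifest_url}' "
        f"CREDENTIALS '{credentials}' {' '.join(options)} MANIFEST",
        f"DELETE FROM {table} USING {staging} WHERE {pk_match}",
        f"INSERT INTO {table} SELECT * FROM {staging}",
        f"DROP TABLE {staging}",
    ]

# Values mirror the example manifest.edn; the S3 URL is made up.
statements = upsert_statements(
    table="testing",
    pk_columns=["foo"],
    columns=["foo", "bar"],
    options=["DELIMITER '\\t'", "IGNOREHEADER 1", "ESCAPE", "TRIMBLANKS"],
    copy_manifest_url="s3://blueshift-data/directory-a/foo/example.manifest",
)
```

The delete-then-insert step is why :pk-columns has to identify rows uniquely: any existing row that matches a staged row on those columns is removed before the new data is inserted.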

Blueshift creates a temporary Amazon Redshift COPY manifest that lists all the data files found as mandatory for importing. This also makes it very efficient when loading lots of files into a highly distributed cluster.
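A Redshift COPY manifest is a JSON document with an `entries` list, where each entry gives the `url` of a file and a `mandatory` flag. The Python sketch below builds one for the data files in the example directory layout. The `copy_manifest` helper is hypothetical and is not taken from Blueshift's source.

```python
import json

def copy_manifest(bucket, keys):
    """Build a Redshift COPY manifest that marks every file as mandatory."""
    return {
        "entries": [
            {"url": f"s3://{bucket}/{key}", "mandatory": True}
            for key in keys
        ]
    }

# Data files from the example directory structure.
manifest = copy_manifest(
    "bucket",
    ["directory-a/foo/0001.tsv", "directory-a/foo/0002.tsv"],
)

# Serialised form, as it would be uploaded to S3 before running COPY ... MANIFEST.
manifest_json = json.dumps(manifest, indent=2)
```

Marking each entry as mandatory means the COPY fails if any listed file is missing, rather than silently loading a partial set.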

Metrics

Blueshift tracks a few metrics using https://github.com/sjl/metrics-clojure. Currently these are logged to the Slf4j logger.

Starting the app will (eventually) show something like this:

[metrics-logger-reporter-thread-1] INFO user - type=COUNTER, name=uswitch.blueshift.s3.directories-watched.directories, count=0
[metrics-logger-reporter-thread-1] INFO user - type=METER, name=uswitch.blueshift.redshift.redshift-imports.commits, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/second
[metrics-logger-reporter-thread-1] INFO user - type=METER, name=uswitch.blueshift.redshift.redshift-imports.imports, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/second
[metrics-logger-reporter-thread-1] INFO user - type=METER, name=uswitch.blueshift.redshift.redshift-imports.rollbacks, count=0, mean_rate=0.0, m1=0.0, m5=0.0, m15=0.0, rate_unit=events/second

Riemann Metrics

Reporting metrics to Riemann can be achieved using the https://github.com/uswitch/blueshift-riemann-metrics project. To enable support you'll need to build the project:

$ cd blueshift-riemann-metrics
$ lein uberjar

And then change the ./etc/config.edn to reference the riemann reporter:

:telemetry {:reporters [uswitch.blueshift.telemetry/log-metrics-reporter
                        uswitch.blueshift.telemetry.riemann/riemann-metrics-reporter]}

Then, when you build and run Blueshift, be sure to add the jar to the classpath (the following assumes you're in the blueshift working directory):

$ cp blueshift-riemann-metrics/target/blueshift-riemann-metrics-0.1.0-standalone.jar ./target
$ java -cp "target/*" uswitch.blueshift.main --config ./etc/config.edn

Obviously for a production deployment you'd probably want to automate this with your continuous integration server of choice :)

TODO

  • Add exception handling when cleaning uploaded files from S3
  • Change KeyWatcher to identify when directories are deleted, so it can exit the watcher process and remove the directory from the list of watched directories. If the directory is added again later, a new process can simply be created.
  • Add a safety check when processing data files: ensure that the header line of the TSV file matches the contents of manifest.edn

Authors

License

Copyright © 2014 uSwitch.com Limited.

Distributed under the Eclipse Public License either version 1.0 or (at your option) any later version.
