• Stars
    star
    377
  • Rank 109,480 (Top 3 %)
  • Language
    Ruby
  • License
    Other
  • Created over 11 years ago
  • Updated about 1 year ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Extract, Transform, and Load data with Ruby

ETL

Extract, transform, and load data with ruby!

Installation

Add this line to your application's Gemfile:

gem 'ETL'

And then execute:

$ bundle

Or install it yourself as:

$ gem install ETL

ETL Dependencies

ETL depends on having a database connection object that must respond to #query. The mysql2 gem is a good option. You can also proxy another library using Ruby's SimpleDelegator and add a #query method if need be.

The gem comes bundled with a default logger. If you'd like to write your own just make sure that it implements #debug and #info. For more information on what is logged and when, view the logger details.

Basic ETL

Assume that we have a database connection represented by connection.

To run a basic ETL that is composed of sequential SQL statements, start by creating a new ETL instance:

# setting connection at the class level
ETL.connection = connection

etl = ETL.new(description: "a description of what this ETL does")

or

# setting connection at the instance level
etl = ETL.new(
  description: "a description of what this ETL does",
  connection:  connection
)

which can then be configured:

etl.config do |etl|
  etl.ensure_destination do |etl|
    # For most ETLs you may want to ensure that the destination exists, so the
    # #ensure_destination block is ideally suited to fulfill this requirement.
    #
    # By way of example:
    #
    etl.query %[
      CREATE TABLE IF NOT EXISTS some_database.some_destination_table (
          user_id INT UNSIGNED NOT NULL
        , created_date DATE NOT NULL
        , total_amount INT SIGNED NOT NULL
        , message VARCHAR(100) DEFAULT NULL
        , PRIMARY KEY (user_id, created_date)
        , KEY (created_date)
      )
    ]
  end

  etl.before_etl do |etl|
    # All pre-ETL work is performed in this block.
    #
    # This can be thought of as a before-ETL hook that will fire only once. When
    # you are not leveraging the ETL iteration capabilities, the value of this
    # block vs the #etl block is not very clear. We will see how and when to
    # leverage this block effectively when we introduce iteration.
    #
    # As an example, let's say we want to get rid of all entries that have an
    # amount less than zero before moving on to our actual etl:
    #
    etl.query %[DELETE FROM some_database.some_source_table WHERE amount < 0]
  end

  etl.etl do |etl|
    # Here is where the magic happens! This block contains the main ETL
    # operation.
    #
    # For example:
    #
    etl.query %[
      REPLACE INTO some_database.some_destination_table (
          user_id
        , created_date
        , total_amount
      ) SELECT
          user_id
        , DATE(created_at) AS created_date
        , SUM(amount) AS total_amount
      FROM
        some_database.some_source_table sst
      GROUP BY
          sst.user_id
        , DATE(sst.created_at)
    ]
  end

  etl.after_etl do |etl|
    # All post-ETL work is performed in this block.
    #
    # Again, to finish up with an example:
    #
    etl.query %[
      UPDATE some_database.some_destination_table
      SET message = "WOW"
      WHERE total_amount > 100
    ]
  end
end

At this point it is possible to run the ETL instance via:

etl.run

which executes #ensure_destination, #before_etl, #etl, and #after_etl in that order.

ETL with iteration

To add in iteration, simply supply #start, #step, and #stop blocks. This is useful when dealing with large data sets or when executing queries that, while optimized, are still slow.

Again, to kick things off:

etl = ETL.new(
  description: "a description of what this ETL does",
  connection:  connection
)

where connection is the same as described above.

Next we can configure the ETL:

# assuming we have the ETL instance from above
etl.config do |etl|
  etl.ensure_destination do |etl|
    # For most ETLs you may want to ensure that the destination exists, so the
    # #ensure_destination block is ideally suited to fulfill this requirement.
    #
    # By way of example:
    #
    etl.query %[
      CREATE TABLE IF NOT EXISTS some_database.some_destination_table (
          user_id INT UNSIGNED NOT NULL
        , created_date DATE NOT NULL
        , total_amount INT SIGNED NOT NULL
        , message VARCHAR(100) DEFAULT NULL
        , PRIMARY KEY (user_id, created_date)
        , KEY (created_date)
      )
    ]
  end

  etl.before_etl do |etl|
    # All pre-ETL work is performed in this block.
    #
    # Now that we are leveraging iteration the #before_etl block becomes
    # more useful as a way to execute an operation once before we begin
    # our iteration.
    #
    # As an example, let's say we want to get rid of all entries that have an
    # amount less than zero before moving on to our actual etl:
    #
    etl.query %[
      DELETE FROM some_database.some_source_table
      WHERE amount < 0
    ]
  end

  etl.start do |etl|
    # This defines where the ETL should start. This can be a flat number
    # or date, or even SQL / other code can be executed to produce a starting
    # value.
    #
    # Usually, this is the last known entry for the destination table with
    # some sensible default if the destination does not yet contain data.
    #
    # As an example:
    #
    # Note that we cast the default date as a DATE. If we don't, it will be
    # treated as a string and our iterator will fail under the hood when testing
    # if it is complete.
    res = etl.query %[
      SELECT COALESCE(MAX(created_date), DATE('2010-01-01')) AS the_max
      FROM some_database.some_destination_table
    ]

    res.to_a.first['the_max']
  end

  etl.step do |etl|
    # The step block defines the size of the iteration block. To iterate by
    # ten records, the step block should be set to return 10.
    #
    # As an alternative example, to set the iteration to go 10,000 units
    # at a time, the following value should be provided:
    #
    #   10_000 (Note: an underscore is used for readability)
    #
    # As an example, to iterate 7 days at a time:
    #
    7
  end

  etl.stop do |etl|
    # The stop block defines when the iteration should halt.
    # Again, this can be a flat value or code. Either way, one value *must* be
    # returned.
    #
    # As a flat value:
    #
    #   1_000_000
    #
    # Or a date value:
    #
    #   Time.now.to_date
    #
    # Or as a code example:
    #
    res = etl.query %[
      SELECT DATE(MAX(created_at)) AS the_max
      FROM some_database.some_source_table
    ]

    res.to_a.first['the_max']
  end

  etl.etl do |etl, lbound, ubound|
    # The etl block is the main part of the framework. Note: there are
    # two extra args with the iterator this time around: "lbound" and "ubound"
    #
    # "lbound" is the lower bound of the current iteration. When iterating
    # from 0 to 10 and stepping by 2, the lbound would equal 2 on the
    # second iteration.
    #
    # "ubound" is the upper bound of the current iteration. In continuing with the
    # example above, when iterating from 0 to 10 and stepping by 2, the ubound would
    # equal 4 on the second iteration.
    #
    # These args can be used to "window" SQL queries or other code operations.
    #
    # As a first example, to iterate over a set of ids:
    #
    #   etl.query %[
    #     REPLACE INTO some_database.some_destination_table (
    #         created_date
    #       , user_id
    #       , total_amount
    #     ) SELECT
    #         DATE(sst.created_at) AS created_date
    #       , sst.user_id
    #       , SUM(sst.amount) AS total_amount
    #     FROM
    #       some_database.some_source_table sst
    #     WHERE
    #       sst.user_id > #{lbound} AND sst.user_id <= #{ubound}
    #     GROUP BY
    #         DATE(sst.created_at)
    #       , sst.user_id]
    #
    # To "window" a SQL query using dates:
    #
    etl.query %[
      REPLACE INTO some_database.some_destination_table (
          created_date
        , user_id
        , total_amount
      ) SELECT
          DATE(sst.created_at) AS created_date
        , sst.user_id
        , SUM(sst.amount) AS total_amount
      FROM
        some_database.some_source_table sst
      WHERE
        -- Note the usage of quotes surrounding the lbound and ubound vars.
        -- This is is required when dealing with dates / datetimes
        sst.created_at >= '#{lbound}' AND sst.created_at < '#{ubound}'
      GROUP BY
          DATE(sst.created_at)
        , sst.user_id
    ]

    # Note that there is no sql sanitization here so there is *potential* for SQL
    # injection. That being said you'll likely be using this gem in an internal
    # tool so hopefully your co-workers are not looking to sabotage your ETL
    # pipeline. Just be aware of this and handle it as you see fit.
  end

  etl.after_etl do |etl|
    # All post-ETL work is performed in this block.
    #
    # Again, to finish up with an example:
    #
    etl.query %[
      UPDATE some_database.some_destination_table
      SET message = "WOW"
      WHERE total_amount > 100
    ]
  end
end

At this point it is possible to run the ETL instance via:

etl.run

which executes #ensure_destination, #before_etl, #etl, and #after_etl in that order.

Note that #etl executes #start and #stop once and memoizes the result for each. It then begins to iterate from what #start evaluated to up until what #stop evaluated to by what #step evaluates to.

Examples

There are two examples found in ./examples that demonstrate the basic ETL and iteration ETL. Each file uses the mysql2 gem and reads / writes data to localhost using the root user with no password. Adjust as needed.

Logger Details

A logger must support two methods: #info and #warn.

Both methods should accept a single hash argument. The argument will contain:

  • :emitter => a reference to the ETL instance's self
  • :event_type => a symbol that includes the type of event being logged. You can use this value to derive which other data you'll have available

When :event_type is equal to :query_start, you'll have the following available in the hash argument:

  • :sql => the sql that is going to be run

These events are logged at the debug level.

When :event_type is equal to :query_complete, you'll have the following available in the hash argument:

  • :sql => the sql that was run
  • :runtime => how long the query took to execute

These events are logged at the info level.

Following from this you could implement a simple logger as:

class PutsLogger
  def info data
    @data = data
    write!
  end

  def debug data
    @data = data
    write!
  end

private

  def write!
    case (event_type = @data.delete(:event_type))
    when :query_start
      output =  "#{@data[:emitter].description} is about to run\n"
      output += "#{@data[:sql]}\n"
    when :query_complete
      output =  "#{@data[:emitter].description} executed:\n"
      output += "#{@data[:sql]}\n"
      output += "query completed at #{Time.now} and took #{@data[:runtime]}s\n"
    else
      output = "no special logging for #{event_type} event_type yet\n"
    end
    puts output
    @data = nil
  end
end

Contributing

If you would like to contribute code to ETL you can do so through GitHub by forking the repository and sending a pull request.

When submitting code, please make every effort to follow existing conventions and style in order to keep the code as readable as possible.

Before your code can be accepted into the project you must also sign the Individual Contributor License Agreement (CLA).

License

Copyright 2013 Square Inc.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

More Repositories

1

okhttp

Squareโ€™s meticulous HTTP client for the JVM, Android, and GraalVM.
Kotlin
45,250
star
2

retrofit

A type-safe HTTP client for Android and the JVM
HTML
42,581
star
3

leakcanary

A memory leak detection library for Android.
Kotlin
29,130
star
4

picasso

A powerful image downloading and caching library for Android
Kotlin
18,656
star
5

javapoet

A Java API for generating .java source files.
Java
10,691
star
6

moshi

A modern JSON library for Kotlin and Java.
Kotlin
9,502
star
7

okio

A modern I/O library for Android, Java, and Kotlin Multiplatform.
Kotlin
8,667
star
8

dagger

A fast dependency injector for Android and Java.
Java
7,317
star
9

crossfilter

Fast n-dimensional filtering and grouping of records.
JavaScript
6,222
star
10

PonyDebugger

Remote network and data debugging for your native iOS app using Chrome Developer Tools
Objective-C
5,867
star
11

maximum-awesome

Config files for vim and tmux.
Ruby
5,706
star
12

otto

An enhanced Guava-based event bus with emphasis on Android support.
Java
5,174
star
13

cubism

Cubism.js: A JavaScript library for time series visualization.
JavaScript
4,930
star
14

sqlbrite

A lightweight wrapper around SQLiteOpenHelper which introduces reactive stream semantics to SQL operations.
Java
4,574
star
15

android-times-square

Standalone Android widget for picking a single date from a calendar view.
Java
4,437
star
16

wire

gRPC and protocol buffers for Android, Kotlin, Swift and Java.
Kotlin
4,174
star
17

Valet

Valet lets you securely store data in the iOS, tvOS, or macOS Keychain without knowing a thing about how the Keychain works. Itโ€™s easy. We promise.
Swift
3,957
star
18

cube

Cube: A system for time series visualization.
JavaScript
3,912
star
19

kotlinpoet

A Kotlin API for generating .kt source files.
Kotlin
3,805
star
20

java-code-styles

IntelliJ IDEA code style settings for Square's Java and Android projects.
Shell
2,957
star
21

flow

Name UI states, navigate between them, remember where you've been.
Java
2,789
star
22

spoon

Distributing instrumentation tests to all your Androids.
HTML
2,700
star
23

keywhiz

A system for distributing and managing secrets
Java
2,614
star
24

tape

A lightning fast, transactional, file-based FIFO for Android and Java.
Java
2,459
star
25

certstrap

Tools to bootstrap CAs, certificate requests, and signed certificates.
Go
2,194
star
26

mortar

A simple library that makes it easy to pair thin views with dedicated controllers, isolated from most of the vagaries of the Activity life cycle.
Java
2,159
star
27

go-jose

An implementation of JOSE standards (JWE, JWS, JWT) in Go
1,981
star
28

Cleanse

Lightweight Swift Dependency Injection Framework
Swift
1,773
star
29

assertj-android

A set of AssertJ helpers geared toward testing Android.
Java
1,578
star
30

haha

DEPRECATED Java library to automate the analysis of Android heap dumps.
Java
1,436
star
31

phrase

Phrase is an Android string resource templating library
Java
1,403
star
32

cane

Code quality threshold checking as part of your build
Ruby
1,325
star
33

seismic

Android device shake detection.
Java
1,275
star
34

anvil

A Kotlin compiler plugin to make dependency injection with Dagger 2 easier.
Kotlin
1,265
star
35

sudo_pair

Plugin for sudo that requires another human to approve and monitor privileged sudo sessions
Rust
1,230
star
36

spacecommander

Commit fully-formatted Objective-C as a team without even trying.
Objective-C
1,126
star
37

square.github.io

A simple, static portal which outlines our open source offerings.
CSS
1,108
star
38

workflow

A Swift and Kotlin library for making composable state machines, and UIs driven by those state machines.
Shell
1,107
star
39

workflow-kotlin

A Swift and Kotlin library for making composable state machines, and UIs driven by those state machines.
Kotlin
982
star
40

certigo

A utility to examine and validate certificates in a variety of formats
Go
917
star
41

logcat

I CAN HAZ LOGZ?
Kotlin
893
star
42

radiography

Text-ray goggles for your Android UI.
Kotlin
831
star
43

whorlwind

Makes fingerprint encryption a breeze.
Java
819
star
44

dagger-intellij-plugin

An IntelliJ IDEA plugin for Dagger which provides insight into how injections and providers are used.
Java
798
star
45

cycler

Kotlin
793
star
46

Paralayout

Paralayout is a set of simple, useful, and straightforward utilities that enable pixel-perfect layout in iOS. Your designers will love you.
Swift
771
star
47

apropos

A simple way to serve up appropriate images for every visitor.
Ruby
766
star
48

shift

shift is an application that helps you run schema migrations on MySQL databases
Ruby
735
star
49

coordinators

Simple MVWhatever for Android
Java
703
star
50

subzero

Block's Bitcoin Cold Storage solution.
C
667
star
51

Blueprint

Declarative UI construction for iOS, written in Swift
Swift
659
star
52

shuttle

String extraction, translation and export tools for the 21st century. "Moving strings around so you don't have to"
Ruby
657
star
53

gifencoder

A pure Java library implementing the GIF89a specification. Suitable for use on Android.
Java
654
star
54

pollexor

Java client for the Thumbor image service which allows you to build URIs in an expressive fashion using a fluent API.
Java
633
star
55

intro-to-d3

a D3.js tutorial
CSS
603
star
56

kochiku

Shard your builds for fun and profit
Ruby
602
star
57

curtains

Lift the curtain on Android Windows!
Kotlin
570
star
58

RxIdler

An IdlingResource for Espresso which wraps an RxJava Scheduler.
Java
512
star
59

svelte-store

TypeScript
494
star
60

field-kit

FieldKit lets you take control of your text fields.
JavaScript
463
star
61

burst

A unit testing library for varying test data.
Java
462
star
62

SuperDelegate

SuperDelegate provides a clean application delegate interface and protects you from bugs in the application lifecycle
Swift
454
star
63

otto-intellij-plugin

An IntelliJ IDEA plugin to navigate between events posted by Otto.
Java
453
star
64

js-jose

JavaScript library to encrypt/decrypt data in JSON Web Encryption (JWE) format and to sign/verify data in JSON Web Signature (JWS) format. Leverages Browser's native WebCrypto API.
JavaScript
424
star
65

sharkey

Sharkey is a service for managing certificates for use by OpenSSH
Go
390
star
66

connect-api-examples

Code samples demonstrating the functionality of the Square Connect API
JavaScript
381
star
67

fdoc

Documentation format and verification
Ruby
379
star
68

lgtm

Simple object validation for JavaScript.
JavaScript
366
star
69

laravel-hyrule

Object-oriented, composable, fluent API for writing validations in Laravel
PHP
341
star
70

in-app-payments-flutter-plugin

Flutter Plugin for Square In-App Payments SDK
Objective-C
332
star
71

papa

PAPA: Performance of Android Production Applications
Kotlin
331
star
72

pysurvival

Open source package for Survival Analysis modeling
HTML
319
star
73

pylink

Python Library for device debugging/programming via J-Link
Python
317
star
74

workflow-swift

A Swift and Kotlin library for making composable state machines, and UIs driven by those state machines.
Swift
302
star
75

rails-auth

Modular resource-based authentication and authorization for Rails/Rack
Ruby
288
star
76

cocoapods-generate

A CocoaPods plugin that allows you to easily generate a workspace from a podspec.
Ruby
272
star
77

inspect

inspect is a collection of metrics gathering, analysis utilities for various subsystems of linux, mysql and postgres.
Go
267
star
78

Aardvark

Aardvark is a library that makes it dead simple to create actionable bug reports.
Objective-C
257
star
79

jetpack

jet.pack: package your JRuby rack app for Jetty.
Ruby
249
star
80

gradle-dependencies-sorter

A CLI app and Gradle plugin to sort the dependencies in your Gradle build scripts
Kotlin
242
star
81

luhnybin

Shell
232
star
82

auto-value-redacted

An extension for Google's AutoValue that omits redacted fields from toString().
Java
211
star
83

protoparser

Java parser for .proto schema declarations.
Java
209
star
84

squalor

Go SQL utility library
Go
203
star
85

p2

Platypus Platform: Tools for Scalable Deployment
Go
196
star
86

mimecraft

Utility for creating RFC-compliant multipart and form-encoded HTTP request bodies.
Java
195
star
87

Listable

Declarative list views for iOS apps.
Swift
189
star
88

git-fastclone

git clone --recursive on steroids
Ruby
185
star
89

zapp

Continuous Integration for KIF
Objective-C
179
star
90

metrics

Metrics Query Engine
Go
170
star
91

ruby-rrule

RRULE expansion for Ruby
Ruby
168
star
92

quotaservice

The purpose of a quota service is to prevent cascading failures in micro-service environments. The service acts as a traffic cop, slowing down traffic where necessary to prevent overloading services. For this to work, remote procedure calls (RPCs) between services consult the quota service before making a call. The service isnโ€™t strictly for RPCs between services, and can even be used to apply quotas to database calls, for example.
Go
154
star
93

wire-gradle-plugin

A Gradle plugin for generating Java code for your protocol buffer definitions with Wire.
Groovy
153
star
94

goprotowrap

A package-at-a-time wrapper for protoc, for generating Go protobuf code.
Go
148
star
95

beancounter

Utility to audit the balance of Hierarchical Deterministic (HD) wallets. Supports multisig + segwit wallets.
Go
143
star
96

womeng_handbook

Everything you need to start or expand a women in engineering group in your community.
129
star
97

cocoapods-check

A CocoaPods plugin that shows differences between locked and installed Pods
Ruby
126
star
98

rce-agent

gRPC-based Remote Command Execution Agent
Go
125
star
99

spincycle

Automate and expose complex infrastructure tasks to teams and services.
Go
118
star
100

in-app-payments-react-native-plugin

Objective-C
116
star