• Stars
    star
    377
  • Rank 113,535 (Top 3 %)
  • Language
    Ruby
  • License
    Other
  • Created almost 12 years ago
  • Updated over 1 year ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Extract, Transform, and Load data with Ruby

ETL

Extract, transform, and load data with ruby!

Installation

Add this line to your application's Gemfile:

gem 'ETL'

And then execute:

$ bundle

Or install it yourself as:

$ gem install ETL

ETL Dependencies

ETL depends on having a database connection object that must respond to #query. The mysql2 gem is a good option. You can also proxy another library using Ruby's SimpleDelegator and add a #query method if need be.

The gem comes bundled with a default logger. If you'd like to write your own just make sure that it implements #debug and #info. For more information on what is logged and when, view the logger details.

Basic ETL

Assume that we have a database connection represented by connection.

To run a basic ETL that is composed of sequential SQL statements, start by creating a new ETL instance:

# setting connection at the class level
ETL.connection = connection

etl = ETL.new(description: "a description of what this ETL does")

or

# setting connection at the instance level
etl = ETL.new(
  description: "a description of what this ETL does",
  connection:  connection
)

which can then be configured:

etl.config do |etl|
  etl.ensure_destination do |etl|
    # For most ETLs you may want to ensure that the destination exists, so the
    # #ensure_destination block is ideally suited to fulfill this requirement.
    #
    # By way of example:
    #
    etl.query %[
      CREATE TABLE IF NOT EXISTS some_database.some_destination_table (
          user_id INT UNSIGNED NOT NULL
        , created_date DATE NOT NULL
        , total_amount INT SIGNED NOT NULL
        , message VARCHAR(100) DEFAULT NULL
        , PRIMARY KEY (user_id, created_date)
        , KEY (created_date)
      )
    ]
  end

  etl.before_etl do |etl|
    # All pre-ETL work is performed in this block.
    #
    # This can be thought of as a before-ETL hook that will fire only once. When
    # you are not leveraging the ETL iteration capabilities, the value of this
    # block vs the #etl block is not very clear. We will see how and when to
    # leverage this block effectively when we introduce iteration.
    #
    # As an example, let's say we want to get rid of all entries that have an
    # amount less than zero before moving on to our actual etl:
    #
    etl.query %[DELETE FROM some_database.some_source_table WHERE amount < 0]
  end

  etl.etl do |etl|
    # Here is where the magic happens! This block contains the main ETL
    # operation.
    #
    # For example:
    #
    etl.query %[
      REPLACE INTO some_database.some_destination_table (
          user_id
        , created_date
        , total_amount
      ) SELECT
          user_id
        , DATE(created_at) AS created_date
        , SUM(amount) AS total_amount
      FROM
        some_database.some_source_table sst
      GROUP BY
          sst.user_id
        , DATE(sst.created_at)
    ]
  end

  etl.after_etl do |etl|
    # All post-ETL work is performed in this block.
    #
    # Again, to finish up with an example:
    #
    etl.query %[
      UPDATE some_database.some_destination_table
      SET message = "WOW"
      WHERE total_amount > 100
    ]
  end
end

At this point it is possible to run the ETL instance via:

etl.run

which executes #ensure_destination, #before_etl, #etl, and #after_etl in that order.

ETL with iteration

To add in iteration, simply supply #start, #step, and #stop blocks. This is useful when dealing with large data sets or when executing queries that, while optimized, are still slow.

Again, to kick things off:

etl = ETL.new(
  description: "a description of what this ETL does",
  connection:  connection
)

where connection is the same as described above.

Next we can configure the ETL:

# assuming we have the ETL instance from above
etl.config do |etl|
  etl.ensure_destination do |etl|
    # For most ETLs you may want to ensure that the destination exists, so the
    # #ensure_destination block is ideally suited to fulfill this requirement.
    #
    # By way of example:
    #
    etl.query %[
      CREATE TABLE IF NOT EXISTS some_database.some_destination_table (
          user_id INT UNSIGNED NOT NULL
        , created_date DATE NOT NULL
        , total_amount INT SIGNED NOT NULL
        , message VARCHAR(100) DEFAULT NULL
        , PRIMARY KEY (user_id, created_date)
        , KEY (created_date)
      )
    ]
  end

  etl.before_etl do |etl|
    # All pre-ETL work is performed in this block.
    #
    # Now that we are leveraging iteration the #before_etl block becomes
    # more useful as a way to execute an operation once before we begin
    # our iteration.
    #
    # As an example, let's say we want to get rid of all entries that have an
    # amount less than zero before moving on to our actual etl:
    #
    etl.query %[
      DELETE FROM some_database.some_source_table
      WHERE amount < 0
    ]
  end

  etl.start do |etl|
    # This defines where the ETL should start. This can be a flat number
    # or date, or even SQL / other code can be executed to produce a starting
    # value.
    #
    # Usually, this is the last known entry for the destination table with
    # some sensible default if the destination does not yet contain data.
    #
    # As an example:
    #
    # Note that we cast the default date as a DATE. If we don't, it will be
    # treated as a string and our iterator will fail under the hood when testing
    # if it is complete.
    res = etl.query %[
      SELECT COALESCE(MAX(created_date), DATE('2010-01-01')) AS the_max
      FROM some_database.some_destination_table
    ]

    res.to_a.first['the_max']
  end

  etl.step do |etl|
    # The step block defines the size of the iteration block. To iterate by
    # ten records, the step block should be set to return 10.
    #
    # As an alternative example, to set the iteration to go 10,000 units
    # at a time, the following value should be provided:
    #
    #   10_000 (Note: an underscore is used for readability)
    #
    # As an example, to iterate 7 days at a time:
    #
    7
  end

  etl.stop do |etl|
    # The stop block defines when the iteration should halt.
    # Again, this can be a flat value or code. Either way, one value *must* be
    # returned.
    #
    # As a flat value:
    #
    #   1_000_000
    #
    # Or a date value:
    #
    #   Time.now.to_date
    #
    # Or as a code example:
    #
    res = etl.query %[
      SELECT DATE(MAX(created_at)) AS the_max
      FROM some_database.some_source_table
    ]

    res.to_a.first['the_max']
  end

  etl.etl do |etl, lbound, ubound|
    # The etl block is the main part of the framework. Note: there are
    # two extra args with the iterator this time around: "lbound" and "ubound"
    #
    # "lbound" is the lower bound of the current iteration. When iterating
    # from 0 to 10 and stepping by 2, the lbound would equal 2 on the
    # second iteration.
    #
    # "ubound" is the upper bound of the current iteration. In continuing with the
    # example above, when iterating from 0 to 10 and stepping by 2, the ubound would
    # equal 4 on the second iteration.
    #
    # These args can be used to "window" SQL queries or other code operations.
    #
    # As a first example, to iterate over a set of ids:
    #
    #   etl.query %[
    #     REPLACE INTO some_database.some_destination_table (
    #         created_date
    #       , user_id
    #       , total_amount
    #     ) SELECT
    #         DATE(sst.created_at) AS created_date
    #       , sst.user_id
    #       , SUM(sst.amount) AS total_amount
    #     FROM
    #       some_database.some_source_table sst
    #     WHERE
    #       sst.user_id > #{lbound} AND sst.user_id <= #{ubound}
    #     GROUP BY
    #         DATE(sst.created_at)
    #       , sst.user_id]
    #
    # To "window" a SQL query using dates:
    #
    etl.query %[
      REPLACE INTO some_database.some_destination_table (
          created_date
        , user_id
        , total_amount
      ) SELECT
          DATE(sst.created_at) AS created_date
        , sst.user_id
        , SUM(sst.amount) AS total_amount
      FROM
        some_database.some_source_table sst
      WHERE
        -- Note the usage of quotes surrounding the lbound and ubound vars.
        -- This is is required when dealing with dates / datetimes
        sst.created_at >= '#{lbound}' AND sst.created_at < '#{ubound}'
      GROUP BY
          DATE(sst.created_at)
        , sst.user_id
    ]

    # Note that there is no sql sanitization here so there is *potential* for SQL
    # injection. That being said you'll likely be using this gem in an internal
    # tool so hopefully your co-workers are not looking to sabotage your ETL
    # pipeline. Just be aware of this and handle it as you see fit.
  end

  etl.after_etl do |etl|
    # All post-ETL work is performed in this block.
    #
    # Again, to finish up with an example:
    #
    etl.query %[
      UPDATE some_database.some_destination_table
      SET message = "WOW"
      WHERE total_amount > 100
    ]
  end
end

At this point it is possible to run the ETL instance via:

etl.run

which executes #ensure_destination, #before_etl, #etl, and #after_etl in that order.

Note that #etl executes #start and #stop once and memoizes the result for each. It then begins to iterate from what #start evaluated to up until what #stop evaluated to by what #step evaluates to.

Examples

There are two examples found in ./examples that demonstrate the basic ETL and iteration ETL. Each file uses the mysql2 gem and reads / writes data to localhost using the root user with no password. Adjust as needed.

Logger Details

A logger must support two methods: #info and #warn.

Both methods should accept a single hash argument. The argument will contain:

  • :emitter => a reference to the ETL instance's self
  • :event_type => a symbol that includes the type of event being logged. You can use this value to derive which other data you'll have available

When :event_type is equal to :query_start, you'll have the following available in the hash argument:

  • :sql => the sql that is going to be run

These events are logged at the debug level.

When :event_type is equal to :query_complete, you'll have the following available in the hash argument:

  • :sql => the sql that was run
  • :runtime => how long the query took to execute

These events are logged at the info level.

Following from this you could implement a simple logger as:

class PutsLogger
  def info data
    @data = data
    write!
  end

  def debug data
    @data = data
    write!
  end

private

  def write!
    case (event_type = @data.delete(:event_type))
    when :query_start
      output =  "#{@data[:emitter].description} is about to run\n"
      output += "#{@data[:sql]}\n"
    when :query_complete
      output =  "#{@data[:emitter].description} executed:\n"
      output += "#{@data[:sql]}\n"
      output += "query completed at #{Time.now} and took #{@data[:runtime]}s\n"
    else
      output = "no special logging for #{event_type} event_type yet\n"
    end
    puts output
    @data = nil
  end
end

Contributing

If you would like to contribute code to ETL you can do so through GitHub by forking the repository and sending a pull request.

When submitting code, please make every effort to follow existing conventions and style in order to keep the code as readable as possible.

Before your code can be accepted into the project you must also sign the Individual Contributor License Agreement (CLA).

License

Copyright 2013 Square Inc.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

More Repositories

1

okhttp

Squareโ€™s meticulous HTTP client for the JVM, Android, and GraalVM.
Kotlin
45,794
star
2

retrofit

A type-safe HTTP client for Android and the JVM
HTML
43,053
star
3

leakcanary

A memory leak detection library for Android.
Kotlin
29,383
star
4

picasso

A powerful image downloading and caching library for Android
Kotlin
18,716
star
5

javapoet

A Java API for generating .java source files.
Java
10,820
star
6

moshi

A modern JSON library for Kotlin and Java.
Kotlin
9,756
star
7

okio

A modern I/O library for Android, Java, and Kotlin Multiplatform.
Kotlin
8,790
star
8

dagger

A fast dependency injector for Android and Java.
Java
7,308
star
9

crossfilter

Fast n-dimensional filtering and grouping of records.
JavaScript
6,217
star
10

PonyDebugger

Remote network and data debugging for your native iOS app using Chrome Developer Tools
Objective-C
5,864
star
11

maximum-awesome

Config files for vim and tmux.
Ruby
5,706
star
12

otto

An enhanced Guava-based event bus with emphasis on Android support.
Java
5,163
star
13

cubism

Cubism.js: A JavaScript library for time series visualization.
JavaScript
4,943
star
14

sqlbrite

A lightweight wrapper around SQLiteOpenHelper which introduces reactive stream semantics to SQL operations.
Java
4,570
star
15

android-times-square

Standalone Android widget for picking a single date from a calendar view.
Java
4,445
star
16

wire

gRPC and protocol buffers for Android, Kotlin, Swift and Java.
Kotlin
4,244
star
17

Valet

Valet lets you securely store data in the iOS, tvOS, watchOS, or macOS Keychain without knowing a thing about how the Keychain works. Itโ€™s easy. We promise.
Swift
3,989
star
18

cube

Cube: A system for time series visualization.
JavaScript
3,904
star
19

kotlinpoet

A Kotlin API for generating .kt source files.
Kotlin
3,896
star
20

java-code-styles

IntelliJ IDEA code style settings for Square's Java and Android projects.
Shell
2,957
star
21

flow

Name UI states, navigate between them, remember where you've been.
Java
2,786
star
22

spoon

Distributing instrumentation tests to all your Androids.
HTML
2,702
star
23

keywhiz

A system for distributing and managing secrets
Java
2,619
star
24

tape

A lightning fast, transactional, file-based FIFO for Android and Java.
Java
2,466
star
25

certstrap

Tools to bootstrap CAs, certificate requests, and signed certificates.
Go
2,282
star
26

mortar

A simple library that makes it easy to pair thin views with dedicated controllers, isolated from most of the vagaries of the Activity life cycle.
Java
2,157
star
27

go-jose

An implementation of JOSE standards (JWE, JWS, JWT) in Go
1,975
star
28

Cleanse

Lightweight Swift Dependency Injection Framework
Swift
1,784
star
29

assertj-android

A set of AssertJ helpers geared toward testing Android.
Java
1,577
star
30

haha

DEPRECATED Java library to automate the analysis of Android heap dumps.
Java
1,436
star
31

phrase

Phrase is an Android string resource templating library
Java
1,404
star
32

cane

Code quality threshold checking as part of your build
Ruby
1,325
star
33

anvil

A Kotlin compiler plugin to make dependency injection with Dagger 2 easier.
Kotlin
1,310
star
34

seismic

Android device shake detection.
Java
1,275
star
35

sudo_pair

Plugin for sudo that requires another human to approve and monitor privileged sudo sessions
Rust
1,240
star
36

square.github.io

A simple, static portal which outlines our open source offerings.
CSS
1,153
star
37

spacecommander

Commit fully-formatted Objective-C as a team without even trying.
Objective-C
1,126
star
38

workflow

A Swift and Kotlin library for making composable state machines, and UIs driven by those state machines.
Shell
1,124
star
39

workflow-kotlin

A Swift and Kotlin library for making composable state machines, and UIs driven by those state machines.
Kotlin
1,027
star
40

certigo

A utility to examine and validate certificates in a variety of formats
Go
941
star
41

logcat

I CAN HAZ LOGZ?
Kotlin
895
star
42

radiography

Text-ray goggles for your Android UI.
Kotlin
852
star
43

whorlwind

Makes fingerprint encryption a breeze.
Java
817
star
44

dagger-intellij-plugin

An IntelliJ IDEA plugin for Dagger which provides insight into how injections and providers are used.
Java
796
star
45

cycler

Kotlin
791
star
46

Paralayout

Paralayout is a set of simple, useful, and straightforward utilities that enable pixel-perfect layout in iOS. Your designers will love you.
Swift
786
star
47

apropos

A simple way to serve up appropriate images for every visitor.
Ruby
764
star
48

shift

shift is an application that helps you run schema migrations on MySQL databases
Ruby
735
star
49

coordinators

Simple MVWhatever for Android
Java
702
star
50

subzero

Block's Bitcoin Cold Storage solution.
C
683
star
51

Blueprint

Declarative UI construction for iOS, written in Swift
Swift
672
star
52

shuttle

String extraction, translation and export tools for the 21st century. "Moving strings around so you don't have to"
Ruby
656
star
53

gifencoder

A pure Java library implementing the GIF89a specification. Suitable for use on Android.
Java
654
star
54

pollexor

Java client for the Thumbor image service which allows you to build URIs in an expressive fashion using a fluent API.
Java
633
star
55

intro-to-d3

a D3.js tutorial
CSS
602
star
56

kochiku

Shard your builds for fun and profit
Ruby
599
star
57

curtains

Lift the curtain on Android Windows!
Kotlin
570
star
58

svelte-store

TypeScript
524
star
59

RxIdler

An IdlingResource for Espresso which wraps an RxJava Scheduler.
Java
511
star
60

burst

A unit testing library for varying test data.
Java
464
star
61

field-kit

FieldKit lets you take control of your text fields.
JavaScript
463
star
62

SuperDelegate

SuperDelegate provides a clean application delegate interface and protects you from bugs in the application lifecycle
Swift
454
star
63

otto-intellij-plugin

An IntelliJ IDEA plugin to navigate between events posted by Otto.
Java
451
star
64

js-jose

JavaScript library to encrypt/decrypt data in JSON Web Encryption (JWE) format and to sign/verify data in JSON Web Signature (JWS) format. Leverages Browser's native WebCrypto API.
JavaScript
422
star
65

sharkey

Sharkey is a service for managing certificates for use by OpenSSH
Go
395
star
66

connect-api-examples

Code samples demonstrating the functionality of the Square Connect API
JavaScript
391
star
67

fdoc

Documentation format and verification
Ruby
380
star
68

lgtm

Simple object validation for JavaScript.
JavaScript
370
star
69

papa

PAPA: Performance of Android Production Applications
Kotlin
345
star
70

laravel-hyrule

Object-oriented, composable, fluent API for writing validations in Laravel
PHP
341
star
71

in-app-payments-flutter-plugin

Flutter Plugin for Square In-App Payments SDK
Objective-C
340
star
72

pylink

Python Library for device debugging/programming via J-Link
Python
331
star
73

workflow-swift

A Swift and Kotlin library for making composable state machines, and UIs driven by those state machines.
Swift
326
star
74

pysurvival

Open source package for Survival Analysis modeling
HTML
319
star
75

rails-auth

Modular resource-based authentication and authorization for Rails/Rack
Ruby
291
star
76

cocoapods-generate

A CocoaPods plugin that allows you to easily generate a workspace from a podspec.
Ruby
279
star
77

inspect

inspect is a collection of metrics gathering, analysis utilities for various subsystems of linux, mysql and postgres.
Go
268
star
78

Aardvark

Aardvark is a library that makes it dead simple to create actionable bug reports.
Objective-C
260
star
79

gradle-dependencies-sorter

A CLI app and Gradle plugin to sort the dependencies in your Gradle build scripts
Kotlin
253
star
80

jetpack

jet.pack: package your JRuby rack app for Jetty.
Ruby
248
star
81

luhnybin

Shell
232
star
82

auto-value-redacted

An extension for Google's AutoValue that omits redacted fields from toString().
Java
211
star
83

protoparser

Java parser for .proto schema declarations.
Java
210
star
84

squalor

Go SQL utility library
Go
205
star
85

Listable

Declarative list views for iOS apps.
Swift
200
star
86

p2

Platypus Platform: Tools for Scalable Deployment
Go
196
star
87

mimecraft

Utility for creating RFC-compliant multipart and form-encoded HTTP request bodies.
Java
195
star
88

git-fastclone

git clone --recursive on steroids
Ruby
187
star
89

zapp

Continuous Integration for KIF
Objective-C
179
star
90

metrics

Metrics Query Engine
Go
170
star
91

ruby-rrule

RRULE expansion for Ruby
Ruby
170
star
92

quotaservice

The purpose of a quota service is to prevent cascading failures in micro-service environments. The service acts as a traffic cop, slowing down traffic where necessary to prevent overloading services. For this to work, remote procedure calls (RPCs) between services consult the quota service before making a call. The service isnโ€™t strictly for RPCs between services, and can even be used to apply quotas to database calls, for example.
Go
153
star
93

wire-gradle-plugin

A Gradle plugin for generating Java code for your protocol buffer definitions with Wire.
Groovy
153
star
94

goprotowrap

A package-at-a-time wrapper for protoc, for generating Go protobuf code.
Go
148
star
95

beancounter

Utility to audit the balance of Hierarchical Deterministic (HD) wallets. Supports multisig + segwit wallets.
Go
144
star
96

rce-agent

gRPC-based Remote Command Execution Agent
Go
136
star
97

womeng_handbook

Everything you need to start or expand a women in engineering group in your community.
129
star
98

cocoapods-check

A CocoaPods plugin that shows differences between locked and installed Pods
Ruby
126
star
99

point-of-sale-android-sdk

A simple library for letting Point of Sale take in-store payments for your app using Point of Sale API.
Java
119
star
100

in-app-payments-react-native-plugin

Objective-C
119
star