  • Stars: 110
  • Rank: 316,770 (Top 7%)
  • Language: Ruby
  • License: MIT License
  • Created over 10 years ago
  • Updated about 9 years ago

Repository Details

BigBroda

GoogleBigQuery ActiveRecord Adapter & standalone API client

Use Cases:

https://developers.google.com/bigquery/what-is-bigquery

BigQuery is fantastic for running ad hoc aggregate queries across a very, very large dataset: large web logs, ad analysis, sensor data, sales data, etc. Basically, many kinds of "full table scan" queries. Queries are written in a SQL-style language (you don't have to write custom MapReduce functions).

But! BigQuery has a constraint to consider before diving in: BQ is append only, which means you can't update or delete records.

So, use BigQuery as an OLAP (Online Analytical Processing) service, not as OLTP (Online Transactional Processing). In other words, use BigQuery as a data warehouse.

Installation

Add 'bigbroda' to your application's Gemfile or install it yourself as:

$ gem install bigbroda
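
For the Gemfile route mentioned above, add this line and run bundle install:

gem 'bigbroda'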

Rails / ActiveRecord:

This gem supports ActiveRecord 4.0 / 4.1.

Support for 4.2 is on the way!

Configure GoogleBigQuery:

rails g bigbroda:install

Or generate a file in config/initializers/bigquery.rb with the following contents:

BigBroda::Config.setup do |config|
  config.pass_phrase = ["pass_phrase"]
  config.key_file    = ["key_file"]
  config.scope       = ["scope"]
  config.email       = ["email"]
  config.retries     = [retries]
end

retries indicates the number of times to retry on recoverable errors (no retries if set to one or not present)

Active Record Adapter

Connection

ActiveRecord connection in plain ruby:

ActiveRecord::Base.establish_connection(
  :adapter => 'bigquery',
  :project => "MyBigQueryProject",
  :database => "MyBigTable"
)

In a Rails app you can set the :adapter, :project and :database options in your database.yml, or call establish_bq_connection with the name of a bigquery connection in specific models.

development:
  adapter: sqlite3
  database: db/development.sqlite3
  pool: 5

bigquery:
  database: "dummy_dev"
  adapter: 'bigquery'
  project: 123456
  #database: "dummy_test"

By default, if you set the development/production/test DB configuration to a bigquery connection, all models are BigQuery models, and migrations and rake db: operations use the BigQuery migration system.
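
For example, a development block that points every model at BigQuery would look like this (a sketch; the project id and dataset name are placeholders):

development:
  adapter: 'bigquery'
  project: 123456
  database: "dummy_dev"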

If you don't want to make all your models BigQuery models, you can set up specific BQ ActiveRecord models this way:

class UserLog < ActiveRecord::Base
  establish_bq_connection "bigquery"
end

Then you will have to execute the migrations programmatically, like this:

UserMigration.up

or

AddPublishedToUser.change

Querying

The GoogleBigQuery Adapter brings some of the ActiveRecord niceties out of the box:

User.all
User.first, User.last
User.count
User.find_by(name: "")
User.select("name")
User.select("name").where("name contains ?", "frank")
User.select("name, id").where("name contains ?", "frank").count
User.where("id =? and name= ?", "some-id-1393025921", "Frank")
User.where.not("admin = ?", false)
User.admins.joins(:posts)

Note about Joins:

BigQuery supports two types of JOIN operations:

  • JOIN requires that the right-side table contains less than 8 MB of compressed data.
  • JOIN EACH allows join queries for tables of any size.

BigQuery supports INNER and LEFT OUTER joins. The default is INNER.

see more at: https://developers.google.com/bigquery/query-reference#joins
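
For a right-side table larger than 8 MB you can fall back to a raw query and use JOIN EACH. A minimal sketch using ActiveRecord's find_by_sql (the dummy_dev dataset and the posts table are assumptions, as is the adapter passing legacy SQL through unchanged):

# hypothetical join across large tables; dataset and table names are placeholders
User.find_by_sql(
  "SELECT u.name, p.title " \
  "FROM [dummy_dev.users] u " \
  "JOIN EACH [dummy_dev.posts] p ON u.id = p.user_id"
)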

Creation:

User.create(name: "frank capra")
@user = User.new
@user.name = "Frank"
@user.save
User.create([{name: "miki"}, {name: "jara"}])

NOTE: by default the adapter sets id values to a SecureRandom.hex string, and for now all foreign keys are created as STRING type.
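
For example (a sketch; the actual hex value will differ on every insert):

@user = User.create(name: "frank capra")
@user.id  # => e.g. "bd4c24e02e5a..." a SecureRandom.hex string, not an auto-incremented integer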

Deletion and editing of single rows:

BigQuery tables are append-only. The query language does not currently support either updating or deleting data. In order to update or delete data, you must delete the table, then recreate the table with new data. Alternatively, you could write a query that modifies the data and specify a new results table.

I would actually recommend creating a new table for each day. Since BigQuery charges by the amount of data queried over, this would be most economical, rather than having to query over entire massive datasets every time.
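
With one table per day you only pay to scan the days you ask for; in BigQuery's legacy SQL a comma between tables in the FROM clause unions them. A sketch using the standalone Jobs client described below (the daily logs_YYYYMMDD tables and the dataset name are assumptions):

BigBroda::Jobs.query(@project, {"query"=>
  "SELECT COUNT(*) FROM [dummy_dev.logs_20140101], [dummy_dev.logs_20140102]"})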

Massive Export / Import of data

Google BigQuery allows you to import and export large datasets; the default formats are JSON and CSV. Currently the adapter is only able to export in JSON format.

Export

The export can be accomplished very easily from an ActiveRecord model, like this:

User.bigquery_export(destination)

where destination should be a valid Google Cloud Storage URI. The adapter will manage that, so you only need to pass the file name. Example:

User.bigquery_export("file.json")

The adapter will convert that option to gs://[configured_database]/[file.json]. Just be sure to create the bucket properly in the Cloud Storage panel. Also, if you don't pass the file argument, you will get a generated URI like gs://[configured_database]/[table_name].json.

Import

There are two ways to import massive data into BigQuery: from a file in Google Cloud Storage, or from a multipart POST.

From Google Cloud Storage:

User.bigquery_import([an_array_with_paths_to_gs_uris])
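
For example (a sketch; the bucket and file name are assumptions):

User.bigquery_import(["gs://dummy_dev/users_dump.json"])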

From multipart/related post:

PENDING

Migrations:

This adapter has migration support built in:

class CreateUsers < ActiveRecord::Migration
  def self.up
    create_table :users do |t|
      t.string :name
      t.record :nested_data
      t.references :taggable, :polymorphic => true
      t.boolean :admin
      t.timestamps
    end
  end

  def self.down
    drop_table :users
  end
end

class AddPublishedToUser < ActiveRecord::Migration
  def change
    add_column :users, :published, :boolean, default: true
  end
end

Note:

  • BigQuery does not provide a way to update or delete columns, so update_column or remove_column migrations are cancelled with a caught exception.
  • Also, the schema_migrations table is not created in the DB; it is created as a JSON file in db/schema_migrations.json instead. Be sure to add that file to your version control.

Standalone Client:

Configuration setup:

https://developers.google.com/bigquery/docs/authorization

Configure BigBroda client:

BigBroda::Config.setup do |config|
  config.pass_phrase = "notasecret"
  config.key_file    = "/location/to_your/key_file.p12"
  config.scope       = "https://www.googleapis.com/auth/bigquery"
  config.email       = "[email protected]"
  config.retries     = 1
end

retries indicates the number of times to retry on recoverable errors (no retries if set to one or not present)

And authorize client:

@auth = BigBroda::Auth.new
@auth.authorize

Then you are ready to go!

Projects

https://developers.google.com/bigquery/docs/reference/v2/projects

BigBroda::Project.list["projects"]

Jobs

https://developers.google.com/bigquery/docs/reference/v2/jobs

Exporting data into multiple files

BigQuery can export up to 1 GB of data per file. If you plan to export more than 1 GB, you can use a wildcard character to instruct BigQuery to export to multiple files.

Note: it may take a while.

  BigBroda::Jobs.export(project_id, dataset_id, table_id, bucket_location)
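
For example, a destination containing a wildcard splits the export across multiple files (a sketch; the bucket name is an assumption, as is the last argument being passed through as the destination URI):

BigBroda::Jobs.export(@project, @dataset_id, @table_name, "gs://my_bucket/users_export_*.json")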

Query

BigBroda::Jobs.query(@project, {"query"=> "SELECT * FROM [#{@dataset_id}.#{@table_name}] LIMIT 1000" })

Datasets

https://developers.google.com/bigquery/docs/reference/v2/datasets

List:

BigBroda::Dataset.list(@project_id)

Create/Insert:

BigBroda::Dataset.create(@project, {"datasetReference"=> { "datasetId" => @dataset_id }} )

Delete:

BigBroda::Dataset.delete(@project, @dataset_id)

Update/Patch:

Updates information in an existing dataset. The update method replaces the entire dataset resource, whereas the patch method only replaces fields that are provided in the submitted dataset resource.

BigBroda::Dataset.update(@project, @dataset_id,
      {"datasetReference"=> {
       "datasetId" =>@dataset_id },
      "description"=> "foobar"} )

The patch method only replaces fields that are provided in the submitted dataset resource; this method supports patch semantics.

BigBroda::Dataset.patch(@project, @dataset_id,
        {"datasetReference"=> {
         "datasetId" =>@dataset_id },
        "description"=> "foobar"} )

Tables

https://developers.google.com/bigquery/docs/reference/v2/tables

Create:

@table_body = { "tableReference"=> {
                    "projectId"=> @project,
                    "datasetId"=> @dataset_id,
                    "tableId"=> @table_name},
                "schema"=> { "fields"=> [
                      {:name=> "name", :type=> "string", :mode => "REQUIRED"},
                      {:name=> "age", :type=> "integer"},
                      {:name=> "weight", :type=> "float"},
                      {:name=> "is_magic", :type=> "boolean"}
                   ]}
              }

BigBroda::Table.create(@project, @dataset_id, @table_body)

Update:

    BigBroda::Table.update(@project, @dataset_id, @table_name,
        {"tableReference"=> {
         "projectId" => @project, "datasetId" =>@dataset_id, "tableId"  => @table_name },
        "description"=> "foobar"} )

Delete:

BigBroda::Table.delete(@project, @dataset_id, @table_name )

List:

    BigBroda::Table.list(@project, @dataset_id )

Table Data

https://developers.google.com/bigquery/docs/reference/v2/tabledata

InsertAll

Streaming data into BigQuery is free for an introductory period until January 1st, 2014. After that it will be billed at a flat rate of 1 cent per 10,000 rows inserted. The traditional jobs().insert() method will continue to be free. When choosing which import method to use, check for the one that best matches your use case. Keep using the jobs().insert() endpoint for bulk and big data loading. Switch to the new tabledata().insertAll() endpoint if your use case calls for a constantly and instantly updating stream of data.

@rows =   {"rows"=> [
                      {
                        "insertId"=> Time.now.to_i.to_s,
                        "json"=> {
                          "name"=> "User #{Time.now.to_s}"
                        }
                      }
                    ]}

BigBroda::TableData.create(@project, @dataset_id, @table_name, @rows)

List

BigBroda::TableData.list(@project, @dataset_id, @table_name)

Testing:

Install deps

appraisal install

Run rspec suite for versions:

appraisal rails-3 rake spec
appraisal rails-4.0.3 rake spec
appraisal rails-4.1 rake spec
appraisal rails-4.2 rake spec

RESOURCES:

https://developers.google.com/bigquery/loading-data-into-bigquery

https://developers.google.com/bigquery/streaming-data-into-bigquery

Api Explorer:

https://developers.google.com/apis-explorer/#s/bigquery/v2/

Google BigQuery developer guide

https://developers.google.com/bigquery/docs/developers_guide

Caveats:

Contributing

  1. Fork it
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create new Pull Request

TODO:

ActiveRecord:

  • AR migration copy tables to update it (copy to gs://, delete table, import table from gs://)
  • AR migrate BQ record type
  • Make id and foreign keys types and values configurable
  • Jobs make multipart/related upload

More Repositories

  1. Dante - Just another Medium wysiwyg editor clone (JavaScript, 2,004 stars)
  2. lazy_high_charts - Make highcharts a la ruby, works in rails 5.X / 4.X / 3.X, and other ruby web frameworks (Ruby, 1,048 stars)
  3. dante2 - A complete rewrite of dante editor in draft-js (JavaScript, 910 stars)
  4. espinita - Audit activerecord models like a boss (and works with rails 4!) (Ruby, 220 stars)
  5. chaskiq-newsletters - Newsletter Rails engine (Ruby, 127 stars)
  6. rails-ui - A collection of components based on Ruby ViewComponent & Stimulus, heavily inspired by shadcn/ui (HTML, 27 stars)
  7. lazy_google_analytics - google analytics api client for the lazy (Ruby, 22 stars)
  8. autocontext - An Elixir Ecto utility library that provides ActiveRecord-like callbacks (before_save, after_save, create, update, and delete functions), simplifying the management of database operations (Elixir, 12 stars)
  9. detached-carrot - A port of SimplifiedStarling plugin for push-pop active_record tasks with RabbitMQ (Ruby, 11 stars)
  10. acts_as_uploaded - a simple rails plugin for upload_nginx module and amazon s3 service, inspired in attachment_fu (work in progress but in production in artenlinea.com) (Ruby, 8 stars)
  11. osc-puredata-ruby (JavaScript, 5 stars)
  12. dante2-tests - dante tests using es6 modules (JavaScript, 3 stars)
  13. apidone-client - api done client made in ruby (coding) dojo (Ruby, 3 stars)
  14. ws-celluloid (JavaScript, 3 stars)
  15. discobolo - ruby worker system for disque (Ruby, 3 stars)
  16. needish-gem - this gem provides basic methods to access needish.com api, profile, friends, needs and helps (Ruby, 3 stars)
  17. faye-tests (Ruby, 2 stars)
  18. go_4_live - Go and ableton live OSC experiment (Go, 2 stars)
  19. componium-beat - a set of gui abstractions for PD (Pure Data, 2 stars)
  20. sinatra-oa-consumer - sinatra-oauth-consumer (1 star)
  21. LoadTester - This is a simple package to benchmark http requests; it shows how many requests per second a http server is capable of serving (Go, 1 star)
  22. code-test (Ruby, 1 star)
  23. mongo_mail_store - Rails plugin to catch outgoing emails and store them in MongoDB (Ruby, 1 star)
  24. nginx-chef - clone of nginx-chef v 2.7.5 (Ruby, 1 star)