• Stars
    star
    655
  • Rank 66,171 (Top 2 %)
  • Language
    Ruby
  • License
    MIT License
  • Created almost 12 years ago
  • Updated about 3 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Upsert on MySQL, PostgreSQL, and SQLite3. Transparently creates functions (UDF) for MySQL and PostgreSQL; on SQLite3, uses INSERT OR IGNORE.

Upsert

Build Status

Make it easy to upsert on traditional RDBMS like MySQL, PostgreSQL, and SQLite3β€”hey look NoSQL!. Transparently creates (and re-uses) stored procedures/functions when necessary.

You pass it a bare-metal connection to the database like Mysql2::Client (from mysql2 gem on MRI) or Java::OrgPostgresqlJdbc4::Jdbc4Connection (from jdbc-postgres on Jruby).

As databases start to natively support SQL MERGE (which is basically upsert), this library will take advantage (but you won't have to change your code).

Does not depend on ActiveRecord.

Does not use INSERT ON DUPLICATE KEY UPDATE on MySQL as this only works if you are very careful about creating unique indexes.

70–90%+ faster than emulating upsert with ActiveRecord.

Supports MRI and JRuby.

Usage

You pass a selector that uniquely identifies a row, whether it exists or not. You also pass a setter, attributes that should be set on that row.

Syntax inspired by mongo-ruby-driver's update method.

Basic

connection = Mysql2::Client.new([...])
table_name = :pets
upsert = Upsert.new connection, table_name
# N times...
upsert.row({:name => 'Jerry'}, :breed => 'beagle', :created_at => Time.now)

The created_at and created_on columns are used for inserts, but ignored on updates.

So just to reiterate you've got a selector and a setter:

selector = { :name => 'Jerry' }
setter = { :breed => 'beagle' }
upsert.row(selector, setter)

Batch mode

By organizing your upserts into a batch, we can do work behind the scenes to make them faster.

connection = Mysql2::Client.new([...])
Upsert.batch(connection, :pets) do |upsert|
  # N times...
  upsert.row({:name => 'Jerry'}, :breed => 'beagle')
  upsert.row({:name => 'Pierre'}, :breed => 'tabby')
end

Batch mode is tested to be about 80% faster on PostgreSQL, MySQL, and SQLite3 than other ways to emulate upsert (see the tests, which fail if they are not faster).

Native Postgres upsert

INSERT ... ON CONFLICT DO UPDATE is used when Postgres 9.5+ is detected and unique constraint are in place.

**Note: ** You must have a unique constraint on the column(s) you're using as a selector. A unique index won't work. See #98 (comment) for more information and some ways to check.

If you don't have unique constraints, it will fall back to the classic Upsert gem user-defined function, which does not require a constraint.

ActiveRecord helper method

require 'upsert/active_record_upsert'
# N times...
Pet.upsert({:name => 'Jerry'}, :breed => 'beagle')

Wishlist

Pull requests for any of these would be greatly appreciated:

  1. Cache JDBC PreparedStatement objects.
  2. Sanity check my three benchmarks (four if you include activerecord-import on MySQL). Do they accurately represent optimized alternatives?
  3. Provide require 'upsert/debug' that will make sure you are selecting on columns that have unique indexes
  4. Test that Upsert instances accept arbitrary columns, even within a batch, which is what people probably expect.
  5. @antage's idea for "true" upserting: (from #17)
selector = { id: 15 }
update_setter = { count: Upsert.sql('count + 1') }
insert_setter = { count: 1 }
upsert.row_with_two_setter(update_setter, insert_setter, selector)

Real-world usage

Faraday logo

We use upsert for big data at Faraday. Originally written to speed up the data_miner data mining library.

Supported databases/drivers

* MySQL PostgreSQL SQLite3
MRI mysql2 pg sqlite3
JRuby jdbc-mysql jdbc-postgres jdbc-sqlite3

See below for details about what SQL MERGE trick (emulation of upsert) is used, performance, code examples, etc.

Rails / ActiveRecord

(Assuming that one of the other three supported drivers is being used under the covers).

  • add "upsert" to your Gemfile and
  • run bundle install
Upsert.new Pet.connection, Pet.table_name

Speed

Depends on the driver being used!

SQL MERGE trick

Depends on the driver being used!

MySQL

On MRI, use the mysql2 driver.

require 'mysql2'
connection = Mysql2::Connection.new(:username => 'root', :password => 'password', :database => 'upsert_test')
table_name = :pets
upsert = Upsert.new(connection, table_name)

On JRuby, use the jdbc-mysql driver.

require 'jdbc/mysql'
java.sql.DriverManager.register_driver com.mysql.jdbc.Driver.new
connection = java.sql.DriverManager.get_connection "jdbc:mysql://127.0.0.1/mydatabase?user=root&password=password"

Speed

From the tests (updated 11/7/12):

Upsert was 82% faster than find + new/set/save
Upsert was 85% faster than find_or_create + update_attributes
Upsert was 90% faster than create + rescue/find/update
Upsert was 46% faster than faking upserts with activerecord-import (note: in question as of 3/13/15, need some expert advice)

SQL MERGE trick

Thanks to Dennis Hennen's StackOverflow response!!

CREATE PROCEDURE upsert_pets_SEL_name_A_tag_number_SET_name_A_tag_number(`name_sel` varchar(255), `tag_number_sel` int(11), `name_set` varchar(255), `tag_number_set` int(11))
BEGIN
  DECLARE done BOOLEAN;
  REPEAT
    BEGIN
      -- If there is a unique key constraint error then
      -- someone made a concurrent insert. Reset the sentinel
      -- and try again.
      DECLARE ER_DUP_UNIQUE CONDITION FOR 23000;
      DECLARE ER_INTEG CONDITION FOR 1062;
      DECLARE CONTINUE HANDLER FOR ER_DUP_UNIQUE BEGIN
        SET done = FALSE;
      END;

      DECLARE CONTINUE HANDLER FOR ER_INTEG BEGIN
        SET done = TRUE;
      END;

      SET done = TRUE;
      SELECT COUNT(*) INTO @count FROM `pets` WHERE `name` = `name_sel` AND `tag_number` = `tag_number_sel`;
      -- Race condition here. If a concurrent INSERT is made after
      -- the SELECT but before the INSERT below we'll get a duplicate
      -- key error. But the handler above will take care of that.
      IF @count > 0 THEN
        -- UPDATE table_name SET b = b_SET WHERE a = a_SEL;
        UPDATE `pets` SET `name` = `name_set`, `tag_number` = `tag_number_set` WHERE `name` = `name_sel` AND `tag_number` = `tag_number_sel`;
      ELSE
        -- INSERT INTO table_name (a, b) VALUES (k, data);
        INSERT INTO `pets` (`name`, `tag_number`) VALUES (`name_set`, `tag_number_set`);
      END IF;
    END;
  UNTIL done END REPEAT;
END

PostgreSQL

On MRI, use the pg driver.

require 'pg'
connection = PG.connect(:dbname => 'upsert_test')
table_name = :pets
upsert = Upsert.new(connection, table_name)

On JRuby, use the jdbc-postgres driver.

require 'jdbc/postgres'
java.sql.DriverManager.register_driver org.postgresql.Driver.new
connection = java.sql.DriverManager.get_connection "jdbc:postgresql://127.0.0.1/mydatabase?user=root&password=password"

If you want to use HStore, make the pg-hstore gem available and pass a Hash in setters:

gem 'pg-hstore'
require 'pg_hstore'
upsert.row({:name => 'Bill'}, :mydata => {:a => 1, :b => 2})

PostgreSQL notes

  • Upsert doesn't do any type casting, so if you attempt to do something like the following: upsert.row({ :name => 'A Name' }, :tag_number => 'bob') you'll get an error which reads something like: invalid input syntax for integer: "bob"

Speed

From the tests (updated 9/21/12):

Upsert was 72% faster than find + new/set/save
Upsert was 79% faster than find_or_create + update_attributes
Upsert was 83% faster than create + rescue/find/update
# (can't compare to activerecord-import because you can't fake it on pg)

SQL MERGE trick

Adapted from the canonical PostgreSQL upsert example:

CREATE OR REPLACE FUNCTION upsert_pets_SEL_name_A_tag_number_SET_name_A_tag_number("name_sel" character varying(255), "tag_number_sel" integer, "name_set" character varying(255), "tag_number_set" integer) RETURNS VOID AS
$$
DECLARE
  first_try INTEGER := 1;
BEGIN
  LOOP
    -- first try to update the key
    UPDATE "pets" SET "name" = "name_set", "tag_number" = "tag_number_set"
      WHERE "name" = "name_sel" AND "tag_number" = "tag_number_sel";
    IF found THEN
      RETURN;
    END IF;
    -- not there, so try to insert the key
    -- if someone else inserts the same key concurrently,
    -- we could get a unique-key failure
    BEGIN
      INSERT INTO "pets"("name", "tag_number") VALUES ("name_set", "tag_number_set");
      RETURN;
    EXCEPTION WHEN unique_violation THEN
      -- seamusabshere 9/20/12 only retry once
      IF (first_try = 1) THEN
        first_try := 0;
      ELSE
        RETURN;
      END IF;
      -- Do nothing, and loop to try the UPDATE again.
    END;
  END LOOP;
END;
$$
LANGUAGE plpgsql;

I slightly modified it so that it only retries once - don't want infinite loops.

Sqlite

On MRI, use the sqlite3 driver.

require 'sqlite3'
connection = SQLite3::Database.open(':memory:')
table_name = :pets
upsert = Upsert.new(connection, table_name)

On JRuby, use the jdbc-sqlite3 driver.

# TODO somebody please verify
require 'jdbc/sqlite3'
java.sql.DriverManager.register_driver org.sqlite.Driver.new
connection = java.sql.DriverManager.get_connection "jdbc:sqlite://127.0.0.1/mydatabase?user=root&password=password"

Speed

From the tests (updated 9/21/12):

Upsert was 77% faster than find + new/set/save
Upsert was 80% faster than find_or_create + update_attributes
Upsert was 85% faster than create + rescue/find/update
# (can't compare to activerecord-import because you can't fake it on sqlite3)

SQL MERGE trick

Thanks to @dan04's answer on StackOverflow:

Please note! This will only work properly on Sqlite if one of the columns being used as the "selector" are a primary key or unique index

INSERT OR IGNORE INTO visits VALUES (127.0.0.1, 1);
UPDATE visits SET visits = 1 WHERE ip LIKE 127.0.0.1;

Features

Tested to be fast and portable

In addition to correctness, the library's tests check that it is

  1. Faster than comparable upsert techniques
  2. Compatible with supported databases

Not dependent on ActiveRecord

As below, all you need is a raw database connection like a Mysql2::Connection, PG::Connection or a SQLite3::Database. These are equivalent:

# with activerecord
Upsert.new ActiveRecord::Base.connection, :pets
# with activerecord, prettier
Upsert.new Pet.connection, Pet.table_name
# without activerecord
Upsert.new Mysql2::Connection.new([...]), :pets

For a specific use case, faster and more portable than activerecord-import

You could also use activerecord-import to upsert:

Pet.import columns, all_values, :timestamps => false, :on_duplicate_key_update => columns

activerecord-import, however, only works on MySQL and requires ActiveRecordβ€”and if all you are doing is upserts, upsert is tested to be 40% faster. And you don't have to put all of the rows to be upserted into a single huge array - you can batch them using Upsert.batch.

Gotchas

No automatic typecasting beyond what the adapter/driver provides

We don't have any logic to convert integers into strings, strings into integers, etc. in order to satisfy PostgreSQL/etc.'s strictness on this issue.

So if you try to upsert a blank string ('') into an integer field in PostgreSQL, you will get an error.

Dates and times are converted to UTC

Datetimes are immediately converted to UTC and sent to the database as ISO8601 strings.

If you're using MySQL, make sure server/connection timezone is UTC. If you're using Rails and/or ActiveRecord, you might want to check ActiveRecord::Base.default_timezone... it should probably be :utc.

In general, run some upserts and make sure datetimes get persisted like you expect.

Clearning all library-generated functions

Place the following in to a rake task (so you don't globally redefine the NAME_PREFIX constant)

Upsert::MergeFunction::NAME_PREFIX = "upsert"

# ActiveRecord
Upsert.clear_database_functions(ActiveRecord::Base.connection)

# Sequel
DB.synchronize do |conn|
  Upsert.clear_database_functions(conn)
end

Doesn't work with transactional fixtures

Per #23 you might have issues if you try to use transactional fixtures and this library.

Testmetrics - https://www.testmetrics.app/seamusabshere/upsert

Copyright

Copyright 2013-2019 Seamus Abshere Copyright 2017-2019 Philip Schalm Portions Copyright (c) 2019 The JRuby Team

More Repositories

1

fuzzy_match

Find a needle (a document or record) in a haystack using string similarity and (optionally) regular expression rules. Uses Dice's Coefficient (aka Pair Similiarity) and Levenshtein Distance internally.
Ruby
667
star
2

data_miner

Download, unpack from a ZIP/TAR/GZ/BZ2 archive, parse, correct, convert units and import Google Spreadsheets, XLS, ODS, XML, CSV, HTML, etc. into your ActiveRecord models. Uses RemoteTable gem internally.
Ruby
301
star
3

remote_table

Open local or remote XLSX, XLS, ODS, CSV (comma separated), TSV (tab separated), other delimited, fixed-width files, and Google Docs. Returns an enumerator of Arrays or Hashes, depending on whether there are headers.
HTML
226
star
4

unix_utils

Like FileUtils, but provides zip, unzip, bzip2, bunzip2, tar, untar, sed, du, md5sum, shasum, cut, head, tail, wc, unix2dos, dos2unix, iconv, curl, perl, etc.
Ruby
226
star
5

cache_method

Cache based on arguments AND object state; store in memcached, redis, or in-process. Like alias_method, but it's cache_method! One step beyond memoization.
Ruby
135
star
6

lock_and_cache

Most caching libraries don't do locking, meaning that >1 process can be calculating a cached value at the same time. Since you presumably cache things because they cost CPU, database reads, or money, doesn't it make sense to lock while caching?
Ruby
134
star
7

mysql2xxxx

Gives you binaries like mysql2csv, mysql2json, and mysql2xml, and Ruby classes to match.
Ruby
84
star
8

cache

Defines a simple interface to multiple cache-like storage engines by wrapping common Ruby client libraries like memcached, redis, memcache-client, dalli. Handles each underlying library's weirdnesses, including forking.
Ruby
69
star
9

eat

A (better?) replacement for open-uri. Lets you open local and remote files by immediately returning their contents as a string.
Ruby
32
star
10

to_regexp

Provides String#to_regexp
Ruby
27
star
11

errata

Define an errata in table format (CSV) and then apply it to an arbitrary source. Inspired by RFC Errata, lets you keep your own errata in a transparent way.
Ruby
21
star
12

cacheable

DEPRECATED. Use cache_method instead.
Ruby
20
star
13

py-upsert

Python library to make it easy to upsert on MySQL, PostgreSQL, and SQLite3.
Python
18
star
14

report

DSL for creating clean CSV, XLSX, and PDF reports in Ruby. Uses xlsx_writer, prawn and pdftk internally.
Ruby
16
star
15

database_url

Convert back and forth between Heroku-style ENV['DATABASE_URL'] and Rails/ActiveRecord-style config/database.yml hashes.
Ruby
16
star
16

lock_method

Like alias_method, but it's lock_method! (lockfiles)
Ruby
12
star
17

common_name

Helps you stop using chains of humanize/downcase/underscore/pluralize/to_sym/etc everywhere in your models, your views, your controllers, etc.
Ruby
11
star
18

engineyard-metadata

Presents a simple, unchanging interface to get metadata about your EngineYard AppCloud instances running on Amazon EC2.
Ruby
10
star
19

cohort_analysis

TBD
Ruby
10
star
20

create_table

Analyze and inspect CREATE TABLE SQL statements and translate across databases. Uses Ragel internally for parsing.
Ruby
10
star
21

ruby_ragel_examples

Examples of using ragel and ruby together
Ruby
9
star
22

fuzzy_infer

Fuzzy set analysis - predicts one or more unknown characteristics of an input case by comparing its known characteristics to a reference dataset whose records contain both the known and unknown characteristics.
Ruby
8
star
23

the_geom_geojson

For PostGIS/PostgreSQL and ActiveRecord, provides "the_geom_geojson" getter and setter that update "the_geom" and "the_geom_webmercator" columns.
Ruby
8
star
24

validates_decency_of

Rails plugin that uses George Carlin's list of seven dirty words (aka swear words, aka cuss words, aka bad words) to check for "decency" on ActiveRecord model attributes.
Ruby
6
star
25

weighted_average

Aircraft.average(:seats) versus Aircraft.weighted_average(:seats, :weighted_by => :takeoffs)
Ruby
6
star
26

loose_tight_dictionary

DEPRECATED: use fuzzy_match. Find a needle in a haystack using string similarity and (optionally) regexp rules.
Ruby
6
star
27

hash_digest

Generates non-cryptographic digests of Hashes (and Arrays) indifferent to key type (string or symbol) and ordering.
Ruby
5
star
28

redirect_routing

Ruby
5
star
29

pg_trgm

Ruby trigram similarity that is identical to Postgres's (almost)
Ruby
5
star
30

ey_cloud_awareness

DEPRECATED: use engineyard-metadata. Make your EngineYard cloud instances aware of each other.
Ruby
4
star
31

xml_split

Split XML files on an element, yielding (streaming, so constant memory usage) each node in turn. Uses sgrep2 internally; future versions should use a pure-Ruby SAX parser.
Ruby
3
star
32

characterizable

DEPRECATED. Use charisma instead.
Ruby
3
star
33

has_handle_fallback

Make it easy to use handles (callsigns/monikers/usernames) in URLs, even if they might be blank.
Ruby
3
star
34

table_warnings

Warn yourself of problems with your ActiveRecord tables.
Ruby
3
star
35

to_json_fix

TODO: one-line summary of your gem
Ruby
3
star
36

honeypot

TODO: one-line summary of your gem
Ruby
3
star
37

cohort_scope

DEPRECATED. Use cohort_analysis. Provides cohorts (in the form of ActiveRecord scopes) that dynamically widen until they contain a certain number of records.
Ruby
3
star
38

vector_embed

Vector embedding of strings, booleans, numerics, and arrays into LIBSVM / LIBLINEAR format.
Ruby
3
star
39

switches

Turn on and off parts of your code based on yaml files.
Ruby
3
star
40

flights1percent

1% flights
JavaScript
2
star
41

zip5

Convert United States zip codes to their correct Zip5 representation, even if they're missing a leading zero and/or they have the +4 suffix.
Ruby
2
star
42

zmq

Drop-in replacement for zmq gem with included binaries
Ruby
2
star
43

cvg

Like jq or grep for csv. Combine one or more CSVs while filtering on fields with regular expressions, whitelists, presence, missing, etc.
Ruby
2
star
44

json_to_csv_to_json

csv_to_json and json_to_csv
Ruby
2
star
45

has_timestamps

Rails plugin to add named timestamps to ActiveRecord models.
Ruby
2
star
46

string_enumerator

Given a string containing placeholders (like [color]), enumerate all of the possible strings resulting from filling those placeholders with replacements (like red, blue).
Ruby
2
star
47

nonrandomapp

Ruby
1
star
48

mini_record

mini_record-compat is DEPRECATED. Use original mini_record OR active_record_inline_schema instead.
Ruby
1
star
49

string_replacer

DEPRECATED/POINTLESS - use sed or augeas. Replace text in a file without disturbing the rest of the file.
Ruby
1
star
50

geocode_records

As long as you do very specific things... quickly re-geocode tables.
Ruby
1
star
51

force_schema

[DEPRECATED - use mini_record] Declare a table structure like an ActiveRecord migration and run 'force_schema!' whenever you want. For when you don't need up and down migrations.
Ruby
1
star
52

fast_timestamp

Rapidly and arbitrarily timestamp ActiveRecord records.
Ruby
1
star