CouchImport

CouchDB import tool to allow data to be bulk inserted.

Introduction

When populating CouchDB databases, the source of the data is often a CSV or TSV file. couchimport is designed to help you import flat data into CouchDB efficiently. It can be used either as the command-line utilities couchimport and couchexport, or programmatically via its underlying functions:

  • simply pipe the data file to couchimport on the command line.
  • handles tab or comma-separated data.
  • uses Node.js's streams for memory efficiency.
  • plug in a custom function to add your own changes before the data is written.
  • writes the data in bulk for speed.
  • can also read huge JSON files using a streaming JSON parser.
  • allows multiple HTTP writes to happen at once using the --parallelism option.


Installation

Requirements

  • Node.js and npm

Install globally with npm:

  sudo npm install -g couchimport
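
To check the installation succeeded, print the tool's version:

  couchimport --version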

Configuration

couchimport's configuration parameters can be stored in environment variables or supplied as command line arguments.

The location of CouchDB

Simply set the COUCH_URL environment variable e.g. for a hosted Cloudant database:

  export COUCH_URL="https://myusername:mypassword@myhost.cloudant.com"

or a local CouchDB installation:

  export COUCH_URL="http://localhost:5984"

IAM Authentication

Alternatively, if you are using IAM authentication with IBM Cloudant, then supply two environment variables:

  • COUCH_URL - the URL of your Cloudant host e.g. https://myhost.cloudant.com (note absence of username and password in URL).
  • IAM_API_KEY - the IAM API KEY e.g. ABC123515-151215.
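
For example:

  export COUCH_URL="https://myhost.cloudant.com"
  export IAM_API_KEY="ABC123515-151215"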

The name of the database - default "test"

Define the name of the CouchDB database to write to by setting the COUCH_DATABASE environment variable e.g.

  export COUCH_DATABASE="mydatabase"

Transformation function - default nothing

Define the path of a file containing a transformation function e.g.

  export COUCH_TRANSFORM="/home/myuser/transform.js"

The file should:

  • be a JavaScript file
  • export one function that takes a single doc and returns a single object or an array of objects if you need to split a row into multiple docs.

(see examples directory).
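
For example, a minimal transform.js might look like this (a sketch: the price field is purely illustrative, and returning an array instead would split the row into multiple documents):

    // transform.js
    // Receives one parsed row as an object and returns the doc to be written,
    // or an array of docs to split the row into multiple documents.
    module.exports = function(doc) {
      // hypothetical field: coerce a CSV string value to a number
      doc.price = parseFloat(doc.price);
      return doc;
    };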

Delimiter - default "\t"

Define the column delimiter in the input data e.g.

  export COUCH_DELIMITER=","

Running

Simply pipe the text data into "couchimport":

  cat ~/test.tsv | couchimport

This example downloads public crime data, unzips and imports it:

  curl 'http://data.octo.dc.gov/feeds/crime_incidents/archive/crime_incidents_2013_CSV.zip' > crime.zip
  unzip crime.zip
  export COUCH_DATABASE="crime_2013"
  export COUCH_DELIMITER=","
  ccurl -X PUT /crime_2013
  cat crime_incidents_2013_CSV.csv | couchimport

In the above example we use [ccurl](https://github.com/glynnbird/ccurl), a command-line utility that uses the same environment variables as couchimport.

Output

The following output is visible on the console when "couchimport" runs:

couchimport
-----------
 url         : "https://****:****@myhost.cloudant.com"
 database    : "test"
 delimiter   : "\t"
 buffer      : 500
 parallelism : 1
 type        : "text"
-----------
  couchimport Written ok:500 - failed: 0 -  (500) +0ms
  couchimport { documents: 500, failed: 0, total: 500, totalfailed: 0 } +0ms
  couchimport Written ok:499 - failed: 0 -  (999) +368ms
  couchimport { documents: 499, failed: 0, total: 999, totalfailed: 0 } +368ms
  couchimport writecomplete { total: 999, totalfailed: 0 } +0ms
  couchimport Import complete +81ms

The configuration, whether default or overridden by environment variables or command line arguments, is shown. This is followed by a line of output for each block of 500 documents written, plus a cumulative total.

Preview mode

If you want to see a preview of the JSON that would be created from your CSV/TSV files, add --preview true to your command line:

    > cat text.txt | couchimport --preview true
    Detected a TAB column delimiter
    { product_id: '1',
      brand: 'Gibson',
      type: 'Electric',
      range: 'ES 330',
      sold: 'FALSE' }

As well as showing a JSON preview, preview mode also attempts to detect the column delimiter character for you.

Importing large JSON documents

If your source document is a GeoJSON text file, couchimport can be used. Let's say your JSON looks like this:

{ "features": [ { "a":1}, {"a":2}] }

and we need to import each feature object into CouchDB as a separate document. This can be done using the type="json" argument and specifying the JSON path with the jsonpath argument:

  cat myfile.json | couchimport --database mydb --type json --jsonpath "features.*"
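
Each element matching the JSON path becomes a separate document, so the file above yields two documents:

  {"a":1}
  {"a":2}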

Importing JSON Lines file

If your source document is a JSON Lines text file, couchimport can be used. Let's say your JSON Lines file looks like this:

{"a":1}
{"a":2}
{"a":3}
{"a":4}
{"a":5}
{"a":6}
{"a":7}
{"a":8}
{"a":9}

and we need to import each line into CouchDB as a separate document. This can be done using the type="jsonl" argument:

  cat myfile.json | couchimport --database mydb --type jsonl

Importing a stream of JSONs

If your source data is a series of JSON objects appended together, couchimport can still be used. Let's say your file looks like this:

{"a":1}{"a":2}  {"a":3}{"a":4}
{"a":5}          {"a":6}
{"a":7}{"a":8}



{"a":9}

and we need to import each JSON object into CouchDB as a separate document. This can be done using the type="jsonl" argument:

  cat myfile.json.blob | couchimport --database mydb --type jsonl

Overwriting existing data

If you are importing data into a CouchDB database that already contains data, and you are supplying a document _id in your source data, then any _id values that already exist in the database will fail to write because CouchDB will report a 409 Document Conflict. If you want your supplied data to supersede the existing data, supply --overwrite true/-o true as a command-line option. This instructs couchimport to fetch the existing documents' current _rev values and inject them into the imported data stream, so each incoming document replaces the stored revision instead of conflicting.

Note: overwrite mode is slower because an additional API call is required per batch of data imported. Use caution when importing data into a data set that is being changed by another actor at the same time.

Environment variables

  • COUCH_URL - the url of the CouchDB instance (required, or to be supplied on the command line)
  • COUCH_DATABASE - the database to deal with (required, or to be supplied on the command line)
  • COUCH_DELIMITER - the delimiter to use (default '\t', not required)
  • COUCH_TRANSFORM - the path of a transformation function (not required)
  • COUCHIMPORT_META - a JSON object which will be passed to the transform function (not required)
  • COUCH_BUFFER_SIZE - the number of records written to CouchDB per bulk write (defaults to 500, not required)
  • COUCH_FILETYPE - the type of file being imported, either "text", "json" or "jsonl" (defaults to "text", not required)
  • COUCH_JSON_PATH - the path into the incoming JSON document (only required for COUCH_FILETYPE=json imports)
  • COUCH_PREVIEW - run in preview mode
  • COUCH_IGNORE_FIELDS - a comma-separated list of field names to ignore on import or export e.g. price,url,image
  • COUCH_OVERWRITE - overwrite existing document revisions with supplied data
  • COUCH_PARALLELISM - the maximum number of HTTP requests to have in flight at any one time (default: 1)
  • COUCH_MAX_WPS - the maximum number of write API calls to make per second (rate limiting) (default: 0 - no rate limiting)
  • COUCH_RETRY - whether to retry requests which yield a 429 response (default: false)

Command-line parameters

You can also configure couchimport and couchexport using command-line parameters:

  • --help - show help
  • --version - simply prints the version and exits
  • --url/-u - the url of the CouchDB instance (required, or to be supplied in the environment)
  • --database/--db/-d - the database to deal with (required, or to be supplied in the environment)
  • --delimiter - the delimiter to use (default '\t', not required)
  • --transform - the path of a transformation function (not required)
  • --meta/-m - a JSON object which will be passed to the transform function (not required)
  • --buffer/-b - the number of records written to CouchDB per bulk write (defaults to 500, not required)
  • --type/-t - the type of file being imported, either "text", "json" or "jsonl" (defaults to "text", not required)
  • --jsonpath/-j - the path into the incoming JSON document (only required for type=json imports)
  • --preview/-p - if 'true', runs in preview mode (default false)
  • --ignorefields/-i - a comma-separated list of fields to ignore on input or output (default none)
  • --parallelism - the maximum number of HTTP requests to have in flight at any one time (default 1)
  • --maxwps - the maximum number of write API calls to make per second (default 0 - no rate limiting)
  • --overwrite/-o - overwrite existing document revisions with supplied data (default: false)
  • --retry/-r - whether to retry requests which yield a 429 response (default: false)

e.g.

    cat test.csv | couchimport --database bob --delimiter ","

couchexport

If you have structured data in a CouchDB or Cloudant database that has fixed keys and values e.g.

{
    "_id": "badger",
    "_rev": "5-a9283409e3253a0f3e07713f42cd4d40",
    "wiki_page": "http://en.wikipedia.org/wiki/Badger",
    "min_weight": 7,
    "max_weight": 30,
    "min_length": 0.6,
    "max_length": 0.9,
    "latin_name": "Meles meles",
    "class": "mammal",
    "diet": "omnivore",
    "a": true
}

then it can be exported to a CSV like so (note how we set the delimiter):

    couchexport --url http://localhost:5984 --database animaldb --delimiter "," > test.csv

or to a TSV like so (we don't need to specify the delimiter since tab \t is the default):

    couchexport --url http://localhost:5984 --database animaldb > test.tsv

or to a stream of JSON:

    couchexport --url http://localhost:5984 --database animaldb --type jsonl
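
For the animaldb example document above, the exported CSV would begin something like this (a sketch: the headings come from the first document encountered, so the exact columns and their order may vary):

    _id,_rev,wiki_page,min_weight,max_weight,min_length,max_length,latin_name,class,diet,a
    badger,5-a9283409e3253a0f3e07713f42cd4d40,http://en.wikipedia.org/wiki/Badger,7,30,0.6,0.9,Meles meles,mammal,omnivore,true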

N.B.

  • design documents are ignored
  • the first non-design document is used to define the headings
  • if subsequent documents have different keys, then unexpected things may happen
  • COUCH_DELIMITER or --delimiter can be used to provide a custom column delimiter (not required when tab-delimited)
  • if your document values contain carriage returns or the column delimiter, then this may not be the tool for you
  • you may supply a JavaScript --transform function to modify the data on its way out

Using programmatically

In your project, add couchimport into the dependencies of your package.json or run npm install couchimport. In your code, require the library with

    var couchimport = require('couchimport');

and your options are set in an object whose keys are the same as the COUCH_* environment variables:

e.g.

   var opts = { delimiter: ",", url: "http://localhost:5984", database: "mydb" };

To import data from a readable stream (rs):

    var rs = process.stdin;
    couchimport.importStream(rs, opts, function(err,data) {
       console.log("done");
    });

To import data from a named file:

    couchimport.importFile("input.txt", opts, function(err,data) {
       console.log("done",err,data);
    });

To export data to a writable stream (ws):

   var ws = process.stdout;
   couchimport.exportStream(ws, opts, function(err, data) {
     console.log("done",err,data);
   });

To export data to a named file:

   couchimport.exportFile("output.txt", opts, function(err, data) {
      console.log("done",err,data);
   });

To preview a file:

    couchimport.previewCSVFile('./hp.csv', opts, function(err, data, delimiter) {
      console.log("done", err, data, delimiter);
    });

To preview a CSV/TSV on a URL:

    couchimport.previewURL('https://myhosting.com/hp.csv', opts, function(err, data, delimiter) {
      console.log("done", err, data, delimiter);
    });

Monitoring an import

Both importStream and importFile return an EventEmitter which emits

  • written event on a successful write
  • writeerror event when a complete write operation fails
  • writecomplete event after the last write has finished
  • writefail event when an individual line in the CSV fails to be saved as a doc

e.g.

    couchimport.importFile("input.txt", opts, function(err, data) {
      console.log("done", err, data);
    }).on("written", function(data) {
      // data = { documents: 500, failed: 6, total: 63000, totalfailed: 42 }
    });

The emitted data is an object containing:

  • documents - the number of documents written in the last batch
  • total - the total number of documents written so far
  • failed - the number of documents that failed to write in the last batch
  • totalfailed - the number of documents that failed to write in total
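
A fuller sketch that wires up all four events (the payloads of writeerror and writefail are not documented here, so the handlers below simply log whatever they receive):

    var couchimport = require('couchimport');
    var opts = { url: "http://localhost:5984", database: "mydb" };

    couchimport.importFile("input.txt", opts, function(err, data) {
      console.log("done", err, data);
    }).on("written", function(data) {
      // a batch was written e.g. { documents: 500, failed: 0, total: 500, totalfailed: 0 }
      console.log("progress", data.total);
    }).on("writeerror", function(e) {
      // a complete bulk write operation failed
      console.error("write error", e);
    }).on("writefail", function(e) {
      // an individual line failed to be saved as a doc
      console.error("write fail", e);
    }).on("writecomplete", function(data) {
      // the last write has finished
      console.log("complete", data);
    });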

Parallelism & Rate limiting

Using the COUCH_PARALLELISM environment variable or the --parallelism command-line option, couchimport can be configured to write data in multiple parallel operations. If you have the network bandwidth, this can significantly speed up large data imports e.g.

  cat bigdata.csv | couchimport --database mydb --parallelism 10 --delimiter ","

This can be combined with the COUCH_MAX_WPS/--maxwps parameter to limit the number of write API calls dispatched per second, to make sure you don't exceed the number of writes allowed on a rate-limited service.
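
For example, to cap a ten-way parallel import at five write API calls per second:

  cat bigdata.csv | couchimport --database mydb --delimiter "," --parallelism 10 --maxwps 5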
