hydra - the multi-process MongoDB sharded collection copier

License

See the accompanying LICENSE.txt file for licensing terms.

Purpose

This is working reference code that performs a live copy from one MongoDB collection to another, with minimal or no visible impact to your production MongoDB clusters. Keeps the destination up-to-date with changes from the source with a typically small amount of lag.

There are two conditions that must remain true while running the tools in this suite:

mongos's chunk balancer must be disabled using sh.setBalancerState().
The set of source mongod instances (I recommend using secondaries) must remain up. Also, primary sources must remain primaries, and secondary sources must remain secondaries. This prevents dead cursors from interfering with the copy.

This has only been tested on MongoDB 2.2.3 on Ubuntu 12.04. This should work on other Linux platforms but may require work to operate with MongoDB 2.4.x and beyond.

Required Python Packages

To use this software, use pip to install the following packages into your Python environment:

pymongo
gevent version 1.0rc2
- NOTE: Do not use anything older than 1.0rc2. Earlier versions may have stability issues under load.

Usage

copy_collection.py

copy_collection.py copies a MongoDB collection from one MongoDB cluster or standalone mongod instance to another. It does this in three steps:

Creates an initial snapshot of the source collection on the destination cluster/instance.
Copies indexes from source to destination.
Applies oplog entries from source to destination.

Steps #1 and #3 are performed by worker processeses, one for each source you define (more on this below). copy_collection.py routinely records its progress in its state database. After step #1 finishes, steps #2 and #3 can be resume at any time without issue.

Typical usage for copy_collection.py looks like:

copy_collection.py --source source_file.txt --dest mongos_host/database/collection

The file passed to --source must have the following format:

source_database_name.source_collection_name
mongod-instance-1.foo.com
mongod-instance-2.foo.com:27019
mongod-instance-3.foo.com:27018

Alternatively, --source can also accept as a parameter a mongod URL of a form similar to --dest (host[:port]/database/collection).

NOTE: sources need to be mongod instances, and preferably secondaries rather than primaries. I had a difficult time getting sufficient reliability and performance when copying from a mongos instance. However, the destination must be either a mongod instance (for a non-shared MongoDB setup) or a mongos instance (for a sharded setup).

Useful options:

--percent PCT: limits your copy to a percentage of the source's documents; meant to be used with the corresponding --percent option for compare_collections.py
--restart: re-initialize the state database, to restart from the initial snapshot, rather than continuing where we left off
--state-db: specify a path in which to store the state database; this defaults to the current directory

copy_collection.py output

06-04 00:59:27 [INFO:MainProcess   ] using state db /home/user/hydra/test.collection.db
...
06-04 00:59:29 [INFO:shard1.foo.com] 4% | 5000 / 103993 copied | 2215/sec | 0 dupes | 0 exceptions | 0 retries
06-04 00:59:29 [INFO:shard2.foo.com] 3% | 3500 / 105326 copied | 1579/sec | 0 dupes | 0 exceptions | 0 retries
...
06-04 01:06:23 [INFO:shard1.foo.com] done with initial copy
06-04 01:06:23 [INFO:shard2.foo.com] done with initial copy
06-04 01:06:23 [INFO:parent process] building indices
06-04 01:06:23 [INFO:parent process] ensuring index on [(u'_id', 1)] (options = {'name': u'_id_'})
06-04 01:06:23 [INFO:parent process] starting oplog apply
06-04 01:06:23 [INFO:stats         ] OPS APPLIED                                    | WARNINGS
06-04 01:06:23 [INFO:stats         ] total     lag    inserts   removes   updates   | sleeps    exceptions retries
06-04 01:06:26 [INFO:shard1.foo.com] 204        2      0         0         204       | 0         0          0
06-04 01:06:29 [INFO:shard2.foo.com] 214        1      0         0         214       | 0         0          0

Watch out for an excessive number of retries and exceptions. Sleeps are generally OK unless there are an excessive number. Unfortunately, the definition of "excessive" depends on your specific situation.

After copy_collection.py begins applying ops, keep an eye on the lag column, which shows how many seconds behind copy_collection.py's replication is.

compare_collections.py

compare_collections.py compares two collections and is meant to be used with copy_collection.py. The two scripts can run simultaneously, once copy_collection.py is up-to-date with applying ops.

To compensate for small amounts of copy_collection.py lag, compare_collections.py tries the comparison of each document multiple times to check whether the documents eventually match. The number of retries and delay between retries is generous, to compensate for frequently updated documents and lag in copy_collection.py.

compare_collections.py output

06-04 01:23:00 [INFO:shard1.foo.com] 30% | 32000 / 104001 compared | 7659/sec | 1 retries | 0 mismatches
06-04 01:23:00 [INFO:shard2.foo.com] 21% | 22700 / 105831 compared | 5402/sec | 0 retries | 0 mismatches

Retries are OK, but watch out for frequent retries. Those might presage mismatches. The _id's for mismatching documents are written to a file named COLLECTION_mismatches.txt. For example, if your collection name is albums, you'll find any mismatches in albums_mismatches.txt. The mismatches file can be used with the copy_stragglers.py tool that will be discussed below.

copy_stragglers.py

Given the list of _ids in the [collection_name]_mismatches.txt file generated by compare_collections.py, this tool re-copies all documents with the given _ids.

For example, if you had just finished comparing the collection albums and compare_collections.py reported some mismatches, you'd run copy_stragglers.py as follows:

./copy_stragglers.py --source source-mongos.foo.com --dest destination-mongos.foo.com --mismatches-file albums_mismatches.txt

NOTE: Unlike copy_collection.py and compare_collections.py, copy_stragglers.py expects the source to be a mongos instance. This is mainly to keep the code extremely simple.

cluster_cop.py

cluster_cop.py monitors the source MongoDB cluster for configuration changes that can impact copy_collection.py and compare_collections.py. These are:

Chunk balancing must be off throughout the whole migration
Primary mongod instances must remain primaries, secondaries must remain secondaries (this prevents cursors from dying while being used)

dropbox/hydra

dropbox

Reviews

Repository Details