Scalable REST API wrapper for the Caffe deep learning framework.
The problem
Caffe is an awesome deep learning framework, but running it on a single laptop or desktop computer isn't nearly as productive as running it in the cloud at scale.
ElasticThought gives you the ability to:
- Run multiple Caffe training jobs in parallel
- Queue up training jobs
- Tune the number of workers that process jobs on the queue
- Interact with it via a REST API (and later build Web/Mobile apps on top of it)
- Support multiple tenants, so that several users can interact with it while each has access to only their own data
Components
- Caffe - core deep learning framework
- Couchbase Server - Distributed document database used as an object store (source code)
- Sync Gateway - REST adapter layer for Couchbase Server + Mobile Sync gateway
- CBFS - Couchbase Distributed File System used as blob store
- NSQ - Distributed message queue
- ElasticThought REST Service - REST API server written in Go
Deployment Architecture
Here is what a typical cluster might look like:
If running on AWS, each CoreOS instance would be running on its own EC2 instance.
Although not shown, all components would be running inside of Docker containers.
It would be possible to start more nodes which only had Caffe GPU workers running.
Roadmap
Current Status: everything under heavy construction, not ready for public consumption yet
- [done] Working end-to-end with the IMAGE_DATA Caffe layer, using a single test set with a single training set, and the ability to query the trained set.
- [done] Support LEVELDB / LMDB data formats, to run the MNIST example.
- [in progress] Package everything up to make it easy to deploy locally or in the cloud
- Support the majority of caffe use cases
- Ability to auto-scale worker instances up and down based on how many jobs are in the message queue.
- Attempt to add support for other deep learning frameworks: pylearn2, cuda-convnet, etc.
- Build a Web App on top of the REST API that leverages PouchDB
- Build Android and iOS mobile apps on top of the REST API that leverage Couchbase Mobile
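The auto-scaling item in the roadmap above is not implemented yet, so the following is only a sketch of one possible policy, not ElasticThought's design. It assumes a hypothetical `desired_workers` helper that is given the current queue depth, a target number of jobs per worker, and lower and upper bounds on the worker count. How the queue depth is obtained is left out.

```shell
#!/bin/sh
# Hypothetical helper, not part of ElasticThought.
# Usage: desired_workers QUEUE_DEPTH JOBS_PER_WORKER MIN_WORKERS MAX_WORKERS
desired_workers() {
  depth=$1
  per_worker=$2
  min=$3
  max=$4
  # Ceiling division: enough workers that none has more than per_worker jobs.
  want=$(( (depth + per_worker - 1) / per_worker ))
  # Clamp the result to the allowed range.
  if [ "$want" -lt "$min" ]; then want=$min; fi
  if [ "$want" -gt "$max" ]; then want=$max; fi
  echo "$want"
}
```

For example, `desired_workers 5 2 1 10` prints 3: five queued jobs at two jobs per worker need three workers, which is within the 1 to 10 range.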
Design goals
- 100% Open Source (Apache 2 / BSD), including all components used.
- Architected to enable warehouse scale computing
- No IaaS lock-in -- easily migrate between AWS, GCE, or your own private data center
- Ability to scale down as well as up
Documentation
System Requirements
ElasticThought requires CoreOS to run.
If you want to access the GPU, you will need to do extra work to get CoreOS working with the NVIDIA CUDA GPU drivers.
Installing elastic-thought on a single CoreOS host (Development mode)
If you are on OSX, you'll first need to install Vagrant, VirtualBox, and CoreOS. See CoreOS on Vagrant for instructions.
Here's what will be created:
┌─────────────────────────────────────────────────────────┐
│                       CoreOS Host                       │
│  ┌──────────────────────────┐  ┌─────────────────────┐  │
│  │     Docker Container     │  │  Docker Container   │  │
│  │   ┌───────────────────┐  │  │    ┌────────────┐   │  │
│  │   │  Elastic Thought  │  │  │    │Sync Gateway│   │  │
│  │   │      Server       │  │  │    │  Database  │   │  │
│  │   │   ┌───────────┐   │  │  │    │            │   │  │
│  │   │   │In-process │   │◀─┼──┼───▶│            │   │  │
│  │   │   │   Caffe   │   │  │  │    │            │   │  │
│  │   │   │  worker   │   │  │  │    │            │   │  │
│  │   │   └───────────┘   │  │  │    └────────────┘   │  │
│  │   └───────────────────┘  │  └─────────────────────┘  │
│  └──────────────────────────┘                           │
└─────────────────────────────────────────────────────────┘
Run the following commands on your CoreOS box (to get in, you may need to run vagrant ssh core-01).
Start Sync Gateway Database
$ docker run -d --name sync-gateway -P couchbase/sync-gateway:1.1.0-forestdb_bucket sync_gateway https://gist.githubusercontent.com/tleyden/8051567cf62dfa8f89ca/raw/43d4abc9ef64cef7b4bbbdf6cb8ce80c456efd1f/gistfile1.txt
Start ElasticThought REST API server
$ docker run -d --name elastic-thought -p 8080:8080 --link sync-gateway:sync-gateway tleyden5iwx/elastic-thought-cpu-develop bash -c 'refresh-elastic-thought; elastic-thought --sync-gw http://sync-gateway:4984/elastic-thought'
It's also a good idea to check the logs of both containers to look for any errors:
$ docker logs sync-gateway
$ docker logs -f elastic-thought
At this point you can test the API via curl.
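The API's endpoints are not listed in this section, so the sketch below only checks that the server answers HTTP at all before you start sending real requests. It assumes you run it on the CoreOS host, where the `-p 8080:8080` mapping above exposes the server on port 8080. `wait_for_http` is a hypothetical helper name, and the retry count and delay are arbitrary defaults.

```shell
#!/bin/sh
# Hypothetical helper, not part of ElasticThought.
# Usage: wait_for_http URL [TRIES] [DELAY_SECONDS]
# Succeeds as soon as URL returns any HTTP response; fails after TRIES attempts.
wait_for_http() {
  url=$1
  tries=${2:-30}
  delay=${3:-2}
  i=0
  while [ "$i" -lt "$tries" ]; do
    # -s silences progress output, -o /dev/null discards the body.
    # curl exits 0 when it received an HTTP response, whatever the status code.
    if curl -s -o /dev/null --max-time 5 "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

A call such as `wait_for_http http://localhost:8080/ && echo "server is answering"` would then gate any further curl requests. If it times out, the container logs shown above are the first place to look.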
Installing elastic-thought on AWS (Production mode)
It should be possible to install elastic-thought anywhere that CoreOS is supported. Currently, there are instructions for AWS and Vagrant (below).
Launch EC2 instances via CloudFormation script
Note: the instance will launch in us-east-1. If you want to launch in another region, please file an issue.
- Launch CPU Stack or Launch GPU Stack
- Choose 3 node cluster with m3.medium or g2.2xlarge (GPU case) instance type
- All other values should be default
Verify CoreOS cluster
Run:
$ fleetctl list-machines
This should list all the CoreOS machines in your cluster. Because fleetctl uses etcd under the hood, it also validates that etcd is set up correctly.
Kick off ElasticThought
Ssh into one of the machines (doesn't matter which), with agent forwarding enabled: ssh -A core@<public-hostname-of-instance>
$ wget https://raw.githubusercontent.com/tleyden/elastic-thought/master/docker/scripts/elasticthought-cluster-init.sh
$ chmod +x elasticthought-cluster-init.sh
$ ./elasticthought-cluster-init.sh -v 3.0.1 -n 3 -u "user:passw0rd" -p gpu
Once it launches, verify your cluster by running fleetctl list-units.
It should look like this:
UNIT MACHINE ACTIVE SUB
[email protected] 2340c553.../10.225.17.229 active running
[email protected] fbd4562e.../10.182.197.145 active running
[email protected] 0f5e2e11.../10.168.212.210 active running
[email protected] 2340c553.../10.225.17.229 active running
[email protected] fbd4562e.../10.182.197.145 active running
[email protected] 0f5e2e11.../10.168.212.210 active running
couchbase_bootstrap_node.service 0f5e2e11.../10.168.212.210 active running
couchbase_bootstrap_node_announce.service 0f5e2e11.../10.168.212.210 active running
couchbase_node.1.service 2340c553.../10.225.17.229 active running
couchbase_node.2.service fbd4562e.../10.182.197.145 active running
[email protected] 2340c553.../10.225.17.229 active running
[email protected] fbd4562e.../10.182.197.145 active running
[email protected] 0f5e2e11.../10.168.212.210 active running
[email protected] 2340c553.../10.225.17.229 active running
[email protected] fbd4562e.../10.182.197.145 active running
[email protected] 0f5e2e11.../10.168.212.210 active running
[email protected] 2340c553.../10.225.17.229 active running
[email protected] fbd4562e.../10.182.197.145 active running
[email protected] 0f5e2e11.../10.168.212.210 active running
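With this many units, it can be easier to check the listing mechanically than by eye. The sketch below assumes the four whitespace-separated columns shown above (UNIT, MACHINE, ACTIVE, SUB) and treats anything other than "active running" as unhealthy. `unhealthy_units` is a hypothetical helper name, not something fleetctl provides.

```shell
#!/bin/sh
# Hypothetical helper, not part of fleetctl.
# Reads `fleetctl list-units` output on stdin and prints the name of every
# unit whose ACTIVE/SUB columns are not "active"/"running".
# Exit status is 0 only when every listed unit is healthy.
unhealthy_units() {
  awk '
    NR == 1 { next }    # skip the header row
    NF < 4  { next }    # skip blank or malformed lines
    $3 != "active" || $4 != "running" { print $1; bad = 1 }
    END { exit (bad ? 1 : 0) }
  '
}
```

On a cluster node this would be used as `fleetctl list-units | unhealthy_units`. No output and a zero exit status mean every unit reports active and running.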
At this point you should be able to access the REST API on the public IP of any of the three Sync Gateway machines.
Installing elastic-thought on Vagrant (Staging mode)
This mode tries to replicate the Production mode described above, but on Vagrant instead of AWS.
Update Vagrant
Make sure you're running a current version of Vagrant, otherwise the plugin install below may fail.
$ vagrant -v
1.7.1
Install CoreOS on Vagrant
Clone the coreos/vagrant fork that has been customized for running ElasticThought.
$ cd ~/Vagrant
$ git clone git@github.com:tleyden/coreos-vagrant.git
$ cd coreos-vagrant
$ cp config.rb.sample config.rb
$ cp user-data.sample user-data
By default this runs a two-node cluster. To change this, update the $num_instances variable in the config.rb file.
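If you prefer to script the change, the sketch below rewrites that setting in place. It assumes config.rb contains a line of the form `$num_instances=<n>`, so check your copy before relying on it. `set_num_instances` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper, not part of coreos-vagrant.
# Usage: set_num_instances FILE COUNT
# Rewrites the `$num_instances=...` line in FILE so the cluster has COUNT nodes.
set_num_instances() {
  file=$1
  count=$2
  tmp=$(mktemp) || return 1
  # Match an optionally indented assignment to $num_instances and replace
  # only its value; every other line is copied through unchanged.
  sed 's/^\([[:space:]]*\$num_instances[[:space:]]*=\).*/\1'"$count"'/' "$file" > "$tmp" &&
    cat "$tmp" > "$file"   # write back through the original file to keep its permissions
  status=$?
  rm -f "$tmp"
  return "$status"
}
```

For a three-node cluster you would run `set_num_instances config.rb 3` before `vagrant up`.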
Run CoreOS
$ vagrant up
Ssh in:
$ vagrant ssh core-01 -- -A
If you see:
Failed Units: 1
user-cloudinit@var-lib-coreos\x2dvagrant-vagrantfile\x2duser\x2ddata.service
Jump to Workaround CoreOS + Vagrant issues below.
Verify things started up correctly:
core@core-01 ~ $ fleetctl list-machines
If you get errors like:
2015/03/26 16:58:50 INFO client.go:291: Failed getting response from http://127.0.0.1:4001/: dial tcp 127.0.0.1:4001: connection refused
2015/03/26 16:58:50 ERROR client.go:213: Unable to get result for {Get /_coreos.com/fleet/machines}, retrying in 100ms
Jump to Workaround CoreOS + Vagrant issues below.
Workaround CoreOS + Vagrant issues:
First exit out of CoreOS:
core@core-01 ~ $ exit
On your OSX workstation, try the following workaround:
$ sed -i '' 's/420/0644/' user-data
$ sed -i '' 's/484/0744/' user-data
$ vagrant reload --provision
Ssh back in:
$ vagrant ssh core-01 -- -A
Verify it worked:
core@core-01 ~ $ fleetctl list-machines
You should see:
MACHINE IP METADATA
ce0fec18... 172.17.8.102 -
d6402b24... 172.17.8.101 -
I filed CoreOS cloudinit issue 328 to figure out why this error is happening (possibly related issues: CoreOS cloudinit issue 261 or CoreOS cloudinit issue 190)
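The two sed substitutions in the workaround swap decimal numbers for octal file modes: 420 and 484 are the decimal values of octal 0644 and 0744. One plausible reading, which is an assumption and not confirmed by this document, is that the permissions in user-data ended up serialized as decimal integers. The arithmetic itself can be checked with printf. `to_octal_mode` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper for checking the arithmetic behind the workaround.
# Usage: to_octal_mode DECIMAL_MODE
# Prints the mode as a zero-padded, four-digit octal number, as used by chmod.
to_octal_mode() {
  printf '%04o\n' "$1"
}
```

`to_octal_mode 420` prints 0644 and `to_octal_mode 484` prints 0744, matching the replacement values in the sed commands.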
Continue steps above
Scroll up to the Installing elastic-thought on AWS section and continue from Verify CoreOS cluster.
FAQ
- Is this useful for grid computing / distributed computation? Ans: No, this is not trying to be a grid computing (aka distributed computation) solution. You may want to check out Caffe Issue 876 or ParameterServer
Related Projects
- DeepDetect - REST API wrapper for Caffe
- caffe-docker-classifier - Set up your own classifier API based on pre-trained GoogleNet model and Caffe.
License
Apache 2