write-once key/value storage engine

Riffle is a read-only key/value storage format, strongly influenced by the cdb and sorted-string table formats. Like cdb, it has a fixed memory cost per key (12 bytes per key), rather than having to keep the entire keyspace in memory. Like sorted-string tables, it allows for block compression of the values, and allows for files to be merged in linear time. Like both formats, a typical lookup requires a single disk read.
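As a rough sizing aid, the per-key cost quoted above can be turned into an estimate of index memory. This is only a sketch in plain shell arithmetic: the key count and the helper name index_bytes are illustrative assumptions, not part of Riffle.

```shell
# Per-key memory cost quoted in the description above, in bytes
BYTES_PER_KEY=12

# index_bytes KEY_COUNT
# Hypothetical helper: prints the estimated in-memory index size in bytes.
index_bytes() {
  echo $(( $1 * BYTES_PER_KEY ))
}

keys=1000000000   # assumed example: one billion keys in a single file
bytes=$(index_bytes "$keys")
echo "$bytes bytes (~$(( bytes / 1048576 )) MiB)"
```

For one billion keys this works out to 12,000,000,000 bytes, or roughly 11.2 GiB.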

Riffle files can be built either locally or via Hadoop, allowing for datasets comprising billions of entries to be compiled into a set of sharded Riffle files.

getting started

To use Riffle in your project, add this to your project.clj:

[factual/riffle "0.1.3"]

To use the riffle command-line tool, clone the Riffle repository, make sure Leiningen is installed, and then install the tool:

cd /tmp
git clone https://github.com/Factual/riffle.git
cd riffle
./scripts/install.sh DIRECTORY

where DIRECTORY is a directory on your $PATH. You can now use the riffle tool to build, read, validate, and benchmark files.

Let's build a small Riffle file using a TSV key/value file:

echo -e "1\t2\n3\t4\n" | riffle build > /tmp/riffle

This is equivalent to the map {"1" "2", "3" "4"}. Now we can do simple things like list the keys in the file, and look up values:

> riffle -k /tmp/riffle
3
1

> riffle -g 3 /tmp/riffle
4
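Since echo -e behaves differently across shells, the same two-entry TSV input can also be written with printf. The sketch below is one possible way to do that; the temporary file and the sanity check are illustrative additions, and the build step only runs if the riffle tool is actually installed.

```shell
# Temporary file for the tab-separated input (the path is illustrative)
tsv=$(mktemp)

# printf interprets \t and \n itself and reuses the format for each key/value pair
printf '%s\t%s\n' 1 2 3 4 > "$tsv"

# Sanity check: every line must have exactly two tab-separated fields
if awk -F'\t' 'NF != 2 { bad = 1 } END { exit bad }' "$tsv"; then
  echo "tsv ok"
fi

# Build only when the riffle tool is on the PATH
if command -v riffle >/dev/null 2>&1; then
  riffle build "$tsv" > /tmp/riffle
fi
```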

We can pass in an arbitrary number of Riffle files, in which case the right-most files will take precedence:

> echo -e "3\t5" | riffle build > /tmp/riffle2

> riffle -k /tmp/riffle /tmp/riffle2
3
4
1

> riffle -g 3 /tmp/riffle /tmp/riffle2
5

> riffle -g 3 /tmp/riffle2 /tmp/riffle
4
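The precedence rule can be modelled without Riffle at all. The sketch below applies the same "right-most wins" lookup to plain TSV files using awk; it is only an illustration of the semantics, not how Riffle implements lookups, and the lookup helper and file names are made up for the example.

```shell
dir=$(mktemp -d)
printf '1\t2\n3\t4\n' > "$dir/a.tsv"   # same entries as /tmp/riffle above
printf '3\t5\n'       > "$dir/b.tsv"   # same entries as /tmp/riffle2 above

# lookup KEY FILE...
# Hypothetical helper: awk reads the files left to right, so a later
# match overwrites an earlier one and the right-most file wins.
lookup() {
  key=$1
  shift
  awk -F'\t' -v k="$key" '$1 == k { v = $2; found = 1 } END { if (found) print v }' "$@"
}

lookup 3 "$dir/a.tsv" "$dir/b.tsv"   # prints 5
lookup 3 "$dir/b.tsv" "$dir/a.tsv"   # prints 4
```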

We can also arbitrarily combine TSV and Riffle files to create new Riffle files.

> echo -e "1\t42" > /tmp/input.tsv

> riffle build /tmp/riffle /tmp/riffle2 /tmp/input.tsv > /tmp/riffle3

> riffle -g 1 /tmp/riffle3
42

Riffle stores keys and values as binary data, but for the convenience of the command-line tool all data is treated as plaintext. To build a file with binary data, you can specify that the input is Base64 encoded with the -b flag:

> echo -e "`echo -n hello | base64`\t`echo -n goodbye | base64`" > /tmp/binary.tsv

> riffle build -b /tmp/binary.tsv > /tmp/binary-riffle

> riffle -kb /tmp/binary-riffle
aGVsbG8

> riffle -k /tmp/binary-riffle
hello
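For more than one entry, writing the Base64 by hand gets tedious. The following is a minimal sketch of a filter that turns a plaintext TSV into the Base64-encoded form; the b64 and encode_tsv helpers are hypothetical names, and it assumes keys and values contain no tabs or newlines.

```shell
# b64 STRING
# Encodes one string; tr removes the line wrapping that base64 adds to long output.
b64() {
  printf '%s' "$1" | base64 | tr -d '\n'
}

# encode_tsv
# Reads key<TAB>value lines on stdin, writes base64(key)<TAB>base64(value) lines.
encode_tsv() {
  while IFS="$(printf '\t')" read -r key value; do
    printf '%s\t%s\n' "$(b64 "$key")" "$(b64 "$value")"
  done
}

printf 'hello\tgoodbye\n' | encode_tsv
```

The output can be redirected to a file such as /tmp/binary.tsv and passed to riffle build -b as in the example above.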

Additional tasks include validate and benchmark:

> riffle validate /tmp/binary-riffle
1 block(s), 67.00 average bytes per compressed block
no bad blocks

> riffle benchmark /tmp/binary-riffle
with 1 reader:
throughput: 28456.82 reads/sec
latencies (in ms):
  25.0%  0.03
  50.0%  0.03
  75.0%  0.03
  90.0%  0.04
  95.0%  0.05
  99.0%  0.07
  99.9%  0.19

...

riffle and hadoop

To compile a Riffle index via Hadoop, use riffle hadoop build src1 src2 ... srcN dst, which takes tab-delimited text input files and builds Riffle indices. To merge existing indices, use riffle hadoop merge src1 src2 ... srcN dst, which takes multiple Riffle indices and merges them together, with precedence given to the right-most index. Both commands must be run in a context where the Hadoop environment is already configured.

To build from a source other than tab-delimited files, it's recommended that you customize the mapper for the RiffleBuildJob, which is trivial to modify. Once modified, your custom Hadoop job can be installed via scripts/install.sh, and invoked via the same riffle hadoop ... mechanism.

riffle as a library

To build a Riffle index at runtime, use riffle.write/write-riffle, which takes a sequence of key/value tuples, an output file, and an optional set of parameters.

> (require '[riffle.write :as w] '[riffle.read :as r])
nil
> (w/write-riffle [["a" "b"] ["c" "d"]] "/tmp/riffle4")
#<File /tmp/riffle4>

This file may be loaded as an index using riffle.read/riffle and accessed via riffle.read/get and riffle.read/entries:

> (def riff (r/riffle "/tmp/riffle4"))
#'riff
> (r/get riff "a")
#<byte[] [B@6a2361b0>

Notice that get returns the value as a byte array rather than a string.

license

Copyright © 2014 Factual, Inc

Distributed under the Eclipse Public License v1.0
