• Stars: 100
• Rank: 339,430 (Top 7%)
• Language: Clojure
• License: Eclipse Public License
• Created: over 10 years ago
• Updated: almost 9 years ago

Repository Details

stable, high-throughput journalling to S3

This library allows an ordered stream of entries to be uploaded to Amazon's S3 datastore. It is implemented using Factual's durable-queue library, which means that entries will survive process death, and that memory usage will not be affected by stalls in the uploading process. Despite this, on a c1.xlarge AWS instance it can easily journal more than 10k entries/sec, comprising more than 10 MB/sec in their compressed, serialized form.

However, this is not a distributed or replicated store, and in the case of node failure it may lose data. The amount of data lost will typically be less than 5 MB (the minimum upload size allowed by the S3 service), but this library should not be used in any application which cannot tolerate this sort of data loss. The ideal use case is high-throughput logging, especially where external infrastructure is unavailable or impractical.

usage

[factual/s3-journal "0.1.2"]

This library exposes only three functions in the s3-journal namespace: journal, which constructs a journal object that can be written to; put!, which writes to the journal; and stats, which returns information about the state of the journal.
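As a quick sketch (the option keys are explained in the table below; the bucket, credentials, and put! argument order shown here are placeholders and assumptions, not documented guarantees):

(require '[s3-journal :as s3])

;; construct a journal; all values below are placeholders
(def j
  (s3/journal
    {:s3-access-key     "AKIA..."
     :s3-secret-key     "..."
     :s3-bucket         "my-log-bucket"     ; must already exist
     :local-directory   "/tmp/s3-journal"   ; local durable-queue directory
     :max-batch-latency 1000                ; flush to disk at least once per second...
     :max-batch-size    10000}))            ; ...or after 10k entries, whichever comes first

;; write an entry (assumed to take the journal first, then the entry)
(s3/put! j "hello, world")

;; inspect progress
(s3/stats j)

;; flush any remaining writes to S3 and shut down
(.close j)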

All configuration is passed in as a map to (journal options), with the following parameters:

name                 | required? | description
:s3-access-key       | yes       | your AWS access key
:s3-secret-key       | yes       | your AWS secret key
:s3-bucket           | yes       | the AWS bucket that will be written to; must already exist
:s3-directory-format | no        | the directory format, as a SimpleDateFormat string; should not have leading or trailing slashes; defaults to yyyy/MM/dd
:local-directory     | yes       | the directory on the local file system that will be used for queueing; will be created if it doesn't already exist
:encoder             | no        | a function that takes an entry and returns something that can be converted to bytes via byte-streams
:compressor          | no        | either one of :gzip, :snappy, :lzo, :bzip2, or a custom function that takes a sequence of byte-arrays and returns a compressed representation
:delimiter           | no        | a delimiter that will be placed between entries; defaults to a newline character
:max-batch-latency   | yes       | how long, in milliseconds, entries should be batched before being written to disk
:max-batch-size      | yes       | the maximum number of entries that can be batched before being written to disk
:fsync?              | no        | whether the journal will fsync after writing a batch to disk; defaults to true
:id                  | no        | a globally unique string identifying the journal writing to the given location on S3; defaults to the hostname
:expiration          | no        | the maximum time, in milliseconds, that pending uploads from other processes may remain open without being closed by this process; this prevents orphaned multipart uploads from permanently shut-down processes persisting forever in a partially uploaded state (and thus remaining invisible to normal S3 operations); defaults to nil, which disables the expiration behavior
:shards              | no        | the number of top-level directories within the bucket to split entries across; useful for high-throughput applications; defaults to nil
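For example, a journal that writes gzipped, newline-delimited entries into hourly directories might be configured along these lines (every value is illustrative; pr-str simply stands in for whatever encoder suits your entries):

(require '[s3-journal :as s3])

(def options
  {:s3-access-key       "AKIA..."            ; placeholder credentials
   :s3-secret-key       "..."
   :s3-bucket           "my-log-bucket"
   :local-directory     "/var/lib/s3-journal"
   :s3-directory-format "yyyy/MM/dd/HH"      ; hourly directories instead of the daily default
   :encoder             pr-str               ; entry -> string, convertible to bytes via byte-streams
   :compressor          :gzip
   :delimiter           "\n"                 ; the default, shown here for clarity
   :shards              4                    ; spread entries across 4 top-level directories
   :max-batch-latency   1000
   :max-batch-size      10000})

(def j (s3/journal options))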

Fundamentally, the central tradeoff in these settings is data consistency vs. throughput.

If we persist each entry as it comes in, our throughput is limited to the number of IOPS our hardware can handle. However, if we can afford to lose small amounts of data (and we almost certainly can, otherwise we'd be writing each entry to a replicated store individually, rather than in batch), we can bound our loss using the :max-batch-latency and :max-batch-size parameters. At least one of these parameters must be defined, but usually it's best to define both. Defining our batch size bounds the amount of memory that can be used by the journal, and defining our batch latency bounds the amount of time that a given entry is susceptible to the process dying. Setting :fsync? to false can greatly increase throughput, but removes any safety guarantees from the other two parameters - use this parameter only if you're sure you know what you're doing.
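As a rough illustration of how these bounds interact (the ~1 KB figure is an assumed average entry size, not a property of the library):

;; Suppose entries average ~1 KB in serialized form.  Then:
;;
;;   :max-batch-size    10000  =>  at most ~10 MB of unflushed entries held in memory
;;   :max-batch-latency 1000   =>  at most ~1 second of entries exposed to sudden process death
;;
;; before the batch is written (and, with :fsync? left at its default of true, synced) to disk.
(def batching-options
  {:max-batch-size    10000
   :max-batch-latency 1000
   :fsync?            true})   ; set to false only if you can tolerate losing these guarantees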

If more than one journal on a given host is writing to the same bucket and directory on S3, a unique identifier for each must be chosen. This identifier should be consistent across process restarts, so that partial uploads from a previous process can be properly handled. One approach is to add a prefix to the hostname, which can be determined by (s3-journal/hostname).
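For example (the prefixes below are arbitrary, but must stay the same across restarts so partial uploads can be recovered):

(require 's3-journal)

;; two journals on the same host, writing to the same bucket and directory,
;; distinguished by a stable prefix plus the hostname
(def id-a (str "writer-a-" (s3-journal/hostname)))
(def id-b (str "writer-b-" (s3-journal/hostname)))

;; each journal is then constructed with its own id, e.g.
;; (s3-journal/journal (assoc options :id id-a))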

Calling (.close journal) will flush all remaining writes to S3, and only return once they have been successfully written. A journal which has been closed cannot accept any further entries.
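One way to make sure this happens on clean JVM exit is a shutdown hook; close-on-shutdown! below is a hypothetical helper, not part of the library:

;; register a shutdown hook that flushes and closes the journal when the JVM exits cleanly
(defn close-on-shutdown! [journal]
  (.addShutdownHook (Runtime/getRuntime)
                    (Thread. ^Runnable (fn [] (.close journal)))))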

Calling (stats journal) returns a data structure in this form:

{:queue {:in-progress 0
         :completed 64
         :retried 1
         :enqueued 64
         :num-slabs 1
         :num-active-slabs 1}
 :enqueued 5000000
 :uploaded 5000000}

The :enqueued key describes how many entries have been enqueued, and the :uploaded key how many have been uploaded to S3. The :queue values correspond to the statistics reported by the underlying durable-queue.
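For instance, the gap between those two counters gives a rough measure of the upload backlog (backlog below is a hypothetical helper, not part of the library):

(require '[s3-journal :as s3])

(defn backlog
  "Entries that have been accepted by the journal but not yet uploaded to S3."
  [journal]
  (let [{:keys [enqueued uploaded]} (s3/stats journal)]
    (- enqueued uploaded)))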

logging

The underlying AWS client libraries log at the INFO level whenever there is an error calling into AWS. s3-journal follows the same convention: recoverable errors are logged at INFO, and unrecoverable errors, such as corrupted data read back from disk, are logged at WARN. In almost all cases the journal will continue to work in the face of these errors, but a block of entries may be lost as a result.

license

Copyright Β© 2014 Factual, Inc.

Distributed under the Eclipse Public License version 1.0.

More Repositories

1. drake (Clojure, 1,481 stars): Data workflow tool, like a "Make for data"
2. skuld (Clojure, 300 stars): Distributed task tracking system
3. geo (Clojure, 294 stars): Clojure library for working with geohashes, polygons, and other world geometry
4. riffle (Clojure, 136 stars): Write-once key/value storage engine
5. clj-leveldb (Clojure, 75 stars): Clojure bindings for LevelDB
6. timely (Clojure, 35 stars): A Clojure DSL for cron and scheduling
7. open-dockerfiles (Shell, 28 stars): Factual's open source dockerfiles
8. parquet-rewriter (Java, 18 stars): A library to mutate parquet files
9. beercode-open (Java, 17 stars): Open-source code backed by the Factual Beer Guarantee
10. clj-helix (Clojure, 13 stars): Clojure bindings for Apache Helix
11. c4 (Clojure, 9 stars): Convenience features for handling record files the Clojure way
12. eliza (Clojure, 6 stars)
13. sosueme (Clojure, 6 stars): A collection of Clojure functions for things we like to do
14. patchwork (Python, 5 stars): Factual dependency management tool
15. docker-mariadb-10.0-galera (Shell, 5 stars)
16. solr-mapreduce-indexer (Java, 4 stars): Partial copy of the Solr/Lucene contrib MapReduce indexer tool that works on Solr 6.x, with some bug fixes and dependencies compiled in
17. smaker (Python, 4 stars): Extends the standard Snakemake library by supporting arbitrary snakefile aggregation and re-use through middleware that parses generic wildcards in snakefiles
18. torpedo (Clojure, 2 stars): Lets you torpedo complex functional expressions
19. drake-interface (Clojure, 2 stars): Defines Drake interfaces
20. factual-android-sdk-demo (Java, 2 stars)
21. engine-segment-integration-android (Java, 1 star): Factual Engine / Segment Analytics Android integration
22. docker-collins (Dockerfile, 1 star)
23. marathon-apps-exporter (HTML, 1 star): Marathon apps exporter for Prometheus
24. kudos (Ruby, 1 star)
25. docker-osm2pgsql (Shell, 1 star): Docker image for running osm2pgsql
26. sdk-examples (Java, 1 star): Factual SDK examples
27. jackalope (Clojure, 1 star): GitHub integration service to support custom release planning processes