This README is just a quick overview of the Skream project.

Skream

Skream is a high-performance time-series library with memory-footprint guarantees. The name is short for "online SKetching and stREAMing," and is pronounced like "scream" in English. For scalability, Skream is written in Clojure and has zero side-effects. The library includes a simple RESTful web-service for clients.

Artifacts

Skream artifacts are released to Clojars.

If you are using Maven, add the following repository definition to your pom.xml:

<repository>
  <id>clojars.org</id>
  <url>http://clojars.org/repo</url>
</repository>

The Most Recent Release

With Leiningen:

[skream "0.0.2"]

With Maven:

<dependency>
  <groupId>skream</groupId>
  <artifactId>skream</artifactId>
  <version>0.0.2</version>
</dependency>

Features

Skream views a time-series as a simple sequence of numbers. These numbers are read sequentially and in order, with minimal state maintained. Such a sequence could be sensor readings from an Internet-of-Things project, or stock prices in a high-frequency trading (HFT) strategy.

What queries does a Skream currently support?

  • Basic summary statistics (count, minimum, maximum, sum, mean / average)
  • Variance & standard deviation
  • Skewness & kurtosis
  • Arbitrary higher-order moments (standardized & unstandardized)
  • Range counts
  • Gaussian range counts (e.g. count of elements within 0.42 standard deviations)
  • Histograms (evenly-spaced & Gaussian bins)
  • Exponential moving average
  • Simple moving average
  • Approximate membership (via Bloom filters)
  • Approximate individual element counts (via Count-Min sketches)
  • Distinct element count (via HyperLogLog sketches)
  • Approximate median (via the P² algorithm)
  • Approximate arbitrary quantiles (e.g. the 25% quantile, the analogue of the 50% median)
  • Approximate mutual information between two Skreams (via histograms)

All of these queries are supported with a fixed memory footprint, with new numbers added sequentially, i.e. in an online fashion. The exception is the simple moving average, which requires a window of recent numbers to be maintained as state.
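To illustrate how a statistic can be maintained in constant memory, here is a minimal sketch of Welford's online method for the mean and variance. It is a standard technique and not necessarily the one Skream uses internally. The names empty-summary, add-num and stdev are hypothetical, and the map keys merely mirror those that appear in Skream's printed output.

```clojure
;; Hypothetical running summary: three numbers of state, however long the series.
(def empty-summary {:count 0, :mean 0.0, :sum-sq-diffs 0.0})

(defn add-num
  "Fold one number x into the summary using Welford's update."
  [{n :count, mean :mean, m2 :sum-sq-diffs} x]
  (let [n'    (inc n)
        delta (- x mean)
        mean' (+ mean (/ delta n'))]
    {:count        n'
     :mean         mean'
     ;; Accumulates the sum of squared differences from the current mean.
     :sum-sq-diffs (+ m2 (* delta (- x mean')))}))

(defn stdev
  "Sample standard deviation (n - 1 denominator); nil for fewer than two numbers."
  [{n :count, m2 :sum-sq-diffs}]
  (when (> n 1)
    (Math/sqrt (/ m2 (dec n)))))

;; Each number is seen once and then discarded.
(def summary
  (reduce add-num empty-summary [2.0 4.0 4.0 4.0 5.0 5.0 7.0 9.0]))
```

Because add-num is a pure function from an old summary to a new one, it composes directly with reduce. A simple moving average cannot be reduced to a few scalars like this, since the oldest number must be remembered in order to drop it from the window.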

Example

Skream includes a simple command-line interface for experimenting and testing. First clone the Git repository, and then build & run locally using the usual lein executable:

$ git clone [email protected]:someben/skream.git
$ cd skream; lein run
>>> SKREAM 10162

The first header line written to STDERR indicates that a single empty Skream is ready to accept numbers. The header line also reports the PID of the underlying JVM process (e.g. 10162).

Next send a command to tell the Skream which queries to track. For a simple test, just use the "track-default" convenience function that tracks a handful of common, useful statistics. Type the following as STDIN input:

(track-default *sk*)

The *sk* earmuffed root variable refers to the process' single Skream. The server responds with the empty Skream, but it is now tracking some statistics and ready for queries:

{:min nil, [:unstd-moment 2] nil, [:moment 4] nil, [:moment 2] nil, :mean nil, [:win 5] [], :skew nil, :kurt nil, [:unstd-moment 4] nil, [:quantile 0.5] nil, :stdev nil, [:moment 3] nil, :max nil, [:unstd-moment 3] nil, :count 0, :last nil, :sum 0}

The Skream is printed to the console as a Clojure map literal (s-expression). Now you can send a stream of numbers to the same process, by entering numbers as STDIN input, one per line:

42.42

The server responds with a Skream that has seen one number:

{:min 42.42, [:unstd-moment 2] 0, [:moment 4] 0, [:moment 2] 0, :mean 42.42, [:win 5] [42.42], :skew 0, :kurt 0, [:unstd-moment 4] 0, [:quantile 0.5] {:quantile nil}, :stdev 0, :sum-sq-diffs 0, [:moment 3] 0, :median nil, :max 42.42, [:unstd-moment 3] 0, :count 1, :last 42.42, :sum 42.42}

If you pass a number to this interface (e.g. "42.42" above), it is interpreted as a new number to be added to the Skream. Any other input (e.g. the "track-default" function call) is evaluated, and its result becomes the new Skream itself.

Next we restart the process, and add a short time-series of recent closing prices of the SPY ETF traded on the NYSE:

...hit (Ctrl-C) or (Ctrl-D)
$ lein run  # from user
>>> SKREAM 11193
(track-default *sk*)  ; from user
{:min nil, ..., :sum 0}
188.74  ; from user
{:min 188.74, ..., :sum 188.74}
187.55  ; from user
189.13  ; from user
...
195.94  ; from user
195.88  ; from user

The final response from the server process shows some useful statistics that have been calculated on the fly:

{:min 187.55, [:unstd-moment 2] 157.20218399999857, [:moment 4] 54.60116919834653, [:moment 2] 24.0, :mean 193.17920000000004, [:win 5] [194.83 196.26 196.48 195.94 195.88], :skew -0.6849247938368417, :kurt -0.49252635219653795, [:unstd-moment 4] 2342.5917520379735, [:quantile 0.5] {:quantile 193.72229, :qs [187.55 191.74467 193.72229 195.5992 196.48], :ns [1 7 13 19 25], :dns [0 1/4 1/2 3/4 1], :nps [1 7N 13N 19N 25]}, :stdev 2.5593145566733178, :sum-sq-diffs 157.20218399999857, [:moment 3] -15.123139447917465, :median 193.72229, :max 196.48, [:unstd-moment 3] -253.52042674559698, :count 25, :last 195.88, :sum 4829.48}

Note that higher-order statistics such as the kurtosis (about -0.49) and the median (about 193.72) are estimated without maintaining a full history of the stock prices.
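As a quick sanity check, a few of the printed values can be related to one another. The sketch below copies numbers from the final response and rests on two assumptions inferred from that output, not from Skream's source: that :stdev is the sample standard deviation derived from :sum-sq-diffs with an n - 1 denominator, and that [:moment k] is [:unstd-moment k] divided by the k-th power of :stdev. The names final-sk, close-to? and the derived vars are hypothetical.

```clojure
;; Values copied from the final response above.
(def final-sk
  {:count 25
   :sum 4829.48
   :mean 193.17920000000004
   :stdev 2.5593145566733178
   :sum-sq-diffs 157.20218399999857
   [:unstd-moment 3] -253.52042674559698
   [:moment 3] -15.123139447917465})

(defn close-to?
  "True when a and b differ by less than tol."
  [a b tol]
  (< (Math/abs (- a b)) tol))

;; Mean recomputed from the running sum and count.
(def mean-from-sum
  (/ (:sum final-sk) (:count final-sk)))

;; Assumption: sample standard deviation, i.e. an n - 1 denominator.
(def stdev-from-ssd
  (Math/sqrt (/ (:sum-sq-diffs final-sk) (dec (:count final-sk)))))

;; Assumption: standardized moment = unstandardized moment / stdev^k.
(def moment-3
  (/ (get final-sk [:unstd-moment 3])
     (Math/pow (:stdev final-sk) 3)))
```

Under these assumptions, mean-from-sum, stdev-from-ssd and moment-3 agree with the printed :mean, :stdev and [:moment 3] up to floating-point rounding, which can be confirmed with close-to? at a REPL.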

License

Skream is released under the Eclipse Public License, so you can easily incorporate the library into your commercial or non-commercial projects.

Architecture

Skream is a Leiningen Clojure project with decent automated test coverage. The main data structures are simple Clojure maps, with sequential updates handled by functions stored in the maps' metadata.
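One way to picture this design is sketched below. It is not Skream's actual code: the track and add-number functions and the :updaters metadata key are hypothetical names chosen for illustration. The sketch only shows the general idea of keeping statistics in a plain map while the functions that update them travel along in the map's metadata.

```clojure
(defn track
  "Return sk with stat-key initialised to init and its update function
  registered under the (hypothetical) :updaters key of the map's metadata."
  [sk stat-key init update-fn]
  (-> sk
      (assoc stat-key init)
      (vary-meta assoc-in [:updaters stat-key] update-fn)))

(defn add-number
  "Apply every registered updater to x, returning a new map.
  The input map is never mutated, and metadata is carried over."
  [sk x]
  (reduce-kv (fn [acc k f] (update acc k f x))
             sk
             (:updaters (meta sk))))

;; Track three simple statistics on an initially empty map.
(def sk
  (-> {}
      (track :count 0   (fn [n _] (inc n)))
      (track :sum   0   (fn [s x] (+ s x)))
      (track :max   nil (fn [m x] (if (nil? m) x (max m x))))))

;; Feed a short series through the pure update function.
(def result (reduce add-number sk [3 1 4 1 5]))
```

Since metadata does not take part in equality or printing, result prints as an ordinary map of statistics, while (meta result) still holds the update functions needed to process the next number.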

Performance

Everything is done in memory, without side-effects, and with only a minimal amount of state, which lets Skream scale across large time-series. Side-effect-free updates are done in parallel, utilizing every CPU core on the server.