This repository was archived on 25 February 2020.

Precog

Precog is an advanced analytics engine for NoSQL data. It's sort of like a traditional analytics database, but instead of working with normalized, tabular data, it works with denormalized data that may not have a uniform schema.

You can plop large amounts of JSON into Precog and, without any preprocessing, start doing analytics such as time series analysis, filtering, rollups, statistics, and even some kinds of machine learning.

There's an API for developer integration, and a high-level application called Labcoat for doing ad hoc and exploratory analytics.
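
As a rough illustration of the ingest side of that API, the sketch below pushes a couple of schema-free JSON records at an ingest-style HTTP endpoint from Scala. The host, port, path, and apiKey parameter are hypothetical placeholders, not the documented Precog API; the point is that the two records need not share a schema.

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object IngestSketch {
  // Hypothetical endpoint and API key, for illustration only.
  val endpoint = "http://localhost:30060/ingest/v1/fs/events?apiKey=YOUR-API-KEY"

  def main(args: Array[String]): Unit = {
    // Two records with different shapes: no uniform schema is required.
    val records =
      """{"user": "alice", "action": "login", "ts": "2013-05-01T12:00:00Z"}
        |{"user": "bob", "cart": {"items": 3, "total": 42.5}}""".stripMargin

    val request = HttpRequest.newBuilder(URI.create(endpoint))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(records))
      .build()

    val response = HttpClient.newHttpClient()
      .send(request, HttpResponse.BodyHandlers.ofString())

    println(s"${response.statusCode()}: ${response.body()}")
  }
}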

Developers have used Precog to build reporting features into applications (its APIs are comprehensive and developer-friendly), and data scientists have used Precog together with Labcoat to perform ad hoc analysis of semi-structured data.

This is the Community Edition of Precog. For more information about commercial support and maintenance options, check out SlamData, Inc., the official sponsor of the Precog open source project.

Community

  • Precog-Dev - An open email list for developers of Precog.
  • Precog-User - An open email list for users of Precog.
  • #precog - An IRC channel for Precog.
  • #quirrel - An IRC channel for the Quirrel query language.

Developer Guide

A few landmarks:

  • common - Data structures and service interfaces that are shared between multiple submodules.

  • quirrel - The Quirrel compiler, including the parser, static analysis code, and bytecode emitter.

    • Parser
    • Binder
    • ProvenanceChecker
  • mimir - The Quirrel optimizer, evaluator, and standard library.

    • EvaluatorModule
    • StdLibModule
    • StaticInlinerModule
  • yggdrasil - Core data access and manipulation layer (see the simplified columnar sketch below).

    • TableModule
    • ColumnarTableModule
    • Slice
    • Column
  • niflheim - Low-level columnar block store (NIHDB).

    • NIHDB
  • ingest - BlueEyes service front-end for data ingest.

  • muspelheim - Convergence point for the compiler and evaluator stacks; integration test sources and data.

    • ParseEvalStack
    • MiscStackSpecs
  • surtr - Integration tests that run on the NIHDB backend. Surtr also provides a (somewhat defunct) REPL that gives access to the evaluator and other parts of the Precog environment.

    • NIHDBPlatformSpecs
    • REPL
  • bifrost - BlueEyes service front-end for the query evaluation (shard) service.

  • miklagard - Standalone versions for the desktop and alternate backend data stores; see the local README.rst. These need a bit of work to bring them up to date; they were disabled some time ago and may have bitrotted.

  • util - Generic utility functions and data structures that are not specific to any particular function of the Precog codebase; convenience APIs for external libraries.

Thus, to work on the evaluator, one would be in the mimir project, writing tests in the mimir and muspelheim projects. The tests in the muspelheim project would be run from the surtr project (not from muspelheim), but using the test data stored in muspelheim. All of the other projects are significantly saner.
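
Since yggdrasil's Slice and Column come up constantly when working in this area, here is a highly simplified sketch of the columnar idea those landmarks refer to. It mirrors the shape of the real abstractions but is not the actual API:

// Simplified sketch of the columnar model: a Slice is a fixed-size block of
// rows stored column-by-column, and a Column is a typed accessor that may be
// undefined at any given row, so heterogeneous records simply populate
// different (possibly overlapping) columns.
trait Column {
  def isDefinedAt(row: Int): Boolean
}

trait LongColumn extends Column {
  def apply(row: Int): Long
}

trait StrColumn extends Column {
  def apply(row: Int): String
}

// Columns are keyed by the path at which their values occur in the JSON.
final case class Slice(size: Int, columns: Map[String, Column]) {
  def definedAt(path: String, row: Int): Boolean =
    columns.get(path).exists(_.isDefinedAt(row))
}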

Getting Started

Step one: obtain PaulP's sbt runner script (sbt-extras). At this point, ideally you would be able to run ./build-test.sh and everything would be fine. Unfortunately, at the present time, you have to jump through a few hoops to get all of the dependencies in place.

First, you need to clone and build blueeyes. This should be relatively painless: grab the repository and run sbt publish-local. After everything finishes, you can move on to the next ball of wax: Kafka. Unfortunately, Kafka has yet to publish any public Maven artifacts, much less artifacts for precisely the version on which Precog depends. At the current time, the best way to deal with this problem is to simply grab the tarball of Ivy dependencies and extract it into your ~/.ivy2/cache/ directory. Once this is done, you should be ready to go.

Altogether, you need to run the following commands:

$ git clone git@github.com:jdegoes/blueeyes.git
$ cd blueeyes
$ sbt publish-local
$ cd ..
$ cd /tmp
$ wget https://dl.dropboxusercontent.com/u/1679797/kafka-stuff.tar.gz
$ tar xf kafka-stuff.tar.gz -C ~/.ivy2/cache/
$ cd -
$ cd platform
$ sbt

From here, you must run the following tasks in order:

  • test:compile
  • ratatoskr/assembly
  • extract-data
  • test

The last one should take a fair amount of time, but when it completes (and everything is green), you can have a pretty solid assurance that you're up and running!

In order to more easily navigate the codebase, it is highly recommended that you install CTAGS, if your editor supports it. Our filename conventions are... inconsistent.

Building and Running

These instructions are at best rudimentary, but should be sufficient to get started in a minimal way. More will be coming soon!

The Precog environment is organized in a modular, service-oriented fashion with loosely coupled components that are relatively tolerant to the failure of any single component (with likely degraded function). Most of the components allow for redundant instances of the relevant service, although in some cases (bifrost in particular) some tricky configuration is required, which will not be detailed here.

Services:

  • bifrost - The primary service for evaluating Quirrel queries against NIHDB data.
  • auth - Authentication provider (checks tokens and grants; to be merged with accounts in the near term)
  • accounts - Account provider (records association between user information and an account root token; to be merged with auth in the near term)
  • dvergr - A simple job tracking service that is used to track batch query completion.
  • ingest - The primary service for adding data to the Precog database.

Runnable jar files for all of these services can be built using the sbt assembly target from the root (platform) project. Sample configuration files can be found in the <projectname>/configs/dev directory of each relevant project; to run a simple test instance you can use the start-shard.sh script. Note that this will download, configure, and run local instances of MongoDB, Apache Kafka, and ZooKeeper. Additional instructions for running the Precog database in a server environment will be coming soon.

Contributing

All Contributions are bound by the terms and conditions of the Precog Contributor License Agreement.

Pull Request Process

We use a pull request model for development. When you want to work on a new feature or bug, create a new branch based off of master (do not base off of another branch unless you absolutely need the work in progress on that branch). Collaboration is highly encouraged; accidental branch dependencies are not. Your branch name should be given one of the following prefixes:

  • topic/ - For features, changes, refactorings, etc. (e.g. topic/parse-function)
  • bug/ - For things that are broken, investigations, etc. (e.g. bug/double-allocation)
  • wip/ - For code that is not ready for team-wide sharing (e.g. wip/touch-me-and-die)

If you see a topic/ or bug/ branch on someone else's repository that has changes you need, it is safe to base off of that branch instead of master, though you should still base off of master if at all possible. Do not ever base off of a wip/ branch! This is because the commits in a wip/ branch may be rewritten, rearranged or discarded entirely, and thus the history is not stable.

Do your work on your local branch, committing as frequently as you like, squashing and rebasing off of updated master (or any other topic/ or bug/ branch) at your discretion.

When you are confident in your changes and ready for them to land, push your topic/ or bug/ branch to your own fork of platform (you can create a fork on GitHub).

Once you have pushed to your fork, submit a Pull Request using GitHub's interface. Take a moment to describe your changes as a whole, particularly highlighting any API or Quirrel language changes which land as part of the changeset.

Once your pull request is ready to be merged, it will be brought into the staging branch, a branch on the mainline repository that exists purely to aggregate pull requests. It is not a development branch; it is used to run the full build as a final sanity check before the changes are pushed as a fast-forward to master. This process ensures a minimum of friction between concurrent tasks while making it extremely difficult to break the build in master: build problems are generally caught and resolved in pull requests and, in very rare cases, in staging. It also provides a natural and fluid avenue for code review and discussion, ensuring that the entire team is involved and aware of everything that is happening. Code review is everyone's responsibility.

Rebase Policy

There is one hard and fast rule: if the commits have been pushed, do not rebase. Once you push a set of commits, either to the mainline repository or your own fork, you cannot rebase those commits any more. The only exception to this rule is if you have pushed a wip/ branch, in which case you are allowed to rebase and/or delete the branch as needed.

The reason for this policy is to encourage collaboration and avoid merge conflicts. Rewriting history is a lovely Git trick, but it is extremely disruptive to others if you rewrite history out from under their feet. Thus, you should only ever rebase commits which are local to your machine. Once a commit has been pushed on a non-wip/ branch, you no longer control that commit and you cannot rewrite it.

With that said, rebasing locally is highly encouraged, assuming you're fluent enough with Git to know how to use the tool. As a rule of thumb, always rebase against the branch that you initially cut your local branch from whenever you are ready to push. Thus, my workflow looks something like the following:

$ git checkout -b topic/doin-stuff
...
# hack commit hack commit hack commit hack
...
$ git fetch upstream
$ git branch -f master upstream/master
$ git rebase -i master
# squash checkpoint commits, etc
$ git push origin topic/doin-stuff

If I had based off a branch other than master, such as a topic/ branch on another fork, then obviously the branch names would be different. The basic workflow remains the same though.

Once I get beyond the last command though, everything changes. I can no longer rebase the topic/doin-stuff branch. Instead, if I need to bring in changes from another branch, or even just resolve conflicts with master, I need to use git merge. This is because someone else may have decided to start a project based on topic/doin-stuff, and I cannot just rewrite commits which they are now depending on.

To summarize: rebase privately, merge publicly.

Roadmap

Phase 1: Simplified Deployment

Precog was originally designed to be offered exclusively via the cloud in a multi-tenant offering. As such, it has made certain tradeoffs that make it much harder for individuals and casual users to install and maintain.

In the current roadmap, Phase 1 involves simplifying Precog to the point where there are so few moving pieces that anyone can install and launch Precog, and keep it running without anything more than an occasional restart.

The work is currently tracked in the Simplified Precog milestone, which is divided into individual tickets.

Many of these tickets indirectly contribute to Phase 2, by bringing the foundations of Precog closer into alignment with HDFS.

Phase 2: Support for Big Data

Currently, Precog can only handle the amount of data that can reside on a single machine. While there are many optimizations that still need to be made (such as support for indexes, type-specific columnar compression, etc.), a bigger win with more immediate impact will be making Precog "big data-ready", where it can compete head-to-head with Hive, Pig, and other analytics options for Hadoop.

Spark is an in-memory computational framework that runs as a YARN application inside a Hadoop cluster. It can read from and write to the Hadoop file system (HDFS), and exposes a wide range of primitives for performing data processing. Several high-performance, scalable query systems have been built on Spark, such as Shark and BlinkDB.

Given that Spark's emphasis is on fast, in-memory computation, that it's written in Scala, and that it has already been used to implement several query languages, it seems an ideal target for Precog.

The work is currently divided into the following tickets:

  • Introduce a "group by" operator into the intermediate algebra
  • Refactor solve with simpler & saner semantics
  • Create a table representation based on Spark's RDD
  • Implement table ops in terms of Spark operations (a rough sketch follows this list)
  • TODO
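
As a rough sketch of the last two tickets above (and only a sketch: it assumes rows can be modeled as parsed JSON-like values, and every name here is hypothetical rather than a committed design), a Spark-backed table might begin life as something like:

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object SparkTableSketch {
  type Row = Map[String, Any] // stand-in for a real JSON value type

  // A hypothetical RDD-backed table with a single "group by" op expressed
  // in terms of Spark's own primitives.
  final case class RddTable(rows: RDD[Row]) {
    def groupByField(field: String): RDD[(Any, Iterable[Row])] =
      rows.filter(_.contains(field)).groupBy(_(field))
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("sketch"))

    // Heterogeneous-friendly rows: each record is just a map of fields.
    val table = RddTable(sc.parallelize(Seq(
      Map[String, Any]("user" -> "alice", "amount" -> 10),
      Map[String, Any]("user" -> "bob",   "amount" -> 3),
      Map[String, Any]("user" -> "alice", "amount" -> 7)
    )))

    table.groupByField("user").collect().foreach(println)
    sc.stop()
  }
}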

Alternate Front-Ends

Support for dynamically-typed, multi-dimensional SQL ("SQL for heterogeneous JSON"), and possibly other query interfaces such as JSONiq and UNQL.

License

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.

Legalese

Copyright (C) 2010 - 2013 SlamData, Inc. All Rights Reserved. Precog is a registered trademark of SlamData, Inc, licensed to this open source project.
