• Stars
    star
    247
  • Rank 164,117 (Top 4 %)
  • Language
    Java
  • License
    Apache License 2.0
  • Created about 10 years ago
  • Updated over 9 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Fast and efficient batch computation engine for complex analysis and reporting of massive datasets on Hadoop

Cubert is a fast and efficient batch computation engine for complex analysis and reporting of massive datasets on Hadoop.

Cubert Documentation hosted at github.

Cubert Users Google Group: cubert-users. Email: [email protected]

Cubert is ideally suited for the following application domains:

  1. Statistical Calculations, Joins and Aggregations

    Cubert introduces a new model of computation that allows users to organize data in a format that is ideally suited for scalable execution of subsequent query processing operators, and a set of algorithmically-efficient operators (MeshJoin and CUBE) that exploit the organization to provide significantly improved CPU and resource utilization compared to existing solutions.

  2. Cubes and Grouping Set Aggregations

    The power-horse is the new CUBE operator that can efficiently (CPU and memory) compute additive, non-additive (e.g. Count Distinct) and exact percentile rank (e.g. Median) statistics; can roll up inner dimensions on-the-fly and compute multiple measures within a single job.

  3. Time range calculation and Incremental computations

    Cubert primitives are specially suited for reporting workflows that employ computation pattern that is both regular and repetitive, allowing for efficiency gains from partial result caching and incremental processing.

  4. Graph computations

    Cubert provides a novel sparse matrix multiplication algorithm that is best suited for analytics with large-scale graphs.

  5. When performance or resources are a matter of concern

    Cubert Script is a developer-friendly language that takes out the hints, guesswork and surprises when running the script. The script provides the developers complete control over the execution plan (without resorting to low-level programming!), and is extremely extensible by adding new functions, aggregators and even operators.

A flavor of Cubert Script

Cubert script is a physical script where we explicitly define the operators at the Mappers, Reducers and Combiners for the different jobs. Following is an example of the Word Count problem written in cubert script.

JOB "word count job"
    REDUCERS 10;
    MAP {
            // load the input data set as a TEXT file
            input = LOAD "$CUBERT_HOME/examples/words.txt" USING TEXT("schema": "STRING word");
            // add a column to each tuple
            with_count = FROM input GENERATE word, 1 AS count;
    }
    // shuffle the data and also invoke combiner to aggregate on map-side
    SHUFFLE with_count PARTITIONED ON word AGGREGATES COUNT(count) AS count;
    REDUCE {
            // at the reducers, sum the counts for each word
            output = GROUP with_count BY word AGGREGATES SUM(count) AS count;
    }
	// store the output using TEXT format
    STORE output INTO "output" USING TEXT();
END

While the Cubert Script code above is already very concise representation of the Word Count problem; as a matter of interest, the idiomatic way of writing in Cubert is even more concise (and a lot faster)!

JOB "idiomatic word count program (even more concise!)"
        REDUCERS 10;
        MAP {
                input = LOAD "$CUBERT_HOME/examples/words.txt" USING TEXT("schema": "STRING word");
        }
        CUBE input BY word AGGREGATES COUNT(word) AS count GROUPING SETS (word);
        STORE input INTO "output" USING TEXT();
END

Installation

Download or clone the repository (say, into /path/to/cubert) and run the following command:

$ cd /path/to/cubert
$ ./gradlew

This will create a folder /path/to/cubert/release, which is what we will need to run cubert. This folder can be copied to hadoop cluster gateway.

To run cubert, first make sure that Hadoop is installed and the HADOOP_HOME environment variable points to the hadoop installation. Set the CUBERT_HOME environment to the release folder (note: CUBERT_HOME points to the release folder and not the "root" repository folder).

$ export CUBERT_HOME=/path/to/cubert/release
$ $CUBERT_HOME/bin/cubert -h
	Using HADOOP_CLASSPATH=:/path/to/cubert/release/lib/*

	usage: ScriptExecutor <cubert script file> [options]
	 -c,--compile             stop after compilation
	 -D <property=value>      use value for given property
	 -d,--debug               print debuging information
	 ...

Example Scripts: Sample scripts are available in the $CUBERT_HOME/examples folder.

Cubert Operators

Cubert provides a rich suite of operators for processing data. These include:

  • Input/Output operators: LOAD, STORE, TEE (storing data at intermediate stages of computation into a separate folder) and LOAD-CACHED (read from Hadoop's distributed cache).
  • Transformation operators: FROM .. GENERATE (similar to SQL's SELECT), FILTER, LIMIT, TOPN (select top N rows from a window of rows), DISTINCT, SORT, DUPLICATE and FLATTEN (transform nested rows into flattened rows), MERGE JOIN and HASH JOIN.
  • Aggregation operators: GROUP BY and CUBE (OLAP cube for additive and non-additive aggregates).
  • Data movement operators: SHUFFLE (standard Hadoop shuffle), BLOCKGEN (create partitioned data units, aka blocks), COMBINE (combine two or more blocks), PIVOT (create sub-blocks out of a block, based on specified partitioning scheme).
  • Dictionary operators to encode and decode strings to integers.

Coming (really) soon..

  • Tez execution engine: the existing cubert scripts (written in Map-Reduce paradigm) can run unmodified on underlying Tez engine.
  • Cubert Script v2: a DAG oriented language to express complex composition of workflows.
  • Incremental computations: daily running jobs that read multiple days of data can materialize partial output and incrementally compute the results (thus cutting down on resources and time).
  • Analytical window functions.

Documentation

Users Guide and Javadoc available at

http://linkedin.github.io/Cubert

Cubert Users Google Group: cubert-users

Email: [email protected]

More Repositories

1

hopscotch

A framework to make it easy for developers to add product tours to their pages.
JavaScript
4,200
star
2

LayoutKit

LayoutKit is a fast view layout library for iOS, macOS, and tvOS.
Swift
3,162
star
3

camus

LinkedIn's previous generation Kafka to HDFS pipeline.
Java
883
star
4

indextank-engine

Indexing engine for IndexTank
Java
844
star
5

LIExposeController

Expose style navigation for iOS apps
Objective-C
742
star
6

Selene

iOS library which schedules the execution of tasks on a background fetch
Objective-C
642
star
7

datafu

Hadoop library for large-scale data processing, now an Apache Incubator project
Java
585
star
8

cleo

A flexible, partial, out-of-order and real-time typeahead search library
Java
559
star
9

sensei

distributed realtime searchable database
Java
540
star
10

inject

AMD and CJS dependency management in the browser
JavaScript
464
star
11

indextank-service

The API, BackOffice, Storefront, and Nebulizer for IndexTank
Python
382
star
12

venus.js

where bugs go to die
JavaScript
298
star
13

Fiber

Lightweight JavaScript prototypal inheritance model
JavaScript
279
star
14

sepia

Sepia is a VCR-like module for node.js that records HTTP interactions, then plays them back exactly like the first time they were invoked
JavaScript
278
star
15

JTune

A high precision Java CMS optimizer
Python
271
star
16

scanns

A scalable nearest neighbor search library in Apache Spark
Scala
253
star
17

naarad

Naarad is a highly configurable system analysis tool that parses and plots timeseries data for better visual correlation. Naarad was built to help in performance analysis and investigations.
Python
238
star
18

simoorg

Failure inducer framework
Python
191
star
19

white-elephant

Hadoop log aggregator and dashboard
Java
191
star
20

nginx-config-builder

A python library for building nginx configuration files programatically
Python
170
star
21

Zopkio

A Functional and Performance Test Framework for Distributed Systems
Python
160
star
22

fossor

A plugin-oriented tool for automating the investigation of broken hosts and services.
Python
158
star
23

api-get-started

LinkedIn REST API Getting Started Tutorial
Java
158
star
24

dustjs-helpers

Helpers for dustjs-linkedin
JavaScript
114
star
25

archetype

Archetype is a Compass/Sass based framework for authoring configurable, composable UI components and patterns.
Ruby
102
star
26

Isaac

This library parses data from JSON objects into NSObject models without needing to write parsing code for each model.
Objective-C
97
star
27

linkedin-utils

Base utilities shared by all linkedin open source projects
Java
88
star
28

lafayette

Lafayette is a system to store various email abuse reports sent in ARF.
Python
74
star
29

rest.li-api-hub

API Hub is a web UI for browsing and searching a catalog of Rest.li APIs.
Scala
73
star
30

jaqen

Jaqen - Simple DNS rebinding
Go
70
star
31

Backbone.TableView

Backbone View to render collections as tables
CoffeeScript
70
star
32

linkedin-zookeeper

This project provides utilities and wrappers around ZooKeeper
Java
64
star
33

sometime

A BurpSuite plugin to detect Same Origin Method Execution vulnerabilities
Java
60
star
34

RookBoom

A web application for creating meetings.
Scala
45
star
35

datacl

A collection of efficient utilities for a data scientist.
C
40
star
36

mobster

Mobster is a tool that can help you get deeper understanding into the performance of mobile web applications on real mobile devices
Python
38
star
37

vagrant-autodns

Vagrant plugin for automagically managing guest DNS
Ruby
36
star
38

dmarc-msys

This set of scripts in Lua implements DMARC policy checking and reporting for the Message Systems MTA products, a popular extendable commercial MTA.
Lua
36
star
39

talkin

TalkIn is an interface providing safe and easy unidirectional cross-document communication.
JavaScript
31
star
40

play-testng-plugin

TestNG runner for the Play Framework 2.4
Java
24
star
41

sin

JavaScript
24
star
42

insframe

Central hub for distributing web apps to multiple browsers on multiple environments
JavaScript
22
star
43

Tachyon-iOS

Tachyon provides configurable UI components for iOS that are commonly used in calendar features and applications.
Objective-C
21
star
44

postcss-lang-optimizer

PostCSS plugin to extract language specific CSS rulesets to their own CSS files to optimize stylesheet delivery.
JavaScript
21
star
45

bowser

Extensible language parser with Python-like syntax. Written in Java and antlr.
Java
18
star
46

adfullssl

AdFullSsl is a tool that can automatically detect SSL non-compliant ads and fix them
Python
16
star
47

dustjs-filters-secure

extend dustjs-linkedin to enhance the filters methods
JavaScript
15
star
48

gradle-plugin-insight

Automatic, effortless, accurate documentation for any Gradle plugin
Groovy
13
star
49

timingz.js

Measure code execution in the browser and derive statistical data
JavaScript
13
star
50

Idiomatic-JSLint

JavaScript
12
star
51

streaming

10
star
52

custom-gradle-plugin-portal

An example implementation of a gradle plugin portal.
Java
9
star
53

sbt-restli

A collection of sbt plugins providing build integration for the rest.li REST framework
Scala
9
star
54

PTYHooks

Python
9
star
55

MTBT

Java
9
star
56

inject-bower

Please use linkedin/inject
JavaScript
6
star
57

rest.li-skeleton.g8

Rest.li tool for generating skeleton rest.li projects.
Shell
5
star
58

naarad-examples

Example logs and configs for naarad
3
star
59

html5-presentation

Code for the "Building a Performant HTML5 App" presentation at http://www.meetup.com/SF-Web-Performance-Group/events/71651452/
2
star