  • Stars: 264
  • Rank 154,314 (Top 4 %)
  • Language: Java
  • License: Apache License 2.0
  • Created over 12 years ago
  • Updated 9 months ago

Repository Details

Multidimensional data storage with rollups for numerical data

Introduction

A data cube is an abstraction for counting things in complicated ways (Wikipedia). This project is a Java implementation of a data cube backed by a pluggable database backend.

The purpose of a data cube is to store aggregate information about large numbers of data points. The data cube stores aggregate information about interesting subsets of the input data points. For example, if you're writing a web server log analyzer, your input points could be log lines, and you might be interested in keeping a count for each browser type, each browser version, OS type, OS version, and other attributes. You might also be interested in counts for a particular combination of (browserType,browserVersion,osType), (browserType,browserVersion,osType,osVersion), etc. It's a challenge to quickly add and change counters without wasting time writing database code and reprocessing old data into new counters. A data cube helps you keep these counts. You declare what you want to count, and the data cube maintains all the counters as you supply new data points.

A bit more mathily, if your input data points have N attributes, then the number of counters you may have to store is the product of the cardinalities of all N attributes in the worst case. The goal of the datacube project is to help you maintain these counters in a simple declarative way without any nested switch statements or other unpleasantness.
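To make the worst case concrete, here is a small sketch that multiplies per-attribute cardinalities to bound the number of counters for one combination of attributes. The class name and the cardinalities are made up for illustration and are not part of datacube; the code assumes Java 8 or later.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of datacube: bounds the number of counters
// needed to count every combination of values across a set of attributes.
final class CounterBound {
    // Worst case: every combination of attribute values actually occurs,
    // so the bound is the product of the cardinalities.
    static long worstCaseCounters(Map<String, Integer> cardinalities) {
        long product = 1L;
        for (int cardinality : cardinalities.values()) {
            // multiplyExact throws ArithmeticException instead of silently overflowing.
            product = Math.multiplyExact(product, (long) cardinality);
        }
        return product;
    }

    public static void main(String[] args) {
        // Made-up cardinalities for a web server log analyzer.
        Map<String, Integer> cardinalities = new LinkedHashMap<>();
        cardinalities.put("browserType", 5);
        cardinalities.put("browserVersion", 20);
        cardinalities.put("osType", 4);
        System.out.println(worstCaseCounters(cardinalities)); // prints 400
    }
}
```

In practice many combinations never occur, so the number of counters actually stored is usually far below this bound.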

Urban Airship uses the datacube project to support its analytics stack for mobile apps. We handle about 10K events per second per node.

Requires JDK 1.6.

Features

  • Performance: high-speed asynchronous batching IO backend
  • Bulk loading with Hadoop MapReduce
  • Pluggable database interface

IO

Each input data point may affect multiple counts in the data cube. For example, if you're counting events with a timestamp, a single event may increment the count for its hour, day, month, and year, ending up with four increments that must be applied to the database. Updating the database for each of these increments wouldn't scale to thousands of events per second, so we use the standard trick of batching counter updates in the client. When an input data point is given to the data cube, it updates a batch in memory for each of the affected counters. Periodically the batches are flushed to the backing database. If a single counter is incremented multiple times in the same batch, the increments are combined into a single database update.
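The batching idea described above can be sketched roughly as follows. This is not datacube's actual implementation: the class CounterBatcher, the string counter keys, and the plain Map standing in for the database are all assumptions made for illustration, and the flush is triggered by hand rather than periodically. It assumes Java 8 or later.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of client-side counter batching; not datacube's code.
final class CounterBatcher {
    private final Map<String, Long> pending = new HashMap<>();
    private final Map<String, Long> database; // stand-in for the backing database
    private int databaseWrites = 0;

    CounterBatcher(Map<String, Long> database) {
        this.database = database;
    }

    // Record an increment in memory; repeated increments to one key are combined.
    synchronized void increment(String counterKey, long amount) {
        pending.merge(counterKey, amount, Long::sum);
    }

    // Apply each combined increment as a single database update, then clear the batch.
    synchronized void flush() {
        for (Map.Entry<String, Long> entry : pending.entrySet()) {
            database.merge(entry.getKey(), entry.getValue(), Long::sum);
            databaseWrites++;
        }
        pending.clear();
    }

    synchronized int databaseWrites() {
        return databaseWrites;
    }

    public static void main(String[] args) {
        Map<String, Long> db = new HashMap<>();
        CounterBatcher batcher = new CounterBatcher(db);
        // One event affects its hour, day, month, and year counters.
        String[] keys = {"hour:2012-05-01T13", "day:2012-05-01", "month:2012-05", "year:2012"};
        for (int event = 0; event < 3; event++) {
            for (String key : keys) {
                batcher.increment(key, 1L);
            }
        }
        batcher.flush();
        System.out.println(db.get("day:2012-05-01")); // prints 3
        System.out.println(batcher.databaseWrites()); // prints 4, not 12
    }
}
```

Three events touching four counters each would cost twelve database updates if applied one by one; combining them in the batch reduces that to four.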

TODO parameters to tune, implementation details to explain parameters

Bulk loading / backfilling

Database backend

A data cube can be backed by any database that supports a key-value interface and allows iterating over keys. To add support for a new database backend, implement the DbHarness interface and optionally the IdService interface. See HBaseDbHarness.java and HBaseIdService.java for examples. If you add support for a new database, we'd love to have you contribute your work back into the datacube project.

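Currently HBase is the only supported persistent backing database. The example below uses MapDbHarness, which stores its data in an in-memory ConcurrentHashMap.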

Example

IdService idService = new CachingIdService(5, new MapIdService());
ConcurrentMap<BoxedByteArray,byte[]> backingMap = 
        new ConcurrentHashMap<BoxedByteArray, byte[]>();
        
DbHarness<LongOp> dbHarness = new MapDbHarness<LongOp>(backingMap, LongOp.DESERIALIZER, 
        CommitType.READ_COMBINE_CAS, idService);

HourDayMonthBucketer hourDayMonthBucketer = new HourDayMonthBucketer();

Dimension<DateTime> time = new Dimension<DateTime>("time", hourDayMonthBucketer, false, 8);
Dimension<String> zipcode = new Dimension<String>("zipcode", new StringToBytesBucketer(), 
        true, 5);
        
DataCubeIo<LongOp> cubeIo = null;
DataCube<LongOp> cube;
        
Rollup hourAndZipRollup = new Rollup(zipcode, time, HourDayMonthBucketer.hours);
Rollup dayAndZipRollup = new Rollup(zipcode, time, HourDayMonthBucketer.days);
Rollup hourRollup = new Rollup(time, HourDayMonthBucketer.hours);
Rollup dayRollup = new Rollup(time, HourDayMonthBucketer.days);
        
List<Dimension<?>> dimensions =  ImmutableList.<Dimension<?>>of(time, zipcode);
List<Rollup> rollups = ImmutableList.of(hourAndZipRollup, dayAndZipRollup, hourRollup,
        dayRollup);
        
cube = new DataCube<LongOp>(dimensions, rollups);

cubeIo = new DataCubeIo<LongOp>(cube, dbHarness, 1, Long.MAX_VALUE, SyncLevel.FULL_SYNC);
        
DateTime now = new DateTime(DateTimeZone.UTC);
        
// Do an increment of 5 for a certain time and zipcode
cubeIo.writeSync(new LongOp(5), new WriteBuilder(cube)
        .at(time, now)
        .at(zipcode, "97201"));
        
// Do an increment of 10 for the same zipcode in a different hour of the same day
DateTime differentHour = now.withHourOfDay((now.getHourOfDay()+1)%24);
cubeIo.writeSync(new LongOp(10), new WriteBuilder(cube)
        .at(time, differentHour)
        .at(zipcode, "97201"));

// Read back the value that we wrote for the current hour, should be 5
Optional<LongOp> thisHourCount = cubeIo.get(new ReadBuilder(cube)
        .at(time, HourDayMonthBucketer.hours, now)
        .at(zipcode, "97201"));
Assert.assertTrue(thisHourCount.isPresent());
Assert.assertEquals(5L, thisHourCount.get().getLong());
        
// Read back the value we wrote for the other hour, should be 10
Optional<LongOp> differentHourCount = cubeIo.get(new ReadBuilder(cube)
        .at(time, HourDayMonthBucketer.hours, differentHour)
        .at(zipcode, "97201"));
Assert.assertTrue(differentHourCount.isPresent());
Assert.assertEquals(10L, differentHourCount.get().getLong());

// The total for today should be the sum of the two increments
Optional<LongOp> todayCount = cubeIo.get(new ReadBuilder(cube)
        .at(time, HourDayMonthBucketer.days, now)
        .at(zipcode, "97201"));
Assert.assertTrue(todayCount.isPresent());
Assert.assertEquals(15L, todayCount.get().getLong());

Quickstart

Add datacube to your Maven build. (TODO upload to a public repo)

Figure out your dimensions. These are the attributes of your incoming data points. Some examples of dimensions are time, latitude, and browser version. Create one Dimension object for each dimension. Use these dimensions to instantiate a data cube.

You can skip using an IdService for now. This is an optional optimization for dimensions that have long coordinates with low cardinality. For example, if you have a "country" dimension, the country name might be dozens of characters long, but there are only a few bytes of entropy. You could assign integers to countries and only use a few bytes to represent a country coordinate.
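The idea of replacing long, low-cardinality coordinates with small integers can be illustrated with a minimal sketch. The class CompactIdAssigner below is hypothetical and is not datacube's IdService interface; it only shows the mapping from long names to fixed two-byte ids, held in memory.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the idea behind an id service; this is not
// datacube's IdService interface.
final class CompactIdAssigner {
    private final Map<String, Short> ids = new HashMap<>();

    // Return the existing id for a value, or assign the next unused one.
    synchronized short idFor(String value) {
        Short existing = ids.get(value);
        if (existing != null) {
            return existing;
        }
        if (ids.size() > Short.MAX_VALUE) {
            throw new IllegalStateException("no ids left in two bytes");
        }
        short next = (short) ids.size();
        ids.put(value, next);
        return next;
    }

    // Encode the id as a fixed two-byte, big-endian coordinate.
    byte[] encode(String value) {
        return ByteBuffer.allocate(2).putShort(idFor(value)).array();
    }

    public static void main(String[] args) {
        CompactIdAssigner countries = new CompactIdAssigner();
        String name = "United Kingdom of Great Britain and Northern Ireland";
        System.out.println(name.length() + " characters as a raw coordinate");
        System.out.println(countries.encode(name).length + " bytes as an id"); // prints "2 bytes as an id"
    }
}
```

A real id service would also need to persist the mapping so that ids stay stable across restarts and across writers, which this sketch does not attempt.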

Create one Rollup object for each kind of counter you want to keep. For example, if you want to keep a counter of web hits by (time,browser), this would be one Rollup object.

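Create a DbHarness object that will handle writing to the database. Currently, HBaseDbHarness is the only harness for a persistent database; the example above uses MapDbHarness, which is backed by an in-memory map.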

Create a DataCubeIo object, passing your DataCube object and your DbHarness.

Insert data points into your cube by passing them to DataCubeIo.writeSync().

Read back your rollup values by calling DataCubeIo.get().

Building

The POM is configured to build with specific versions of HBase and Hadoop. If your versions differ from those in the POM, you can override them by passing -DhbaseVersion and -DhadoopVersion to Maven. For example:

$ mvn -DhbaseVersion=0.90.6 -DhadoopVersion=0.20.2 package

The build artifact jars each have a classifier of the form hbase${hbaseVersion}-hadoop${hadoopVersion}, so you can depend on them in another project by doing something like:

<dependency>
  <groupId>com.urbanairship</groupId>
  <artifactId>datacube</artifactId>
  <version>${datacube.version}</version>
  <classifier>hbase0.94.0-hadoop1.0.3</classifier>
</dependency>

The main build artifact jar (without a classifier) uses the default HBase and Hadoop versions, which may change between datacube releases.

You can pass -DhadoopVersion and -DhbaseVersion to Maven to choose which versions of Hadoop and HBase to depend on. Hadoop 2 is not yet supported since the artifact names are different.

To build against your own version of HBase or Hadoop, just add your repository to the POM and pass -DhbaseVersion or -DhadoopVersion to the datacube build.
