  • Stars: 1,041
  • Rank: 44,255 (Top 0.9%)
  • Language: Scala
  • License: Other
  • Created: about 9 years ago
  • Updated: almost 2 years ago


Repository Details

Project SnappyData - memory optimized analytics database, based on Apache Spark™ and Apache Geode™. Stream, Transact, Analyze, Predict in one cluster
This repository is provided for legacy users and informational purposes only. It may contain security vulnerabilities in the code itself or its dependencies. TIBCO provides no updates, including security updates, to this code. Consistent with the terms of the Apache License 2.0 that apply to the TIBCO code in this repository, the code is provided on an "as is" basis, without any warranties or conditions of any kind and in no event and under no legal theory shall TIBCO be liable to you for damages arising as a result of the use or inability to use the code.

Introduction

SnappyData (aka TIBCO ComputeDB) is a distributed, in-memory optimized analytics database. SnappyData delivers high throughput, low latency, and high concurrency for unified analytics workloads. By fusing an in-memory hybrid database inside Apache Spark, it provides analytic query processing, mutability/transactions, access to virtually all big data sources, and stream processing, all in one unified cluster.

One common use case for SnappyData is to provide analytics at interactive speeds over large volumes of data with minimal or no pre-processing of the dataset. For instance, you often do not need to pre-aggregate/reduce or generate cubes over your large data sets for ad-hoc visual analytics. This is made possible by smartly managing data in-memory, dynamically generating code using vectorization optimizations, and maximizing the potential of modern multi-core CPUs. SnappyData enables complex processing on large data sets in sub-second timeframes.

SnappyData Positioning

!!!Note SnappyData is not another Enterprise Data Warehouse (EDW) platform, but rather a high performance computational and caching cluster that augments traditional EDWs and data lakes.

Important Capabilities

  • Easily discover and catalog big data sets
You can connect to and discover datasets in SQL DBs, Hadoop, NoSQL stores, file systems, or even cloud data stores such as S3 using SQL, infer schemas automatically, and register them in a secure catalog. A wide variety of data formats are supported out of the box, such as JSON, CSV, text, objects, Parquet, ORC, SQL, XML, and more.
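
    As a sketch (assuming a running SnappyData cluster; the S3 path and table name here are hypothetical), registering a CSV dataset as an external table with an inferred schema might look like this:

    ```scala
    import org.apache.spark.sql.{SnappySession, SparkSession}

    // Assumes a running SnappyData cluster; path and names are hypothetical.
    val spark = SparkSession.builder().appName("catalog-sketch").getOrCreate()
    val snappy = new SnappySession(spark.sparkContext)

    // Register an external table over CSV data; the schema is inferred.
    snappy.sql("""
      CREATE EXTERNAL TABLE IF NOT EXISTS raw_events
      USING csv
      OPTIONS (path 's3a://my-bucket/events/', header 'true', inferSchema 'true')
    """)
    ```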

  • Rich connectivity
    SnappyData is built with Apache Spark inside. Therefore, any data store that has an Apache Spark connector can be accessed using SQL or the Apache Spark RDD/Dataset API. Virtually all modern data stores have an Apache Spark connector; see Apache Spark Packages. You can also dynamically deploy connectors to a running SnappyData cluster.

  • Virtual or in-memory data
    You can decide which datasets need to be provisioned into distributed memory or left at the source. When the data is left at source, after being modeled as a virtual/external tables, the analytic query processing is parallelized, and the query fragments are pushed down wherever possible and executed at high speed. When speed is essential, applications can selectively copy the external data into memory using a single SQL command.
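
    Copying an external table into distributed memory can then be a single SQL statement, sketched here with a hypothetical external table name:

    ```scala
    import org.apache.spark.sql.{SnappySession, SparkSession}

    val snappy = new SnappySession(
      SparkSession.builder().appName("load-sketch").getOrCreate().sparkContext)

    // Hypothetical external table raw_events (registered earlier over CSV/S3).
    // One statement copies it into an in-memory column table for fast access.
    snappy.sql("CREATE TABLE events USING column AS SELECT * FROM raw_events")
    ```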

  • In-memory Columnar + Row store
    You can choose in-memory data to be stored in any of the following forms:

    • Columnar: Compressed and designed for scanning/aggregating large data sets.
    • Row store: Provides extremely fast key access and highly selective access. The columnar store is automatically indexed using a skipping index. Applications can explicitly add indexes for the row store.
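
    A sketch of both forms, with hypothetical table names and options (the `buckets` and `partition_by` options follow SnappyData's table DDL; verify against the documentation):

    ```scala
    import org.apache.spark.sql.{SnappySession, SparkSession}

    val snappy = new SnappySession(
      SparkSession.builder().appName("tables-sketch").getOrCreate().sparkContext)

    // Column table: compressed, scan/aggregate friendly.
    snappy.sql("""
      CREATE TABLE trades (symbol STRING, qty INT, price DOUBLE)
      USING column OPTIONS (buckets '32')
    """)

    // Row table: fast keyed/selective access; supports explicit indexes.
    snappy.sql("""
      CREATE TABLE accounts (id INT PRIMARY KEY, name STRING)
      USING row OPTIONS (partition_by 'id')
    """)
    snappy.sql("CREATE INDEX accounts_name_idx ON accounts (name)")
    ```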
  • High performance
    When data is loaded, the engine parallelizes all accesses by carefully taking into account the available distributed cores, the available memory, and whether the source data can be partitioned, to deliver extremely high-speed loading. Therefore, unlike a traditional warehouse, you can bring up SnappyData whenever required, load, process, and tear it down. Query processing uses code generation and vectorization techniques to shift processing to modern multi-core processors and their L1/L2/L3 caches to the extent possible.

  • Flexible rich data transformations
    External data sets, when discovered automatically through schema inference, will have the schema of the source. Users can cleanse, blend, and reshape data using a SQL function library (Apache Spark SQL+) or submit Apache Spark jobs with custom logic. The entire rich Apache Spark API is at your disposal. This logic can be written in SQL, Java, Scala, or even Python.
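
    A sketch of such a transformation with the regular Spark API (the table and column names are hypothetical):

    ```scala
    import org.apache.spark.sql.{SnappySession, SparkSession}
    import org.apache.spark.sql.functions._

    val snappy = new SnappySession(
      SparkSession.builder().appName("transform-sketch").getOrCreate().sparkContext)

    // Cleanse and reshape a hypothetical table with ordinary Spark operators.
    val cleaned = snappy.table("raw_events")
      .filter(col("price") > 0)
      .withColumn("day", to_date(col("ts")))
      .groupBy("day", "symbol")
      .agg(avg("price").as("avg_price"))

    // Persist the curated result as an in-memory column table.
    cleaned.write.format("column").saveAsTable("daily_avg_price")
    ```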

  • Prepares data for data science
    Through the Apache Spark API for statistics and machine learning, raw or curated datasets can easily be prepared for machine learning. You can understand statistical characteristics such as correlation, independence of different variables, and so on. You can generate distributed feature vectors from your data using processes such as one-hot encoding, binarization, and a range of functions built into the Apache Spark ML library. These features can be stored back into column tables and shared across a group of users, with security, avoiding the slow and error-prone dumping of copies to disk.

  • Stream ingestion and liveness
    While query service engines are common today, most resort to periodically refreshing data sets from the source because the managed data cannot be mutated (for example, query engines such as Presto over HDFS formats like Parquet). Moreover, even when updates can be applied, pre-processing and re-shaping of the data is not necessarily simple. In SnappyData, operational systems can feed data updates through Kafka into SnappyData. The incoming data can be CDC (change-data-capture) events (inserts, updates, or deletes) and can be ingested into in-memory tables with ease, consistency, and exactly-once semantics. The application can apply custom logic to perform sophisticated transformations and get the data ready for analytics. This incremental, continuous process is far more efficient than batch refreshes. Refer to Stream Processing with SnappyData.
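
    A structured-streaming sketch of such an ingest path (broker, topic, and table names are hypothetical; the `snappysink` format and its options are assumptions based on SnappyData's streaming sink and should be checked against the documentation):

    ```scala
    import org.apache.spark.sql.{SnappySession, SparkSession}

    val snappy = new SnappySession(
      SparkSession.builder().appName("ingest-sketch").getOrCreate().sparkContext)

    // Read CDC events from Kafka (requires the Spark Kafka connector on the classpath).
    val events = snappy.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "orders-cdc")
      .load()
      .selectExpr("CAST(value AS STRING) AS payload")

    // Write continuously into an in-memory table via SnappyData's sink.
    val query = events.writeStream
      .format("snappysink")
      .option("tableName", "orders")
      .option("checkpointLocation", "/tmp/orders-ckpt")
      .start()
    ```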

  • Approximate Query Processing(AQP)
    When dealing with huge data sets, for example, IoT sensor streaming time-series data, it may not be possible to provision the data in-memory, and if left at the source (say Hadoop or S3) your analytic query processing can take too long. In SnappyData, you can create one or more stratified data samples on the full data set. The query engine automatically uses these samples for aggregation queries, and a nearly accurate answer is returned to clients. This can be immensely valuable when visualizing a trend or plotting a graph or bar chart. Refer to AQP.
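
    A sketch of a stratified sample over a hypothetical base table; the `qcs` (query column set) and `fraction` options and the `WITH ERROR` clause follow SnappyData's AQP DDL, but verify the exact syntax in the AQP documentation:

    ```scala
    import org.apache.spark.sql.{SnappySession, SparkSession}

    val snappy = new SnappySession(
      SparkSession.builder().appName("aqp-sketch").getOrCreate().sparkContext)

    // Create a 1% stratified sample, stratified on deviceId.
    snappy.sql("""
      CREATE SAMPLE TABLE sensor_sample ON sensor_data
      OPTIONS (qcs 'deviceId', fraction '0.01')
      AS (SELECT * FROM sensor_data)
    """)

    // Aggregations can opt into approximate answers with an error bound.
    snappy.sql(
      "SELECT deviceId, avg(reading) FROM sensor_data GROUP BY deviceId WITH ERROR 0.1")
    ```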

  • Access from anywhere
    You can use JDBC, ODBC, REST, or any of the Apache Spark APIs. The product is fully compatible with Apache Spark 2.1.1. SnappyData natively supports modern visualization tools such as TIBCO Spotfire, Tableau, and Qlikview.
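
    A minimal JDBC sketch (the locator hostname and table are hypothetical; 1527 is SnappyData's default client port, and the client driver ships in the `snappydata-jdbc` artifact):

    ```scala
    import java.sql.DriverManager

    // Connect to a hypothetical locator; driver: io.snappydata.jdbc.ClientDriver.
    val conn = DriverManager.getConnection("jdbc:snappydata://locator-host:1527/")
    val stmt = conn.createStatement()
    val rs = stmt.executeQuery("SELECT count(*) FROM events")
    while (rs.next()) println(rs.getLong(1))
    conn.close()
    ```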

Downloading and Installing SnappyData

You can download and install the latest version of SnappyData from GitHub. Refer to the documentation for installation steps.

Getting Started

Multiple options are provided for getting started with SnappyData. The easiest way to get going is on your laptop. You can also use any of the following options:

  • On-premise clusters

  • AWS

  • Docker

  • Kubernetes

You can find more information on options for running SnappyData here.

Quick Test to Measure Performance of SnappyData vs Apache Spark

If you are already using Apache Spark, you can experience up to 20x speedup for your query performance with SnappyData. Try this test using the Spark Shell.

Documentation

To understand SnappyData and its features refer to the documentation.

Other Relevant content

  • Paper on SnappyData at the Conference on Innovative Data Systems Research (CIDR): information on key concepts and motivating problems.
  • Another early paper that focuses on overall architecture, use cases, and benchmarks (ACM SIGMOD 2016).
  • TPC-H benchmark comparing Apache Spark with SnappyData.
  • Check out the SnappyData blog for developer content.
  • TIBCO community page for the latest info.

Community Support

We monitor the following channels for comments/questions:

Link with SnappyData Distribution

Using Maven Dependency

SnappyData artifacts are hosted in Maven Central. You can add a Maven dependency with the following coordinates:

groupId: io.snappydata
artifactId: snappydata-cluster_2.11
version: 1.3.1
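
In a pom.xml, these coordinates correspond to:

```xml
<dependency>
  <groupId>io.snappydata</groupId>
  <artifactId>snappydata-cluster_2.11</artifactId>
  <version>1.3.1</version>
</dependency>
```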

Also add the Cloudera repository to the set of Maven repositories to be searched:

  <repositories>
    <repository>
      <id>cloudera-repo</id>
      <name>cloudera repo</name>
      <url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
    </repository>
    ...
  </repositories>

Using Gradle Dependency

If you are using Gradle, add this to your build.gradle for core SnappyData artifacts:

dependencies {
  implementation 'io.snappydata:snappydata-core_2.11:1.3.1'
  ...
}

For additions related to SnappyData cluster, use:

dependencies {
  implementation 'io.snappydata:snappydata-cluster_2.11:1.3.1'
  ...
}

Also add the Cloudera repository to the set of Maven repositories to be searched:

repositories {
  mavenCentral()
  maven { url 'https://repository.cloudera.com/artifactory/cloudera-repos' }
  ...
}

Using SBT Dependency

If you are using SBT, add this line to your build.sbt for core SnappyData artifacts:

libraryDependencies += "io.snappydata" % "snappydata-core_2.11" % "1.3.1"

For additions related to SnappyData cluster, use:

libraryDependencies += "io.snappydata" % "snappydata-cluster_2.11" % "1.3.1"

Also add the Cloudera repository to the set of Maven repositories to be searched:

resolvers += "Cloudera Repo" at "https://repository.cloudera.com/artifactory/cloudera-repos"

You can find more specific SnappyData artifacts here.

!!!Note If your project fails when resolving the above dependency (that is, it fails to download javax.ws.rs#javax.ws.rs-api;2.1), it may be due to an issue with its pom file.
As a workaround, you can add the code below to your build.sbt:

val workaround = {
  sys.props += "packaging.type" -> "jar"
  ()
}

For more details, refer to sbt/sbt#3618.

Building from Source

If you would like to build SnappyData from source, refer to the documentation on building from source.

How is SnappyData Different from Apache Spark?

Apache Spark is a general purpose parallel computational engine for analytics at scale. At its core, it has a batch design center and is capable of working with disparate data sources. While this provides rich unified access to data, it can also be quite inefficient and expensive. Analytic processing requires massive data sets to be repeatedly copied and data to be reformatted to suit Apache Spark. In many cases, it ultimately fails to deliver the promise of interactive analytic performance. For instance, each time an aggregation is run on a large Cassandra table, it necessitates streaming the entire table into Apache Spark to do the aggregation. Caching within Apache Spark is immutable and results in stale insights.

The SnappyData Approach

SnappyData Architecture (diagram)

SnappyData takes a different approach. It fuses a low-latency, highly available in-memory transactional database (Pivotal GemFire/Apache Geode) into Apache Spark, with shared memory management and optimizations. Data can be managed in columnar form, similar to Apache Spark caching, or in a row-oriented manner commonly used in popular relational databases like Postgres. But many query engine operators are significantly more optimized through better vectorization, code generation, and indexing.
The net effect is an order-of-magnitude performance improvement compared to native Apache Spark caching, and more than two orders of magnitude better performance when Apache Spark is used in conjunction with external data sources. Apache Spark is turned into an in-memory operational database capable of transactions, point reads and writes, working with streams, and running analytic SQL queries, without losing the computational richness of Apache Spark.

Streaming Example - Ad Analytics

Here is a stream + transactions + analytics use case illustrating both the SQL and the Apache Spark programming approaches in SnappyData: the Ad Analytics code example. A screencast showcases many useful features of SnappyData. The example also includes a benchmark comparing SnappyData to a hybrid in-memory database and Cassandra.

Contributing to SnappyData

If you are interested in contributing, please visit the community page for ways in which you can help.
