  • Stars: 492
  • Rank: 89,476 (Top 2 %)
  • Language: Scala
  • License: Apache License 2.0
  • Created over 9 years ago
  • Updated over 1 year ago

Repository Details

Stream Data Mining Library for Spark Streaming

streamDM for Spark Streaming

streamDM is new open source software for mining big data streams using Spark Streaming, started at Huawei Noah's Ark Lab. streamDM is licensed under the Apache License 2.0.

Big Data Stream Learning

Big Data stream learning is more challenging than batch or offline learning, since the data may not keep the same distribution over the lifetime of the stream. Moreover, each example arriving in a stream can be processed only once or must be summarized with a small memory footprint, so the learning algorithms must be very efficient.
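To make the single-pass, small-memory constraint concrete, here is a minimal sketch in plain Scala. It is not taken from streamDM: the names RunningStats and RunningStatsDemo are hypothetical, and Welford's algorithm is used only as a simple stand-in for a stream summary. Each value is read once and only three numbers are kept, however long the stream is.

```scala
// Running mean and variance over a stream using Welford's algorithm.
// Hypothetical illustration; this is not part of the streamDM API.
final case class RunningStats(count: Long = 0L, mean: Double = 0.0, m2: Double = 0.0) {

  // Fold one new observation into the summary without storing it.
  def update(x: Double): RunningStats = {
    val n = count + 1
    val delta = x - mean
    val newMean = mean + delta / n
    RunningStats(n, newMean, m2 + delta * (x - newMean))
  }

  // Sample variance; defined as 0.0 until at least two values were seen.
  def variance: Double = if (count > 1) m2 / (count - 1) else 0.0
}

object RunningStatsDemo {
  // Consume the stream exactly once, keeping constant-size state.
  def summarize(stream: Iterator[Double]): RunningStats =
    stream.foldLeft(RunningStats())((stats, x) => stats.update(x))
}
```

Calling RunningStatsDemo.summarize on an iterator returns the count, mean and variance of everything seen so far. Because the state is immutable and tiny, the same pattern could be applied per batch or per partition, although merging partial summaries would need an extra combine step that is not shown here.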

Spark Streaming

Spark Streaming is an extension of the core Spark API that enables stream processing from a variety of sources. Spark is an extensible and programmable framework for massively distributed processing of datasets, which it represents as Resilient Distributed Datasets (RDDs). Spark Streaming receives input data streams and divides the data into batches, which are then processed by the Spark engine to generate the results.

Spark Streaming data is organized into DStreams, each represented internally as a sequence of RDDs.
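The batching idea can be modelled in a few lines of plain Scala. This sketch deliberately uses no Spark classes, so it is only an analogy: MicroBatchSketch, toBatches and batchSums are hypothetical names, each Seq stands in for one RDD, and batches are cut by element count, whereas Spark Streaming cuts them by time interval.

```scala
// A plain-Scala model of micro-batching; no Spark dependency is used.
// Hypothetical illustration of the idea, not the Spark Streaming API.
object MicroBatchSketch {

  // Split an incoming stream into fixed-size batches. A count is used here
  // to keep the sketch simple; Spark Streaming batches by time interval.
  def toBatches[A](stream: Iterator[A], batchSize: Int): Iterator[Seq[A]] =
    stream.grouped(batchSize)

  // Process every batch independently, producing one result per batch.
  def batchSums(stream: Iterator[Int], batchSize: Int): List[Int] =
    toBatches(stream, batchSize).map(_.sum).toList
}
```

For example, MicroBatchSketch.batchSums(Iterator(1, 2, 3, 4, 5, 6, 7), 3) processes three batches, the last of which is shorter because the stream ran out.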

Included Methods

In this current release of StreamDM v0.2, we have implemented:

We have also implemented the following data generators:

  • HyperplaneGenerator
  • RandomTreeGenerator
  • RandomRBFGenerator
  • RandomRBFEventsGenerator

We have also implemented SampleDataWriter, which can call data generators to create sample data for simulation or testing.

In the next release of streamDM, we are going to add:

  • Classification: Random Forests
  • Multi-label: Hoeffding Tree ML, Random Forests ML
  • Frequent Itemset Miner: IncMine

For future work, we are considering:

  • Regression: Hoeffding Regression Tree, Bagging, Random Forests
  • Clustering: Clustree, DenStream
  • Frequent Itemset Miner: IncSecMine

Going Further

For a quick introduction to running StreamDM, refer to the Getting Started document. The StreamDM Programming Guide presents a detailed view of StreamDM. The full API documentation can be consulted here.

Environment

  • Spark 2.3.2
  • Scala 2.11
  • SBT 0.13
  • Java 8+

Mailing lists

User support and questions mailing list:

[email protected]

Development related discussions:

[email protected]
