• Stars
    5
  • Rank 2,861,937 (Top 57 %)
  • Language
    Java
  • License
    Apache License 2.0
  • Created over 5 years ago
  • Updated about 2 years ago

Repository Details

Upserts And Incremental Processing on Big Data

More Repositories

1

kafka

A high-throughput, distributed, publish-subscribe messaging system
Java
58
star
2

flink

Scalable Batch and Stream Data Processing
Java
24
star
3

fbtftp

fbtftp is Facebook's implementation of a dynamic TFTP server framework
Python
15
star
4

beam

Unified programming model to create data processing pipelines for batch and streaming models
Python
9
star
5

caffe

Caffe: a fast open framework for deep learning
C++
7
star
6

mixer

Mixed Incremental Cross-Entropy REINFORCE ICLR 2016
Lua
7
star
7

fboss

Facebook Open Switching System Software for controlling network switches
C++
6
star
8

fasttext

Library for fast text representation and classification
HTML
6
star
9

felix

Project Calico core repository
Python
6
star
10

bistro

Bistro is a flexible distributed scheduler, a high-performance framework supporting multiple paradigms while retaining ease of configuration, management, and monitoring.
C++
6
star
11

heapster

Compute Resource Usage Analysis and Monitoring of Container Clusters
Go
5
star
12

helm

The Kubernetes Package Manager
Go
5
star
13

darkforestgo

DarkForest, the Facebook Go engine
C
5
star
14

charts

Curated applications for Kubernetes using Helm charts with integrated Deployment Manager templates
Shell
5
star
15

commai-env

A platform for developing AI systems as described in A Roadmap towards Machine Intelligence - http://arxiv.org/abs/1511.08130
Python
5
star
16

multipathnet

A Torch implementation of the object detection network from "A MultiPath Network for Object Detection" (https://arxiv.org/abs/1604.02135)
Lua
5
star
17

deepmask

Torch implementation of DeepMask and SharpMask
Lua
5
star
18

kubernetes

Production-Grade Container Scheduling and Management
Go
5
star
19

kubernetes-cluster-federation

Kubernetes cluster federation tutorial
Shell
5
star
20

drill

Schema-free SQL for Hadoop, NoSQL and Cloud Storage
Java
5
star
21

torch

A Scientific Computing Framework for Luajit
Jupyter Notebook
5
star
22

pysparnn

Approximate Nearest Neighbor Search for Sparse Data in Python
Python
4
star
23

avro

Apache Avro
Java
3
star
24

airflow

Apache Airflow
Python
3
star
25

horovod

Distributed training framework for TensorFlow, Keras, PyTorch, and MXNet.
Python
3
star
26

data-platform-ai

Data Platform for Large Scale Data Processing and AI & Machine Learning/Deep Learning
2
star
27

hive

Apache Hive
Java
2
star
28

presto

Distributed SQL query engine for big data https://prestodb.io
Java
2
star
29

protobuf

Protocol Buffers - Google's data interchange format
C++
2
star
30

sereal

Fast, compact, schema-less, binary serialization and deserialization oriented towards dynamic languages
C
2
star
31

kuryr

Container and Orchestration remote driver for OpenStack Neutron
Python
2
star
32

kafka-rest-node

Node.js client for the Kafka REST proxy
JavaScript
2
star
33

kafka-examples

Applications, templates and code examples for Apache Kafka
Java
2
star
34

spark

Apache Spark
Scala
2
star
35

druid

Apache Druid (Incubating) - Column oriented distributed data store ideal for powering interactive applications
Java
2
star
36

rokku

Rokku project. This project acts as a proxy on top of any S3 storage solution, providing services like authentication, authorisation, short-term tokens and lineage.
Scala
2
star
37

kafka-ansible

Ansible playbooks for Kafka
Shell
1
star
38

workspaces

Workspaces 2.0 demo
Python
1
star
39

docker-spark

Apache Spark docker image
Shell
1
star
40

message-backbone

Message queue backbone for event handling
1
star
41

docker-xserver

Docker Image with Xserver, OpenBLAS and correct user settings
Shell
1
star
42

calcite

Apache Calcite
Java
1
star
43

kafka-connect-jdbc

Kafka Connect connector for JDBC-compatible databases
Java
1
star
44

kafka-connect-storage-common

Shared software among connectors that target distributed filesystems and cloud storage
Java
1
star
45

marmaray

Generic Data Ingestion & Dispersal Library for Hadoop
Java
1
star
46

gitolly

Clone all of your Github repositories from the command line using Python
Python
1
star
47

parquet-mr

Apache Parquet
Java
1
star
48

kafka-python

Kafka Python client
C
1
star
49

docker-torch-jupyter

Docker image for deep learning with Torch and Jupyter
1
star
50

kafka-connect-elasticsearch

Kafka Connect Elasticsearch connector
Java
1
star
51

manifests

Deploy manifests for the single-container version of Netsil AOC
Shell
1
star
52

pinot

Apache Pinot - A realtime distributed OLAP datastore
Java
1
star
53

kafka-connect-hdfs

Kafka Connect HDFS connector
Java
1
star
54

baker

Orchestrate microservice-based process flows
Scala
1
star
55

docker-deeplearning

Information and scripts to run and develop Deep Learning Docker containers
Shell
1
star
56

caffe2

Caffe2 is a lightweight, modular, and scalable deep learning framework
Jupyter Notebook
1
star
57

clocks

Time, Clocks, and the Ordering of Events
Python
1
star
58

kafka-rest-utils

Utilities and a small framework for building REST services with Jersey, Jackson, and Jetty.
Java
1
star
59

kafka-schema-registry

Schema registry for Kafka
Java
1
star
60

kafka-rest

Confluent REST Proxy for Kafka
Java
1
star
61

pyhive

Python interface to Hive and Presto.
Python
1
star
62

torch-nn

Efficient, reusable RNNs and LSTMs for torch
Lua
1
star
63

kafka-go

Kafka Golang client
Go
1
star
64

utils

CLI, Scripts, etc.
Python
1
star
65

xgboost

Scalable, portable, and distributed Gradient Boosting (GBDT, GBRT or GBM) library, for Python, R, Java, Scala, C++ and more. Runs on single host, Hadoop, Spark, Flink and DataFlow
C++
1
star
66

peloton

Unified Resource Scheduler to co-schedule mixed types of workloads such as batch, stateless and stateful jobs in a single cluster for better resource utilization.
Go
1
star
67

petastorm

Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as Tensorflow, Pytorch, and PySpark and can be used from pure Python code.
Python
1
star
68

arx

ARX is a comprehensive open source data anonymization tool aiming to provide scalability and usability. It supports various anonymization techniques, methods for analyzing data quality and re-identification risks and it supports well-known privacy models, such as k-anonymity, l-diversity, t-closeness and differential privacy.
Java
1
star
69

kafka-connect-storage-cloud

Kafka Connect suite of connectors for Cloud storage (currently including Amazon S3)
Java
1
star