  • Stars: 482
  • Rank: 87,600 (Top 2%)
  • Language: Scala
  • Created over 12 years ago
  • Updated almost 2 years ago

Repository Details

A Scala productivity framework for Hadoop.

Welcome!

Hadoop MapReduce is awesome, but it seems a little bit crazy when you have to write a full MapReduce program just to count words. Wouldn't it be nicer if you could simply write what you want to do:

import com.nicta.scoobi.Scoobi._, Reduction._

val lines = fromTextFile("hdfs://in/...")

val counts = lines.mapFlatten(_.split(" "))
               .map(word => (word, 1))
               .groupByKey
               .combine(Sum.int)

counts.toTextFile("hdfs://out/...", overwrite=true).persist(ScoobiConfiguration())

This is what Scoobi is all about. Scoobi is a Scala library that focuses on making you more productive at building Hadoop applications. It stands on the functional programming shoulders of Scala and allows you to just write what you want rather than how to do it.

Scoobi is a library that leverages the Scala programming language to provide a programmer-friendly abstraction around Hadoop's MapReduce, facilitating rapid development of analytics and machine-learning algorithms.

Install

See the install instructions in the QuickStart section of the User Guide.

Features

  • Familiar APIs - the DList API is very similar to the standard Scala List API

  • Strong typing - the APIs are strongly typed so as to catch more errors at compile time, a major improvement over standard Hadoop MapReduce where type-based run-time errors often occur

  • Ability to parameterise with rich data types - unlike Hadoop MapReduce, which requires you to write a myriad of classes implementing the Writable interface, Scoobi allows DList objects to be parameterised by normal Scala types, including value types (e.g. Int, String, Double), tuple types (with arbitrary nesting) and case classes

  • Support for multiple types of I/O - currently built-in support for text, Sequence and Avro files with the ability to implement support for custom sources/sinks

  • Optimisation across library boundaries - the optimiser and execution engine will assemble Scoobi code spread across multiple software components so you still keep the benefits of modularity

  • It's Scala - being a Scala library, Scoobi applications still have access to those precious Java libraries plus all the functional programming and concise syntax that makes developing Hadoop applications very productive

  • Apache V2 licence - just like the rest of Hadoop
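
To make the "Familiar APIs" point concrete, here is a minimal local sketch of the word count from the top of this README, written against the standard Scala List API only. It is not Scoobi code and does not touch Hadoop: the object name LocalWordCount and the function wordCount are made up for this illustration, and the comments only indicate which DList operation from the README example each List operation is standing in for.

```scala
// Plain Scala collections only: no Hadoop or Scoobi on the classpath.
// LocalWordCount and wordCount are hypothetical names for this sketch.
object LocalWordCount {

  // In-memory analogue of the README pipeline. Each comment names the
  // DList operation from the README that the List operation stands in for.
  def wordCount(lines: List[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))          // README: mapFlatten(_.split(" "))
      .map(word => (word, 1))         // README: map(word => (word, 1))
      .groupBy(_._1)                  // README: groupByKey
      .map { case (word, pairs) =>    // README: combine(Sum.int)
        (word, pairs.map(_._2).sum)
      }

  def main(args: Array[String]): Unit = {
    val counts = wordCount(List("to be or not to be", "to do"))
    // Print one "word<TAB>count" line per word, sorted by word.
    counts.toList.sortBy(_._1).foreach { case (w, n) => println(s"$w\t$n") }
  }
}
```

The shape of the pipeline is the same in both versions; the difference is that the List version runs eagerly in one JVM, whereas the README example describes a computation that Scoobi runs as Hadoop jobs when persist is called.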

Getting Started

To get started, read the getting started steps and the section on distributed lists. The remaining sections in the User Guide provide further detail on various aspects of Scoobi's functionality.

The user mailing list is at http://groups.google.com/group/scoobi-users. Please use it for questions and comments!

More Repositories

1. rng (Scala, 116 stars): Pure-functional random value generation
2. cesium-vr (JavaScript, 80 stars): Plugin for Cesium web-based virtual globe software to support the Oculus VR headset
3. revrand (Python, 58 stars): A library of scalable Bayesian generalised linear models with fancy features
4. cesium-groundpush-plugin (JavaScript, 56 stars)
5. cplusplus-th (Haskell, 34 stars): C++ Foreign Import Generation
6. xsharpx (C#, 29 stars): XSharpX is a general library for functional programming using .NET languages.
7. stateline (C++, 28 stars): Distributed Markov Chain Monte Carlo
8. cesium-simple-photogrammetry (JavaScript, 25 stars): From photos of a real object to web based 3D visualisation in Cesium virtual globe.
9. SmartGridToolbox (24 stars): Smart Grid Simulation Library (C++14)
10. nicta-ner (Java, 16 stars): NICTA Named Entity Recogniser is a rule based Named Entity Recogniser which extracts named entities from text such as Organisation, Location and Person names. It is written in Java.
11. pyairports (Python, 14 stars): Python module for airport codes
12. MLSS (Python, 14 stars): Machine Learning Summer School
13. TacoPig (MATLAB, 13 stars)
14. t3as-pat-clas (HTML, 12 stars): Web services and associated functionality for doing Patent Classification Search and Lookups of the CPC, IPC, and USPC patent classification systems.
15. dora (Python, 11 stars): Nonparametric Active Sampling
16. obsidian (C++, 10 stars): Probabilistic multi-sensor geophysical inversions on clusters
17. fsdf-hackfest-cordova-leaflet (JavaScript, 10 stars): A Cordova/Leaflet template project
18. fsdf-hackfest (Python, 10 stars): Foundation Spatial Data Framework Hackfest
19. t3as-redact (JavaScript, 9 stars): PDF redaction: RESTful web service and HTML5 user interface
20. iris-reasoner (Java, 9 stars): Clone of iris-reasoner (http://iris-reasoner.org) from sourceforge
21. t3as-snomedct-service (Java, 8 stars): A library, web service, and demo web user interface to analyse blocks of clinical text and pick out all SNOMED CT concepts.
22. protobuf-native (Haskell, 7 stars): Protocol Buffers via C++
23. l4v (Isabelle, 7 stars): development version of seL4 proofs
24. serene-python-client (Python, 6 stars): Python client for the Serene Data Integration software
25. linearizedGP (Python, 6 stars): Gaussian processes with general nonlinear likelihoods using the unscented transform or Taylor series linearisation.
26. geodetic (Haskell, 6 stars): Geodetic calculations including Vincenty and Great Circle using a Latitude and Longitude pair.
27. trackfunction (Scala, 5 stars): A Scala library to track the argument and results of a function
28. serene (HTML, 5 stars): Serene Data Integration Platform
29. cesium-blockworld (JavaScript, 5 stars)
30. t3as-pdf (Scala, 4 stars): Extensions to itext to support PDF redaction
31. openboard (4 stars)
32. text1 (Haskell, 3 stars): Non-empty text
33. pod-detection (Ruby, 3 stars): POD-Detection is a non-obtrusive error detection tool for rolling upgrade.
34. docktimizer (Java, 3 stars)
35. seL4 (C, 2 stars): development version of the seL4 kernel
36. postmarkapp-client (Scala, 2 stars): Scala email sending client for Postmark HTTP API
37. scoobi-kiji (Scala, 2 stars): DataSources and DataSinks to read and write Kiji tables as Scoobi DLists
38. clouddb-replication (Java, 2 stars): A workload generator specifically for benchmarking database replication delay and others
39. fp-principles (Shell, 2 stars): Slides used for a screencast of Functional Programming Principles for Practitioners
40. Envirohack (1 star): Environmental geospatial hackfest run by the Office of Spatial Policy and NICTA
41. u-boot-sabre (C, 1 star): u-boot for the sabre lite board, with patches pre-applied
42. nationalmap.nicta.com.au (CSS, 1 star): Web page at nationalmap.nicta.com.au
43. stateline-cpp (C++, 1 star): C++ worker interface to Stateline
44. sjmp (Java, 1 star): Secure Java Multiple-Precision library
45. sk-config (Haskell, 1 star): separation kernel config tool for seL4
46. TerriaJS-app (1 star): A simple example application built on TerriaJS
47. etd-retreat-sep2013 (1 star): Coding exercises for ETD retreat September 2013
48. TrackAssist (Java, 1 star)
49. digit (1 star): A data-type representing digits 0-9 and other combinations
50. groundwater-viz-help (HTML, 1 star)