Realistic Hadoop Data Processing Examples

This code accompanies my blog post on MapReduce frameworks.

The code in this repository implements a solution to one business question (described below) in each of the major MapReduce frameworks.

Each implementation gets its own subdirectory with its own build and run instructions. Each framework also gets an accompanying test and an in-depth walkthrough of the implementation details.

The following implementations are complete:

Java MapReduce - walkthrough
Scoobi - walkthrough
Scalding - walkthrough
Cascading - walkthrough
Hive - walkthrough
Pig - walkthrough

The problem

The Data

We have two datasets: customers and transactions.

Customer Fields:

id (1)
email ([email protected])
language (EN)
location (US)

Transaction Fields:

transaction-id (1)
product-id (1)
user-id (1)
purchase-amount (19.99)
product-description (a rubber chicken)

These two datasets are stored in tab-delimited files somewhere on HDFS. The user-id field of a transaction refers to the id field of a customer.
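As a rough illustration of the record layout, here is a minimal Python sketch that parses one line of each file. It assumes the columns appear in the order listed above and that there is no header row; neither is guaranteed by this README. The names Customer, Transaction, parse_customer and parse_transaction are made up for this example and do not come from any of the implementations.

```python
from typing import NamedTuple


class Customer(NamedTuple):
    id: str
    email: str
    language: str
    location: str


class Transaction(NamedTuple):
    transaction_id: str
    product_id: str
    user_id: str
    purchase_amount: float
    product_description: str


def parse_customer(line: str) -> Customer:
    # Assumed column order: id, email, language, location
    cid, email, language, location = line.rstrip("\n").split("\t")
    return Customer(cid, email, language, location)


def parse_transaction(line: str) -> Transaction:
    # Assumed column order: transaction-id, product-id, user-id,
    # purchase-amount, product-description
    tid, pid, uid, amount, description = line.rstrip("\n").split("\t")
    return Transaction(tid, pid, uid, float(amount), description)
```

Ids are kept as strings here because they are only used as join keys, never for arithmetic.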

The Question

For each product, we want to know the number of locations in which that product was purchased.

That's it!
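To make the question concrete, here is a small in-memory Python sketch of the logic every implementation has to express: join transactions to customers on the user id, then count the distinct locations for each product. It is not taken from any of the framework implementations, and the function name locations_per_product is made up for this example. It assumes records are plain tuples in the field order listed above, and that a transaction whose user id matches no customer is skipped.

```python
from collections import defaultdict


def locations_per_product(customers, transactions):
    """Count distinct purchase locations for each product.

    customers:    iterable of (id, email, language, location)
    transactions: iterable of (transaction_id, product_id, user_id,
                               purchase_amount, product_description)
    """
    # "Join" side: look up a customer's location by customer id.
    location_by_user = {c[0]: c[3] for c in customers}

    # Collect the set of locations seen for each product.
    locations = defaultdict(set)
    for _tid, product_id, user_id, _amount, _description in transactions:
        location = location_by_user.get(user_id)
        if location is not None:  # assumption: drop unmatched transactions
            locations[product_id].add(location)

    return {product_id: len(locs) for product_id, locs in locations.items()}
```

On a cluster the customer lookup would not fit in a local dictionary in general, which is why the framework versions express the same step as a join.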

In the real world, we might have other questions, like the number of purchases per location for each product.
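That follow-up question differs only in the aggregation step. A minimal Python sketch, under the same assumptions as the field lists above (tuples in the listed order, unmatched transactions skipped), with the made-up name purchases_per_product_location:

```python
from collections import Counter


def purchases_per_product_location(customers, transactions):
    """Count purchases for each (product_id, location) pair.

    customers:    iterable of (id, email, language, location)
    transactions: iterable of (transaction_id, product_id, user_id,
                               purchase_amount, product_description)
    """
    location_by_user = {c[0]: c[3] for c in customers}

    counts = Counter()
    for _tid, product_id, user_id, _amount, _description in transactions:
        location = location_by_user.get(user_id)
        if location is not None:  # assumption: drop unmatched transactions
            counts[(product_id, location)] += 1
    return dict(counts)
```

Here each purchase is counted, whereas the original question counts each location at most once per product.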
