
🐘 Source code for assignments of Udacity course "Introduction to Hadoop and MapReduce"

Introduction to Hadoop and MapReduce


Introduction

This repository contains source code for the assignments of Udacity's course, Introduction to Hadoop and MapReduce, which launched on 15th November, 2013.
It is a short course created by Cloudera in association with Udacity. The instructors are Sarah Sproehnle and Ian Wrigley, both from Cloudera, and Gundega Dekena, a Course Developer at Udacity.

The course does not mandate any programming language for writing Hadoop MapReduce jobs, but the instructors mainly wrote and taught the jobs in Python [i.e. using the Hadoop Streaming approach for running jobs].

I have developed Hadoop MapReduce code for the 2 problem statements [3 questions each] in 2 programming languages: Python and Java.

Instructions for Virtual Machine download and setup

Please refer to the instructions document provided by the Course Instructors for details on the Hadoop Virtual Machine [VM henceforth] setup required for running these examples.
As mentioned in that document, a VM image with Hadoop installed and preconfigured can be downloaded from the Udacity CDN.

Please be forewarned: the compressed VM archive is 1.7 GB. It also does not uncompress with either 7-Zip or the default Windows Zip utility. On Windows, you might have to use WinRAR, WinZip, or even Cygwin's unzip to extract it. On other operating systems, the unzip command will probably work just fine. The uncompressed size of the VM is 4.2 GB.

The credentials to log in to this Virtual Machine are training / training. You will not need root access for any of the assignments of this course, but in case you do, the password for root is training.

Please ensure that you configure the VM with at least 1.5 GB of RAM in VMware Player; it might run much better with 2 GB. I used VMware Player v5.0.2; the latest version as of this writing [i.e. 28th November, 2013] is v6.0.1.

Data

Input Files

Input files for the problem statements ProblemStatement#1 and ProblemStatement#2 were originally uploaded to GitHub as well.

Update at 11/27/2013 10:00:26 PM IST: I had to remove these input files from the repo because the GitHub Windows client was not able to sync the repo with these compressed archives in it [or rather, it got badly stuck on illegitimate characters].

These compressed input archives can be downloaded from Udacity servers. Please check here for the input file for Problem Statement 1 and here for Problem Statement 2.
These links are also mentioned in the instructions document provided by the Udacity Course Instructors.

Output Files

Output files for the problem statements ProblemStatement#1 and ProblemStatement#2 have also been uploaded to this GitHub repo for quick reference and validation.
Each output file is the Hadoop MR job output obtained by running the job for the specific question.

Problem Statement 1

Execution steps are also documented for running the following in either Python or Java.

Question#1

Instead of breaking the sales down by store, give us a sales breakdown by product category across all of our stores.

  1. What is the value of total sales for the following categories?
    • Toys
    • Consumer Electronics
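
The idea behind this question can be sketched in Python as a map step that emits (category, cost) pairs and a reduce step that sums the costs per category. This is only a minimal illustration, not the repository's P1Q1_Mapper.py or P1Q1_Reducer.py. It assumes each input line holds six tab-separated fields (date, time, store, category, cost, payment method); the function names and the sample records are made up for the example.

```python
from itertools import groupby


def map_category_cost(lines):
    """Yield (category, cost) for every well-formed purchase record."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 6:  # assumed layout: date, time, store, category, cost, payment
            continue
        category, cost = fields[3], fields[4]
        try:
            yield category, float(cost)
        except ValueError:  # skip records whose cost is not a number
            continue


def reduce_sum(sorted_pairs):
    """Sum the costs per category; pairs must already be sorted by category."""
    for category, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        yield category, sum(cost for _, cost in group)


if __name__ == "__main__":
    # Made-up sample records, used only to exercise the functions locally.
    sample = [
        "2012-01-01\t09:00\tSan Jose\tToys\t10.00\tVisa",
        "2012-01-01\t09:01\tReno\tConsumer Electronics\t250.00\tCash",
        "2012-01-01\t09:02\tReno\tToys\t5.50\tCash",
    ]
    # sorted() stands in for the shuffle-and-sort phase between map and reduce.
    for category, total in reduce_sum(sorted(map_category_cost(sample))):
        print("%s\t%.2f" % (category, total))
```

Under Hadoop Streaming the two functions would normally live in separate mapper and reducer scripts that read from standard input and write tab-separated lines to standard output.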

Code

Java variant

P1Q1.java

Python variant

P1Q1_Mapper.py and P1Q1_Reducer.py

Solution

Please check pur_p1q1.tsv for the output of this problem statement.

Execution Log files

Please check pur_p1q1.txt and pur_p1q1.txt for command line execution log files of Java and Python respectively.

Question#2

Find the monetary value for the highest individual sale for each separate store.

  1. What are the values for the following stores?
    • Reno
    • Toledo
    • Chandler
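
Compared with a per-key sum, the only change needed here is that the reduce step keeps the maximum instead of a running total. The sketch below illustrates that under the same assumption about the input (six tab-separated fields: date, time, store, category, cost, payment method). It is not the repository's P1Q2 code, and the function names and sample records are hypothetical.

```python
from itertools import groupby


def map_store_cost(lines):
    """Yield (store, cost) for every well-formed purchase record."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 6:  # assumed layout: date, time, store, category, cost, payment
            continue
        store, cost = fields[2], fields[4]
        try:
            yield store, float(cost)
        except ValueError:  # skip records whose cost is not a number
            continue


def reduce_max(sorted_pairs):
    """Keep the highest single sale per store; pairs must be sorted by store."""
    for store, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        yield store, max(cost for _, cost in group)


if __name__ == "__main__":
    # Made-up sample records, used only to exercise the functions locally.
    sample = [
        "2012-01-01\t09:00\tReno\tToys\t10.00\tVisa",
        "2012-01-01\t09:01\tReno\tBooks\t88.50\tCash",
        "2012-01-01\t09:02\tToledo\tToys\t3.25\tCash",
    ]
    for store, highest in reduce_max(sorted(map_store_cost(sample))):
        print("%s\t%.2f" % (store, highest))
```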

Code

Java variant

P1Q2.java

Python variant

P1Q2_Mapper.py and P1Q2_Reducer.py

Solution

Please check pur_p1q2.tsv for the output of this problem statement.

Execution Log files

Please check pur_p1q2.txt and pur_p1q2.txt for command line execution log files of Java and Python respectively.

Question#3

Find the total sales value across all the stores, and the total number of sales. Assume there is only one reducer.

  1. Find
    • Total sales value across all the stores
    • Total number of sales
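
Because a single reducer sees every record, one way to answer this is for the map step to emit only the cost of each sale and for the reduce step to accumulate a count and a total. The sketch below shows that idea; it is an illustration rather than the repository's P1Q3 code. It assumes six tab-separated input fields (date, time, store, category, cost, payment method), and its function names and sample records are invented for the example.

```python
def map_costs(lines):
    """Yield the cost of every well-formed purchase record."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 6:  # assumed layout: date, time, store, category, cost, payment
            continue
        try:
            yield float(fields[4])
        except ValueError:  # skip records whose cost is not a number
            continue


def reduce_totals(costs):
    """Return (number of sales, total sales value) over all costs."""
    count = 0
    total = 0.0
    for cost in costs:
        count += 1
        total += cost
    return count, total


if __name__ == "__main__":
    # Made-up sample records, used only to exercise the functions locally.
    sample = [
        "2012-01-01\t09:00\tReno\tToys\t10.00\tVisa",
        "2012-01-01\t09:01\tToledo\tBooks\t5.50\tCash",
    ]
    count, total = reduce_totals(map_costs(sample))
    print("%d\t%.2f" % (count, total))
```

With more than one reducer, each reducer would produce only a partial count and total, which is why the question fixes the number of reducers at one.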

Code

Java variant

P1Q3.java

Python variant

P1Q3_Mapper.py and P1Q3_Reducer.py

Solution

Please check pur_p1q3.tsv for the output of this problem statement.

Execution Log files

Please check pur_p1q3.txt and pur_p1q3.txt for command line execution log files of Java and Python respectively.

Problem Statement 2

Execution steps are also documented for running the following in either Python or Java.

Question#1

Write a MapReduce program which will display the number of hits for each different file on the Web site.

  1. Find
    • How many hits were made to the page: /assets/js/the-associates.js?
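
A hit count per file can be sketched as a map step that extracts the requested path from each log line and emits (path, 1), followed by a reduce step that sums the ones. The example below assumes the access log is in Common Log Format (client, identity, user, timestamp, quoted request, status, size). The regular expression, function names and sample lines are assumptions for illustration and are not taken from the repository's P2Q1 code.

```python
import re
from itertools import groupby

# Assumed Common Log Format, e.g.
# 10.0.0.1 - - [15/Jul/2009:14:58:59 -0700] "GET /index.html HTTP/1.1" 200 99
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "\S+ (\S+)[^"]*" \d{3} \S+')


def map_page_hits(lines):
    """Yield (path, 1) for every log line that matches the assumed format."""
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            yield match.group(2), 1


def reduce_hits(sorted_pairs):
    """Sum the hit counts per path; pairs must already be sorted by path."""
    for path, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        yield path, sum(count for _, count in group)


if __name__ == "__main__":
    # Made-up sample log lines, used only to exercise the functions locally.
    sample = [
        '10.0.0.1 - - [15/Jul/2009:14:58:59 -0700] "GET /assets/js/the-associates.js HTTP/1.1" 200 1234',
        '10.0.0.2 - - [15/Jul/2009:15:00:00 -0700] "GET /index.html HTTP/1.1" 200 99',
    ]
    for path, hits in reduce_hits(sorted(map_page_hits(sample))):
        print("%s\t%d" % (path, hits))
```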

Code

Java variant

P2Q1.java

Python variant

P2Q1_Mapper.py and P2Q1_Reducer.py

Solution

Please check acc_p2q1.tsv for the output of this problem statement.

Execution Log files

Please check acc_p2q1.txt and acc_p2q1.txt for command line execution log files of Java and Python respectively.

Question#2

Write a MapReduce program which determines the number of hits to the site made by each different IP Address.

  1. Find
    • How many hits were made by the IP address: 10.99.99.186?
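
Counting hits per IP address needs no request parsing if the client address is the first space-separated token of each log line, which is the assumption made here. The sketch below also shows the line-oriented reducer pattern commonly used with Hadoop Streaming: the reducer receives sorted "key TAB value" lines and flushes a total whenever the key changes. Function names and sample lines are hypothetical, and this is not the repository's P2Q2 code.

```python
def map_ip(lines):
    """Emit one 'ip<TAB>1' line for every non-empty log line."""
    for line in lines:
        ip = line.split(" ", 1)[0].strip()  # assumed: client address comes first
        if ip:
            yield "%s\t1" % ip


def reduce_counts(sorted_lines):
    """Sum counts per key from sorted 'key<TAB>count' lines."""
    current_key = None
    current_total = 0
    for line in sorted_lines:
        key, _, value = line.rstrip("\n").partition("\t")
        if not value.isdigit():  # ignore lines without a numeric count
            continue
        if key != current_key:
            if current_key is not None:
                yield current_key, current_total  # key changed: flush the total
            current_key = key
            current_total = 0
        current_total += int(value)
    if current_key is not None:
        yield current_key, current_total  # flush the final key


if __name__ == "__main__":
    # Made-up sample log lines, used only to exercise the functions locally.
    sample = [
        '10.99.99.186 - - [15/Jul/2009:14:58:59 -0700] "GET / HTTP/1.1" 200 10',
        '10.0.0.1 - - [15/Jul/2009:15:00:00 -0700] "GET / HTTP/1.1" 200 10',
        '10.99.99.186 - - [15/Jul/2009:15:01:00 -0700] "GET /a.js HTTP/1.1" 200 20',
    ]
    for ip, hits in reduce_counts(sorted(map_ip(sample))):
        print("%s\t%d" % (ip, hits))
```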

Code

Java variant

P2Q2.java

Python variant

P2Q2_Mapper.py and P2Q2_Reducer.py

Solution

Please check acc_p2q2.tsv for the output of this problem statement.

Execution Log files

Please check acc_p2q2.txt and acc_p2q2.txt for command line execution log files of Java and Python respectively.

Question#3

Find the most popular file on the Web site. In other words, the file which had the most hits. Your Reducer should just write out the name of the file and number of hits into HDFS.

  1. Find
    • Full path to the most popular file?
    • Number of hits to that file?
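
One way to find the most popular file is to have the reducer count hits per path as usual but remember only the best path seen so far, emitting a single record at the end. The sketch below assumes a single reducer and a log whose request appears in double quotes as "METHOD path PROTOCOL". The regular expression, function names and sample lines are assumptions for illustration, not the repository's P2Q3 code.

```python
import re
from itertools import groupby

# Assumed request field, e.g. "GET /index.html HTTP/1.1"
REQUEST = re.compile(r'"\S+ (\S+)[^"]*"')


def map_paths(lines):
    """Yield the requested path of every log line that has a quoted request."""
    for line in lines:
        match = REQUEST.search(line)
        if match:
            yield match.group(1)


def most_popular(sorted_paths):
    """Return (path, hits) for the most requested path, or (None, 0) if empty."""
    best_path, best_hits = None, 0
    for path, group in groupby(sorted_paths):
        hits = sum(1 for _ in group)
        if hits > best_hits:  # on a tie, the path that sorts first is kept
            best_path, best_hits = path, hits
    return best_path, best_hits


if __name__ == "__main__":
    # Made-up sample log lines, used only to exercise the functions locally.
    sample = [
        '10.0.0.1 - - [15/Jul/2009:14:58:59 -0700] "GET /index.html HTTP/1.1" 200 99',
        '10.0.0.2 - - [15/Jul/2009:15:00:00 -0700] "GET /a.js HTTP/1.1" 200 20',
        '10.0.0.3 - - [15/Jul/2009:15:01:00 -0700] "GET /index.html HTTP/1.1" 200 99',
    ]
    path, hits = most_popular(sorted(map_paths(sample)))
    print("%s\t%d" % (path, hits))
```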

Code

Java variant

P2Q3.java

Python variant

P2Q3_Mapper.py and P2Q3_Reducer.py

Solution

Please check acc_p2q3.tsv for the output of this problem statement.

Execution Log files

Please check acc_p2q3.txt and acc_p2q3.txt for command line execution log files of Java and Python respectively.

License

Copyright © 2013 Prashanth Babu.
Licensed under the Apache License, Version 2.0.
