• Stars: 457
• Rank: 95,775 (Top 2%)
• Language: Jupyter Notebook
• License: MIT License
• Created over 8 years ago
• Updated 5 months ago

Repository Details

Code for Agile Data Science 2.0, O'Reilly 2017, Second Edition

Agile Data Science 2.0 (O'Reilly, 2017)

This repository contains the updated source code for Agile Data Science 2.0, O'Reilly 2017. The book is available at the O'Reilly Store, on Amazon (in Paperback and Kindle), on O'Reilly Safari, and anywhere technical books are sold!

It was last updated to a fully running version in late October, 2021. You should refer to the Jupyter Notebooks in this repository rather than the book's source code, which is badly outdated and will no longer work for you.

Have problems? Please file an issue!

Deep Discovery

Like my work? Connect with me on LinkedIn!

Installation and Execution

There is now only ONE version of the install: Docker via the docker-compose.yml. It is MUCH EASIER than the old methods.

To build the agile Docker image, run this:

docker-compose build agile

To run the agile Docker image, defined by the docker-compose.yml and Dockerfile, run:

docker-compose up -d

Now visit: http://localhost:8888

Other Images

To manage the mongo image with Mongo Express, visit: http://localhost:8081
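
After docker-compose up -d, the containers may take a little while to start listening. The following is a minimal sketch, using only the Python standard library, of checking the two URLs mentioned above. The names is_up, report, and SERVICES, and the opener parameter, are hypothetical helpers for illustration and are not part of this repository.

```python
import urllib.error
import urllib.request

# URLs taken from this README; the labels are only used for display.
SERVICES = {
    "Jupyter Lab": "http://localhost:8888",
    "Mongo Express": "http://localhost:8081",
}


def is_up(url, opener=urllib.request.urlopen, timeout=3):
    """Return True if the URL answers at all, False if nothing is listening."""
    try:
        opener(url, timeout=timeout).close()
        return True
    except urllib.error.HTTPError:
        # The server responded with an error status (for example a login
        # prompt), which still means the service is running.
        return True
    except OSError:
        # Connection refused, timeout, DNS failure, and similar errors.
        return False


def report():
    """Print one status line per service."""
    for name, url in SERVICES.items():
        state = "up" if is_up(url) else "not reachable yet"
        print(f"{name}: {state} ({url})")
```

Calling report() from a Python shell on the host prints one line per service. The opener parameter exists only so the check can be exercised without a network.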

Downloading Data

Once the server comes up, download the data and you are ready to go. First open a shell in Jupyter Lab. The working directory corresponds to this folder.

Now download the data:

./download.sh

Running Examples

All scripts run from the base directory, except the web app, which runs from its chapter's web directory, e.g. ch08/web/. Open Welcome.ipynb and get started.

Jupyter Notebooks

All notebooks assume you have run the jupyter notebook command from the project root directory, Agile_Data_Code_2. If you are using a virtual machine image (Vagrant/VirtualBox or EC2), jupyter notebook is already running. See directions on port mapping to proceed.

The Data Value Pyramid

The book is organized around the data value pyramid, originally devised by Pete Warden. We climb one level of the pyramid with each chapter.

Data Value Pyramid

System Architecture

The following diagrams are pulled from the book, and express the basic concepts in the system architecture. The front and back end architectures work together to make a complete predictive system.

Front End Architecture

This diagram shows how the front end architecture works in our flight delay prediction application. The user fills out a form on a web page with some basic information and submits it to the server. The server fills in some necessary fields derived from those in the form, such as "day of year", and emits a Kafka message containing a prediction request. Spark Streaming listens on a Kafka queue for these requests and makes the prediction, storing the result in MongoDB. Meanwhile, the client has received a UUID in the form's response and has been polling another endpoint every second. Once the data is available in Mongo, the client's next request picks it up. Finally, the client displays the result of the prediction to the user!

This setup is extremely fun to set up, operate and watch. Check out chapters 7 and 8 for more information!
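
The request flow described above can be sketched in plain Python. This is an illustration under stated assumptions, not the repository's code: the field names FlightDate, DayOfYear and UUID, the status values OK and WAIT, and both function names are hypothetical, and a plain dict stands in for the MongoDB collection that the polling endpoint would read.

```python
import uuid
from datetime import date


def build_prediction_request(form):
    """Add derived fields and a tracking UUID to the submitted form values."""
    flight_date = date.fromisoformat(form["FlightDate"])
    request = dict(form)
    # Derived field mentioned in the text: day of year (1 to 366).
    request["DayOfYear"] = flight_date.timetuple().tm_yday
    # The UUID goes back to the client so it can poll for the result.
    request["UUID"] = str(uuid.uuid4())
    return request


def poll_for_result(results, request_id):
    """One polling step: return the stored prediction, or ask the client to wait."""
    if request_id in results:
        return {"status": "OK", "prediction": results[request_id]}
    return {"status": "WAIT"}
```

In a real deployment the request dict would be serialized and sent to Kafka, and poll_for_result would query the database instead of a dict. The client would call the polling step once per second until the status is no longer WAIT.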

Front End Architecture

Back End Architecture

The back end architecture diagram shows how we use Spark in batch to train a classifier model on historical data (all flights from 2015) stored on disk (HDFS, Amazon S3, etc.) to predict flight delays. We save the model to disk when it is ready. Next, we launch Zookeeper and a Kafka queue. We use Spark Streaming to load the classifier model, and then listen for prediction requests in a Kafka queue. When a prediction request arrives, Spark Streaming makes the prediction, storing the result in MongoDB where the web application can pick it up.

This architecture is extremely powerful, and it is a huge benefit that we get to use the same code in batch and in realtime with PySpark Streaming.
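
The streaming half of this design can be mimicked in a few lines of standard-library Python to show the shape of the loop. This is only a sketch under loud assumptions: queue.Queue stands in for the Kafka queue, a dict stands in for MongoDB, predict_delay is a made-up rule rather than the trained Spark classifier, and the field names UUID and DepDelay are hypothetical.

```python
import json
import queue


def predict_delay(request):
    """Stand-in for the trained classifier: a made-up rule, not a real model."""
    return 1 if request.get("DepDelay", 0) > 15 else 0


def handle_requests(requests, results, model=predict_delay):
    """Drain pending request messages and store one prediction per UUID."""
    handled = 0
    while True:
        try:
            message = requests.get_nowait()
        except queue.Empty:
            # Nothing left to process; report how many requests were handled.
            return handled
        request = json.loads(message)
        # Keyed by UUID so the web application can look the result up later.
        results[request["UUID"]] = model(request)
        handled += 1
```

The point of the sketch is the contract between the pieces: requests arrive as messages carrying a UUID, and results are written where the web application can find them by that same UUID.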

Backend Architecture

Screenshots

Below are some examples of parts of the application we build in this book and in this repo. Check out the book for more!

Airline Entity Page

Each airline gets its own entity page, complete with a summary of its fleet and a description pulled from Wikipedia.

Airline Page

Airplane Fleet Page

We demonstrate summarizing an entity with an airplane fleet page which describes the entire fleet.

Airplane Fleet Page

Flight Delay Prediction UI

We create an entire realtime predictive system with a web front-end to submit prediction requests.

Predicting Flight Delays UI

More Repositories

1. Agile_Data_Code (JavaScript, 158 stars): Chapter-wise code for Agile Data the O'Reilly book
2. weakly_supervised_learning_code (Jupyter Notebook, 37 stars): The source code to the book Weakly Supervised Learning (O'Reilly, 2020) by Russell Jurney
3. Collecting-Data (Python, 31 stars): This is a HOWTO for collecting data in Ruby and Python applications and sending it to S3 via Kafka.
4. enron-avro (19 stars): Code for creating and querying an Avro encoded repository of the UC Berkeley Enron email archive
5. github-explorer (Python, 17 stars): Recommender system for Github projects using the github archive data
6. enron-python-flask-cassandra-pig (Python, 17 stars): Hortonworks demo of Enron emails with Pig, Cassandra, Python and Flask
7. Cloud-Stenography (Java, 15 stars): Main Repo
8. enron-node-mongo (JavaScript, 13 stars): Building a simple Node application with Pig, MongoDB, Node.js and the Enron Emails
9. pig-to-json (Java, 13 stars): A Pig to JSON UDF for Pig that converts tuples and bags to JSON strings
10. enron-elasticsearch (12 stars): Pig/ElasticSearch/Wonderdog example with the Enron Emails
11. coursera_machine_learning (Python, 11 stars): Python examples of the homework examples for Andrew Ng's Stanford Machine Learning class on Coursera
12. github_network (Jupyter Notebook, 8 stars): Experimentation with Github data as a network
13. amazon_open_source (Jupyter Notebook, 8 stars): Analyzing Amazon's Free and Open Source Software (FOSS) contributions
14. Booting-the-Analytics-Application (Python, 6 stars): Data Syndrome HOWTO
15. disco (Python, 4 stars): A library for company name parsing based on cleanco
16. enron-hive (4 stars): Working with the Enron emails in Pig and HIVE
17. enron-jruby-sinatra-hbase-pig (Ruby, 4 stars): Hortonworks demo of Enron emails using Hadoop, Pig, HBase, JRuby, Sinatra
18. enron-pig-tojson-redis-node (Python, 3 stars): Enron Emails -> Pig -> ToJson -> RedisStorer -> Node.js
19. druid-application-development (Python, 3 stars): A Realtime Chart Web Application Development with Druid
20. libpostal-reborn (Jupyter Notebook, 3 stars): Code to go with my blog post, Libpostal, Reborn!
21. paas_blog (Jupyter Notebook, 3 stars): A series of blog posts exploring PaaS for automating data science tasks
22. timeseriesserde (Java, 2 stars): A time series serde for HIVE
23. deep_products (Jupyter Notebook, 2 stars): A book on building products using deep learning and natural language processing
24. enron-hcatalog (2 stars): Using HCatalog with the Enron Avro dataset
25. hive_tweets (Python, 2 stars): Process your tweets in Apache Hive
26. property_graph_analytics (2 stars): A forthcoming book on property graph analytics
27. nltk_exercises (Python, 2 stars): Working through the nltk book
28. Dattack (Ruby, 2 stars)
29. commoncrawl-pig-arcfileloader-udf-storefunc (2 stars): Pig ArcFileLoader examples for loading the Common Crawl internet data
30. enron-pig-accumulo (1 star): Example of using Pig with Accumulo on the Berkely enron emails
31. baby_names (Jupyter Notebook, 1 star): Project for US Baby Names example dashboard on Apache Superset
32. open_business_graph (Groovy, 1 star): Code relating to the Relato Business Graph on data.world
33. deep_learning (1 star): Deep learning tools and utilities
34. superset_postgres_github (Jupyter Notebook, 1 star): A project to wrangle github event data into Postgres for Superset to analyze
35. LinearAlgebra (Processing, 1 star): A Processing project to visualize all of Linear Algebra! :)
36. druid-python-demo (JavaScript, 1 star): Demonstration of druid, pyDruid, Flask and d3.js
37. addressbook_extensions (Python, 1 star): Titanium AddressBook extensions for iOS.
38. quantum_ai_readme (1 star): README cataloging resources for learning about Quantum Computing applications in Artificial Intelligence
39. atlanta-directory-project (1 star): Processing Atlanta Directories from Emory University to understand the demographics of race and class in Atlanta in the Late 19th and early 20th centuries