  • Stars: 1,589
  • Rank: 28,512 (Top 0.6 %)
  • Language: Jupyter Notebook
  • Created over 11 years ago
  • Updated over 7 years ago

Repository Details

Tutorial on scikit-learn and IPython for parallel machine learning

Parallel Machine Learning with scikit-learn and IPython

Video Tutorial

Video recording of this tutorial given at PyCon in 2013. The tutorial material has since been partly rearranged and extended. Look at the titles of the notebooks to follow along with the presentation.

Browse the static notebooks on nbviewer.ipython.org.

Scope of this tutorial:

  • Learn common machine learning concepts and how they match the scikit-learn Estimator API.

  • Learn about scalable feature extraction for text classification and clustering.

  • Learn how to run cross-validation and hyperparameter grid search in parallel with IPython.

  • Learn to analyze the kinds of common errors predictive models are subject to and how to refine your modeling to take this analysis into account.

  • Learn to optimize memory allocation on your computing nodes with NumPy memory mapping features.

  • Learn how to run a cheap IPython cluster for interactive predictive modeling on the Amazon EC2 spot instances using StarCluster.
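
As a rough illustration of the memory mapping item above, here is a minimal sketch that saves an array to disk and reopens it as a read-only memory map with `np.load(..., mmap_mode="r")`. The helper name `save_and_memmap` and the file name `data.npy` are made up for this example and do not come from the tutorial notebooks, which may organize this differently.

```python
import os
import tempfile

import numpy as np


def save_and_memmap(array, folder):
    """Save `array` to disk and reopen it as a read-only memory map.

    Hypothetical helper for illustration only.
    """
    path = os.path.join(folder, "data.npy")  # hypothetical file name
    np.save(path, array)
    # mmap_mode="r" maps the file instead of copying it into memory:
    # the OS loads pages lazily and can share them between processes
    # that map the same file.
    return np.load(path, mmap_mode="r")


with tempfile.TemporaryDirectory() as folder:
    data = np.arange(12, dtype=np.float64).reshape(3, 4)
    mapped = save_and_memmap(data, folder)
    is_memmap = isinstance(mapped, np.memmap)
    # Reductions read from the mapped file and return a regular array.
    row_sums = np.asarray(mapped.sum(axis=1))
    del mapped  # release the mapping before the folder is removed
```

Several worker processes on the same node can open the same file this way, so the data does not have to be duplicated once per process.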

Target audience

This tutorial targets developers with some experience with scikit-learn and machine learning concepts in general.

It is recommended to first go through one of the tutorials hosted at scikit-learn.org if you are new to scikit-learn.

You might also want to have a look at the SciPy Lecture Notes first if you are new to the NumPy / SciPy / matplotlib ecosystem.

Setup

Install the latest stable versions of NumPy, SciPy, matplotlib, IPython, psutil, and scikit-learn (e.g. IPython 2.2.0 and scikit-learn 0.15.2 at the time of writing).

You can find up-to-date installation instructions on scikit-learn.org and ipython.org.

To check your installation, launch the ipython interactive shell in a console and type the following import statements to check each library:

>>> import numpy
>>> import scipy
>>> import matplotlib
>>> import psutil
>>> import sklearn

If you don't get any message, everything is fine. If you get an error message, please ask for help on the mailing list of the matching project, and don't forget to mention the version of the library you are trying to install along with your platform and its version (e.g. Windows 8.1, Ubuntu 14.04, OS X 10.9).

You can exit the ipython shell by typing exit.
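
If you prefer a single script to typing the imports one by one, the check above can be sketched as follows. This is only a convenience sketch and not part of the tutorial material: the function name `check_libraries` is made up, and it assumes each library exposes a `__version__` attribute, falling back to "unknown" otherwise.

```python
import importlib

# Import names of the libraries listed in the setup instructions.
REQUIRED = ["numpy", "scipy", "matplotlib", "IPython", "psutil", "sklearn"]


def check_libraries(names):
    """Return a dict mapping each name to its version, or None if missing.

    Hypothetical helper for illustration only.
    """
    report = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = None
            continue
        # Not every module defines __version__, so fall back to a placeholder.
        report[name] = str(getattr(module, "__version__", "unknown"))
    return report


if __name__ == "__main__":
    for name, version in check_libraries(REQUIRED).items():
        status = version if version is not None else "NOT INSTALLED"
        print(f"{name}: {status}")
```

The printed versions are the ones to quote if you need to ask for help on a mailing list.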

Fetching the data

It is recommended to fetch the datasets before diving into the tutorial material itself. To do so, run the fetch_data.py script in this folder:

python fetch_data.py

Using the IPython notebook to follow the tutorial

The tutorial material and exercises are hosted in a set of IPython executable notebook files.

To run them interactively do:

$ cd notebooks
$ ipython notebook

This should automatically open a new browser window listing all the notebooks of the folder.

You can then execute the cells in order by pressing "Shift-Enter": the output is displayed directly under the cell and the cursor moves on to the next cell. Go to the "Help" menu for links to the notebook tutorial.

Credits

Some of this material is adapted from the scipy 2013 tutorial:

http://github.com/jakevdp/sklearn_scipy2013

Original authors:

More Repositories

1. notebooks (Jupyter Notebook, 556 stars): Some sample IPython notebooks for scikit-learn
2. pygbm (Python, 177 stars): Experimental Gradient Boosting Machines in Python with numba.
3. pignlproc (Java, 159 stars): Apache Pig utilities to build training corpora for machine learning / NLP out of public Wikipedia and DBpedia dumps.
4. python-appveyor-demo (PowerShell, 153 stars): Demo project for building Python wheels with appveyor.com
5. docker-distributed (Shell, 92 stars): Experimental docker-compose setup to bootstrap distributed on a docker-swarm cluster.
6. spylearn (Python, 79 stars): Repo for experiments on pyspark and sklearn
7. paper2ebook (Java, 53 stars): Utility to re-structure research papers published in US Letter or A4 format PDF files to typically remove the 2 columns layout.
8. text-mining-class (Python, 43 stars): Introduction to web scraping and text mining
9. dbpediakit (Python, 38 stars): Python utilities to work with the DBpedia dumps for analytics.
10. euroscipy-2022-time-series (Jupyter Notebook, 32 stars): Tutorial on time-series forecasting with scikit-learn
11. wheelhouse-uploader (Python, 31 stars): Script to help maintain a wheelhouse folder on a cloud storage.
12. my-linux-devbox (Scheme, 26 stars): Vagrant / Salt configuration with Ubuntu to work on projects related to the scipy stack under Python 3 and Python 2
13. oglearn (Python, 24 stars): ogrisel's utility extensions for scikit-learn
14. eegssl (Python, 16 stars): Experiments on Self-Supervised Learning on EEG data
15. mahout (Java, 15 stars): Personal development repository to prepare contributions and patches for Apache Mahout
16. euroscipy_2017_sklearn (Jupyter Notebook, 15 stars): Notebooks for the EuroScipy 2017 tutorial (based on Adult Census income data)
17. corpusmaker (Clojure, 14 stars): clojure utilities to build training corpora for machine learning / NLP out of public wikimedia dumps: status - partially stalled - will probably be reworked as cascalog scripts -- this project is in stalled mode right now: the pignlproc project is likely to replace it due to licensing constraints for future integration in Apache projects
18. python-winbuilder (Python, 9 stars): Tools to script a build environment on Windows for Python projects
19. codemaker (Python, 9 stars): Neural nets-based utility to build low dimensional codes and/or sparse codes
20. pycon-pydata-sprint (Python, 8 stars): Experimental work for using IPython.parallel with scikit-learn
21. salt-ipcluster (Scheme, 7 stars): Salt states and modules to set up an IPython cluster
22. docker-openblas (Shell, 5 stars): Docker container with an automated build for OpenBLAS stable branch:
23. stanbol-isbn (Java, 5 stars): Demo stanbol extension for detecting and linking ISBN in text documents
24. silva (4 stars): Leaf recognition prototype
25. bbuzz-semantic-hackathon (Java, 3 stars): Sandbox for the Berlin Buzzwords semantic hackathon
26. research (Jupyter Notebook, 3 stars): Draft research notes, code and todos
27. scikit-learn-github-actions (Python, 2 stars): Test repo for github actions workflows
28. dask-docker (Jupyter Notebook, 2 stars): Docker images for dask-distributed
29. ipython-azure (Python, 2 stars): Utilities to deploy an IPython parallel cluster on Windows Azure
30. lsh_glove (Python, 2 stars): Script to build various LSH / ANN indices on glove word embeddings
31. cardice (Python, 2 stars): Cloud compute cluster setup with SaltStack
32. brain2vec (Python, 2 stars): Brain embedding by contextual predictions (draft)
33. energy_charts (Jupyter Notebook, 2 stars)
34. decks (CSS, 2 stars): Slide decks for conferences
35. instrumentalist (Python, 2 stars): Python scripts to read XBee sensor data and push it to a couchdb database
36. mnist-sbi (Python, 2 stars): Simulation Based Inference for the important problem of drawing digits
37. scikit-learn.org (Python, 1 star): Source repository to build the HTML website for the scikit-learn project.
38. camera-html5 (1 star): Test repo for HTML5 camera access on mobile phones
39. sandbox (1 star)
40. cpython-nightly (1 star): Automated build of the master branch of CPython for Continuous Integration purposes
41. docker-sklearn-openblas (Shell, 1 star)
42. us-housing-prices-v2-parquet (Jupyter Notebook, 1 star): Exploratory Data Analysis on a parquet dump of https://www.dolthub.com/repositories/dolthub/us-housing-prices-v2 using duckdb and Ibis