• Stars: 279
• Rank: 147,967 (Top 3 %)
• Language: Jupyter Notebook
• Created over 7 years ago
• Updated about 1 year ago

Repository Details

Data Cleaning Libraries with Python

Data Cleaning 101

Welcome to the code repository for Practical Data Cleaning with Python! This is a two-day training offered through Safari with O'Reilly Media. You can sign up by searching for the course on Safari.

This course aims to give you a practical overview of data cleaning and validation libraries and methods in Python. Since we only have 6 hours, it can't go deeply into any one library or tool, but I have tried to include useful tools I have found in my work and to incorporate a mixture of the munging and testing I have seen in my own and others' workflows.

If you have a suggestion for another library or additional topic, feel free to drop me a line :)

Installation

These lessons have been tested with Python 3.4 and Python 3.6 and primarily use the latest release of each library, except where versions are pinned. You can likely run most of the code with older releases, but if you run into an issue, try upgrading the library in question first.

pip install -r install_reqs.txt

I believe this will also work with Conda, although I am less familiar with Conda, so please report issues! (Special thanks to @blue_hacker for this fix!)

$ conda create -n dataclean --copy python=3.6
$ source activate dataclean
$ pip install -r install_reqs.txt
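If something fails after installing, it can help to see which packages actually landed in the active environment. The following is a minimal sketch, not part of the course materials: it assumes install_reqs.txt uses ordinary pip requirement lines (for example name==version), it does not handle URL or VCS requirements, and it relies on importlib.metadata, which is only in the standard library from Python 3.8 onward. The helper names parse_requirement_names and installed_versions are made up for this illustration.

```python
import sys
from importlib import metadata  # standard library in Python 3.8+


def parse_requirement_names(lines):
    """Return package names from pip-style requirement lines."""
    names = []
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        # Cut the name off at the first version, extras, or marker character.
        for sep in ("==", ">=", "<=", "~=", "!=", ">", "<", "[", ";"):
            line = line.split(sep, 1)[0]
        names.append(line.strip())
    return names


def installed_versions(names):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    try:
        with open("install_reqs.txt") as handle:
            names = parse_requirement_names(handle)
    except FileNotFoundError:
        names = []
        print("install_reqs.txt not found; run this from the repository root")
    for name, version in installed_versions(names).items():
        print(name, version or "NOT INSTALLED")
```

Run it from the repository root inside the environment you created. Any package reported as NOT INSTALLED is a good first candidate to reinstall or upgrade.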

In addition, you will need to install sqlite3, or change the day two case study to use a connection string for your database of choice.
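To confirm that SQLite is usable from Python before day two, a quick check like the one below may help. This is only a sketch under stated assumptions: the readings table and the count_missing helper are hypothetical and are not taken from the case study, and an in-memory database is used so nothing is written to disk.

```python
import sqlite3


def count_missing(conn):
    """Count rows whose value column is NULL in the hypothetical table."""
    query = "SELECT COUNT(*) FROM readings WHERE value IS NULL"
    return conn.execute(query).fetchone()[0]


# ":memory:" keeps the database in RAM; pass a file path to persist it.
conn = sqlite3.connect(":memory:")

# Hypothetical table, used only for this illustration.
conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings (sensor, value) VALUES (?, ?)",
    [("a", 1.5), ("b", None), ("a", 2.5)],
)
conn.commit()

missing = count_missing(conn)
print("SQLite library version:", sqlite3.sqlite_version)
print("rows with a missing value:", missing)
conn.close()
```

If the import or the connect call fails, the Python build in use probably lacks SQLite support. For a different database, the equivalent step would be opening a connection with that database's own driver and connection string.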

If you want to visualize graphs using Dask, you will need to install Graphviz, which has platform-specific requirements. For Linux, it is usually available via the system package manager (apt, yum). For other platforms, you might need to use a special installer. It is also available via "conda install graphviz" and "pip install graphviz", but these might not include all necessary dependencies for your OS. For best results, search for your OS and "install graphviz and dependencies" and follow a recent article on setup.
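Because the Graphviz program and the graphviz Python package are installed separately, it is easy to end up with only one of them. The sketch below checks for both using only the standard library. It assumes the Graphviz layout program is named dot and must be on your PATH, which is the usual setup; graphviz_status is a name invented for this example.

```python
import importlib.util
import shutil


def graphviz_status():
    """Report whether the Graphviz binary and Python bindings are visible."""
    return {
        # `dot` is the Graphviz layout program; None means it is not on PATH.
        "dot_executable": shutil.which("dot"),
        # The `graphviz` Python package is a separate install from the binary.
        "python_package": importlib.util.find_spec("graphviz") is not None,
    }


if __name__ == "__main__":
    for key, value in graphviz_status().items():
        print(key, "->", value)
```

If dot_executable is None, install Graphviz through your system package manager or installer. If python_package is False, install the Python bindings into the active environment.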

Repository structure

Each day coincides with a particular notebook folder. For day one, we will use cleaning-notebooks. Day two will focus on validation-notebooks. The data folder holds data we will use throughout the course. The queue_example.py file is used in the day two case study.

Python2 v. Python3

This repository has been built with Python 3. If you are using Python 2 and need help porting some logic or finding alternatives, please let me know and I will try and help. :)

Corrections?

If you find any issues in these code examples, feel free to submit an Issue or Pull Request. I appreciate your input!

Questions?

Reach out to @kjam on Twitter or GitHub. @kjam is also often on freenode. :)

More Repositories

1. python-web-scraping-tutorial (Python, 210 stars)
   A Python-based web and data scraping tutorial
2. data-pipelines-course (Jupyter Notebook, 194 stars)
   Course materials for my data pipeline video course with O'Reilly
3. wswp (Python, 130 stars)
   Code for the second edition Web Scraping with Python book by Packt Publications
4. data-wrangling-pycon (Jupyter Notebook, 81 stars)
   An Introduction to Data Wrangling with Python
5. practical-data-privacy (Jupyter Notebook, 70 stars)
   Practical Data Privacy
6. python_flight_search (Python, 54 stars)
   Using Python to search for flights.
7. datafuzz (Python, 30 stars)
   A data science Python library aimed at adding fuzz, noise and other issues to your data for testing purposes.
8. data-wrangling-video (Jupyter Notebook, 28 stars)
   Code and examples for O'Reilly's Data Wrangling with Python video course
9. intro-to-ml (Jupyter Notebook, 16 stars)
   A basic introduction to machine learning (one day training).
10. random_hackery (Jupyter Notebook, 10 stars)
    Just little bits.
11. europarl_scraper (Jupyter Notebook, 9 stars)
    European Parliament website Python scraper
12. uf-data-mining-and-analysis (Jupyter Notebook, 8 stars)
    University of Florida Data Mining and Analysis
13. web-scraping-speed-comparison (Python, 6 stars)
    A Python web scraping speed comparison
14. uf-intro-to-programming (HTML, 6 stars)
    University of Florida Audience Analytics Introduction to Programming with Data course
15. cherrypy-poll (Python, 6 stars)
    Polling with cherrypy: A beginner's project guide to python programming
16. kjam-datalab-notebooks (4 stars)
    Some Example Jupyter Notebooks using Google's DataLab
17. cron-parser (Python, 1 star)
    Python script that allows you to easily update a server cron that has many different projects without overwriting other crons.
18. chatbot_scraper (Python, 1 star)
    Python scraper(s) for chatbot logs. Currently supports botbot.me logs.