  • Stars: 1,054
  • Rank: 43,744 (Top 0.9%)
  • Language: Python
  • License: MIT License
  • Created over 8 years ago
  • Updated over 5 years ago

Repository Details

A Python tool that automatically cleans data sets and readies them for analysis.

datacleaner

Join the chat at https://gitter.im/rhiever/datacleaner

datacleaner is not magic

datacleaner works with data in pandas DataFrames.

datacleaner is not magic, and it won't take an unorganized blob of text and automagically parse it out for you.

What datacleaner will do is save you a ton of time encoding and cleaning your data once it's already in a format that pandas DataFrames can handle.

Currently, datacleaner does the following:

  • Optionally drops any row with a missing value

  • Replaces missing values with the mode (for categorical variables) or median (for continuous variables) on a column-by-column basis

  • Encodes non-numerical variables (e.g., categorical variables with strings) with numerical equivalents

We plan to add more cleaning features as the project grows.
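
The three steps listed above can be illustrated with plain pandas. This is a minimal sketch of the idea, not datacleaner's actual implementation: the function name `clean_sketch` and the example columns are hypothetical, and category codes stand in for whatever encoder datacleaner applies.

```python
import pandas as pd


def clean_sketch(df, drop_nans=False):
    """Hypothetical illustration of the cleaning steps (not datacleaner's code)."""
    df = df.copy()
    if drop_nans:
        # Optionally drop any row with a missing value.
        df = df.dropna()
    for column in df.columns:
        if pd.api.types.is_numeric_dtype(df[column]):
            # Continuous variable: fill missing values with the column median.
            df[column] = df[column].fillna(df[column].median())
        else:
            # Categorical variable: fill with the mode, then encode as integers.
            df[column] = df[column].fillna(df[column].mode().iloc[0])
            df[column] = df[column].astype('category').cat.codes
    return df


example = pd.DataFrame({
    'age': [25.0, None, 40.0, 31.0],
    'city': ['Lansing', 'Detroit', None, 'Lansing'],
})
cleaned = clean_sketch(example)
```

Here the missing `age` is filled with the median of the observed ages, and the missing `city` is filled with the most frequent city before each city name is replaced by an integer code.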

License

Please see the repository license for the licensing and usage information for datacleaner.

Generally, we have licensed datacleaner to make it as widely usable as possible.

Installation

datacleaner is built to use pandas DataFrames and some scikit-learn modules for data preprocessing. As such, we recommend installing the Anaconda Python distribution prior to installing datacleaner.

Once the prerequisites are installed, datacleaner can be installed with a simple pip command:

pip install datacleaner

Usage

datacleaner on the command line

datacleaner can be used on the command line. Use --help to see its usage instructions.

usage: datacleaner [-h] [-cv CROSS_VAL_FILENAME] [-o OUTPUT_FILENAME]
                   [-cvo CV_OUTPUT_FILENAME] [-is INPUT_SEPARATOR]
                   [-os OUTPUT_SEPARATOR] [--drop-nans]
                   [--ignore-update-check] [--version]
                   INPUT_FILENAME

A Python tool that automatically cleans data sets and readies them for analysis

positional arguments:
  INPUT_FILENAME        File name of the data file to clean

optional arguments:
  -h, --help            show this help message and exit
  -cv CROSS_VAL_FILENAME
                        File name for the validation data set if performing
                        cross-validation
  -o OUTPUT_FILENAME    Data file to output the cleaned data set to
  -cvo CV_OUTPUT_FILENAME
                        Data file to output the cleaned cross-validation data
                        set to
  -is INPUT_SEPARATOR   Column separator for the input file(s) (default: \t)
  -os OUTPUT_SEPARATOR  Column separator for the output file(s) (default: \t)
  --drop-nans           Drop all rows that have a NaN in any column (default: False)
  --ignore-update-check
                        Do not check for the latest version of datacleaner
                        (default: False)
  --version             show program's version number and exit

An example command-line call to datacleaner may look like:

datacleaner my_data.csv -o my_clean.data.csv -is , -os ,

which will read the data from my_data.csv (assuming columns are separated by commas), clean the data set, then output the resulting data set to my_clean.data.csv.
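
If you drive the command-line tool from a Python pipeline, the flags from the usage text map onto an argument list. The sketch below only assembles that list. The helper names `build_datacleaner_command` and `run_datacleaner` and the file names are hypothetical; the flags themselves are the ones shown in the help output above.

```python
import subprocess


def build_datacleaner_command(input_filename, output_filename,
                              cv_filename=None, cv_output_filename=None,
                              separator=',', drop_nans=False):
    """Assemble a datacleaner command line from the documented flags."""
    command = ['datacleaner', input_filename, '-o', output_filename,
               '-is', separator, '-os', separator]
    if cv_filename is not None:
        command += ['-cv', cv_filename]
    if cv_output_filename is not None:
        command += ['-cvo', cv_output_filename]
    if drop_nans:
        command.append('--drop-nans')
    return command


def run_datacleaner(command):
    """Run the assembled command; requires datacleaner to be installed."""
    subprocess.run(command, check=True)


command = build_datacleaner_command(
    'my_training_data.csv', 'my_clean_training_data.csv',
    cv_filename='my_testing_data.csv',
    cv_output_filename='my_clean_testing_data.csv',
)
```

Passing `command` to `run_datacleaner` would then clean both files in one call, with the validation file supplied via `-cv` and its output via `-cvo`.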

datacleaner in scripts

datacleaner can also be used as part of a script. There are two primary functions implemented in datacleaner: autoclean and autoclean_cv.

autoclean(input_dataframe, drop_nans=False, copy=False, encoder=None, encoder_kwargs=None, ignore_update_check=False)
    Performs a series of automated data cleaning transformations on the provided data set
    
    Parameters
    ----------
    input_dataframe: pandas.DataFrame
        Data set to clean
    drop_nans: bool
        Drop all rows that have a NaN in any column (default: False)
    copy: bool
        Make a copy of the data set (default: False) 
    encoder: category_encoders transformer
        A valid category_encoders transformer, which is passed an inferred cols list (default: None, which uses LabelEncoder)
    encoder_kwargs: category_encoders
        A valid sklearn transformer to encode categorical features (default: None)
    ignore_update_check: bool
        Do not check for the latest version of datacleaner

    Returns
    ----------
    output_dataframe: pandas.DataFrame
        Cleaned data set

autoclean_cv(training_dataframe, testing_dataframe, drop_nans=False, copy=False, encoder=None, encoder_kwargs=None, ignore_update_check=False)
    Performs a series of automated data cleaning transformations on the provided training and testing data sets
    
    Unlike `autoclean()`, this function takes cross-validation into account by learning the data transformations
    from only the training set, then applying those transformations to both the training and testing set.
    By doing so, this function prevents information from leaking from the testing set into the training set.
    
    Parameters
    ----------
    training_dataframe: pandas.DataFrame
        Training data set
    testing_dataframe: pandas.DataFrame
        Testing data set
    drop_nans: bool
        Drop all rows that have a NaN in any column (default: False)
    copy: bool
        Make a copy of the data set (default: False)  
    encoder: category_encoders transformer
        A valid category_encoders transformer, which is passed an inferred cols list (default: None, which uses LabelEncoder)
    encoder_kwargs: category_encoders
        A valid sklearn transformer to encode categorical features (default: None)
    ignore_update_check: bool
        Do not check for the latest version of datacleaner

    Returns
    ----------
    output_training_dataframe: pandas.DataFrame
        Cleaned training data set
    output_testing_dataframe: pandas.DataFrame
        Cleaned testing data set
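
The idea behind autoclean_cv, learning a transformation from the training set only and then applying it to both sets, can be shown with plain pandas. This is a sketch of the principle under assumed data, not datacleaner's internals; the `income` column and its values are made up for illustration.

```python
import pandas as pd

training = pd.DataFrame({'income': [30.0, 50.0, None, 70.0]})
testing = pd.DataFrame({'income': [None, 1000.0]})

# Learn the fill value from the training set only.
training_median = training['income'].median()

# Apply the same learned value to both sets, so nothing from the
# testing set (such as its extreme value) influences the transformation.
clean_training = training.fillna({'income': training_median})
clean_testing = testing.fillna({'income': training_median})
```

Had the median been computed over both sets together, the testing set's outlier would have shifted the value used to fill the training data.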

Below is an example of datacleaner performing basic cleaning on a data set.

from datacleaner import autoclean
import pandas as pd

my_data = pd.read_csv('my_data.csv', sep=',')
my_clean_data = autoclean(my_data)
my_clean_data.to_csv('my_clean_data.csv', sep=',', index=False)

Note that because datacleaner works directly on pandas DataFrames, all DataFrame operations are still available to the resulting data sets.
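
A training/testing pair can be cleaned in the same way with autoclean_cv. The following is a hedged sketch based on the signature documented above: the wrapper name `clean_train_test` and the file names are hypothetical, and `copy=True` is assumed to leave the input DataFrames unmodified.

```python
import pandas as pd


def clean_train_test(train_path, test_path, sep=','):
    """Clean a training/testing pair of CSV files with autoclean_cv."""
    # Imported here so this module still loads where datacleaner is absent.
    from datacleaner import autoclean_cv

    training_data = pd.read_csv(train_path, sep=sep)
    testing_data = pd.read_csv(test_path, sep=sep)
    # autoclean_cv returns the cleaned training set and the cleaned testing set.
    return autoclean_cv(training_data, testing_data, copy=True)


# Example call (file names are placeholders):
# clean_training, clean_testing = clean_train_test('my_training_data.csv',
#                                                  'my_testing_data.csv')
```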

Contributing to datacleaner

We welcome you to check the existing issues for bugs or enhancements to work on. If you have an idea for an extension to datacleaner, please file a new issue so we can discuss it.

Citing datacleaner

If you use datacleaner as part of your workflow in a scientific publication, please consider citing the datacleaner repository with the following DOI:

DOI
