  • Stars 3,634
  • Rank 11,620 (Top 0.3 %)
  • Language: Python
  • License: MIT License
  • Created about 8 years ago
  • Updated about 1 year ago

Repository Details

Missing data visualization module for Python.

missingno

Messy datasets? Missing values? missingno provides a small toolset of flexible and easy-to-use missing data visualizations and utilities that allows you to get a quick visual summary of the completeness (or lack thereof) of your dataset. Just pip install missingno to get started.

quickstart

This quickstart uses a sample of the NYPD Motor Vehicle Collisions dataset.

import pandas as pd
collisions = pd.read_csv("https://raw.githubusercontent.com/ResidentMario/missingno-data/master/nyc_collision_factors.csv")

matrix

The msno.matrix nullity matrix is a data-dense display which lets you quickly visually pick out patterns in data completion.

import missingno as msno
%matplotlib inline
msno.matrix(collisions.sample(250))

At a glance, date, time, the distribution of injuries, and the contribution factor of the first vehicle appear to be completely populated, while geographic information seems mostly complete, but spottier.

The sparkline at right summarizes the general shape of the data completeness and points out the rows with the maximum and minimum nullity in the dataset.
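
As a rough illustration of what the sparkline summarizes, the per-row completeness can be computed directly with pandas. This is a minimal sketch, assuming the sparkline traces the number of non-null values in each row; the toy frame and its values are hypothetical stand-ins for the collisions sample.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the collisions sample.
toy = pd.DataFrame({
    "DATE": ["2016-01-01", "2016-01-02", "2016-01-03", "2016-01-04"],
    "LATITUDE": [40.7, np.nan, 40.6, np.nan],
    "LONGITUDE": [-73.9, np.nan, -74.0, -73.8],
})

# Count of non-null values in each row.
row_completeness = toy.notnull().sum(axis=1)

# The extremes correspond to the most and least complete rows.
most_complete = row_completeness.max()
least_complete = row_completeness.min()
```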

This visualization will comfortably accommodate up to 50 labelled variables. Past that range labels begin to overlap or become unreadable, and by default large displays omit them.

If you are working with time-series data, you can specify a periodicity using the freq keyword parameter:

import numpy as np

null_pattern = (np.random.random(1000).reshape((50, 20)) > 0.5).astype(bool)
null_pattern = pd.DataFrame(null_pattern).replace({False: None})
msno.matrix(null_pattern.set_index(pd.period_range('1/1/2011', '2/1/2015', freq='M')), freq='BQ')

bar

msno.bar is a simple visualization of nullity by column:

msno.bar(collisions.sample(1000))

You can switch to a logarithmic scale by specifying log=True. bar provides the same information as matrix, but in a simpler format.
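
The numbers behind a per-column nullity chart can also be inspected without plotting. The following is a sketch in plain pandas, not missingno's own code, and it assumes the bars represent the count (or share) of non-null values per column; the toy data is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the collisions sample.
toy = pd.DataFrame({
    "DATE": ["2016-01-01", "2016-01-02", "2016-01-03", "2016-01-04"],
    "LATITUDE": [40.7, np.nan, 40.6, np.nan],
    "LONGITUDE": [-73.9, np.nan, -74.0, -73.8],
})

# Number of filled (non-null) values per column.
filled_counts = toy.notnull().sum()

# Share of filled values per column, between 0 and 1.
filled_share = toy.notnull().mean()
```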

heatmap

The missingno correlation heatmap measures nullity correlation: how strongly the presence or absence of one variable affects the presence of another:

msno.heatmap(collisions)

In this example, it seems that reports which are filed with an OFF STREET NAME variable are less likely to have complete geographic data.

Nullity correlation ranges from -1 (if one variable appears the other definitely does not) to 0 (variables appearing or not appearing have no effect on one another) to 1 (if one variable appears the other definitely also does).
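
To make the two ends of that range concrete, here is a small sketch on hypothetical data: column B is missing exactly when A is missing, and column C is present exactly when A is missing. The column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, np.nan],
    "B": ["x", None, "z", None],       # missing exactly when A is missing
    "C": [np.nan, 2.0, np.nan, 4.0],   # present exactly when A is missing
})

# Correlate the 0/1 missingness indicators rather than the values themselves.
nullity = toy.isnull().astype(int)
corr = nullity.corr().round(6)

a_vs_b = corr.loc["A", "B"]  # columns always missing together
a_vs_c = corr.loc["A", "C"]  # columns never missing together
```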

The exact algorithm used is:

import numpy as np

# df is a pandas.DataFrame instance
df = df.iloc[:, [i for i, n in enumerate(np.var(df.isnull(), axis='rows')) if n > 0]]
corr_mat = df.isnull().corr()

Variables that are always full or always empty have no meaningful correlation, and so are silently removed from the visualization. In this case, for instance, the datetime and injury number columns, which are completely filled, are not included.
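
The filtering step can be seen on a toy frame. This sketch uses the same idea, dropping columns whose nullity has zero variance, but it is written independently of the library and the column names are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "DATE": ["2016-01-01", "2016-01-02", "2016-01-03"],  # always filled
    "LATITUDE": [40.7, np.nan, 40.6],                     # partially filled
    "UNUSED": [np.nan, np.nan, np.nan],                   # always empty
})

# 0/1 missingness indicators; constant columns have zero variance.
nullity = toy.isnull().astype(int)
kept = nullity.loc[:, nullity.var() > 0]
kept_columns = list(kept.columns)
```

Only the partially filled column survives, since the other two carry no information about how missingness varies.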

Entries marked <1 or >-1 have a correlation that is close to exactly positive or negative, but not perfectly so. This points to a small number of records in the dataset which are erroneous. For example, in this dataset the correlation between VEHICLE TYPE CODE 3 and CONTRIBUTING FACTOR VEHICLE 3 is <1, indicating that, contrary to our expectation, there are a few records which have one or the other, but not both. These cases will require special attention.
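
One way to pull out such records for inspection is to select the rows where exactly one of the two fields is missing. This is a sketch on hypothetical records; the column names are written as they would appear for the third vehicle and may need adjusting to match your copy of the data.

```python
import pandas as pd

# Hypothetical records for illustration only.
toy = pd.DataFrame({
    "VEHICLE TYPE CODE 3": ["SEDAN", None, None, "TAXI", None],
    "CONTRIBUTING FACTOR VEHICLE 3": ["Unspecified", None, "Unspecified", None, None],
})

type_missing = toy["VEHICLE TYPE CODE 3"].isnull()
factor_missing = toy["CONTRIBUTING FACTOR VEHICLE 3"].isnull()

# XOR is True where one field is filled and the other is not.
mismatched = toy[type_missing ^ factor_missing]
```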

The heatmap works great for picking out data completeness relationships between variable pairs, but its explanatory power is limited when it comes to larger relationships and it has no particular support for extremely large datasets.

dendrogram

The dendrogram allows you to more fully correlate variable completion, revealing trends deeper than the pairwise ones visible in the correlation heatmap:

msno.dendrogram(collisions)

The dendrogram uses a hierarchical clustering algorithm (courtesy of scipy) to bin variables against one another by their nullity correlation (measured in terms of binary distance). At each step of the tree the variables are split up based on which combination minimizes the distance of the remaining clusters. The more monotone the set of variables, the closer their total distance is to zero, and the closer their average distance (the y-axis) is to zero.

The exact algorithm used is:

from scipy.cluster import hierarchy
import numpy as np

# df is a pandas.DataFrame instance
x = np.transpose(df.isnull().astype(int).values)
method = 'average'  # linkage method; assumed here, since the snippet leaves it undefined
z = hierarchy.linkage(x, method)

To interpret this graph, read it from a top-down perspective. Cluster leaves which linked together at a distance of zero fully predict one another's presence—one variable might always be empty when another is filled, or they might always both be filled or both empty, and so on. In this specific example the dendrogram glues together the variables which are required and therefore present in every record.
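
A zero-distance link is easy to reproduce on toy data. The sketch below assumes average linkage with scipy's default Euclidean metric; the three columns are hypothetical, with A and B sharing one nullity pattern and C differing from them in two rows.

```python
import numpy as np
import pandas as pd
from scipy.cluster import hierarchy

toy = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, np.nan, 5.0],
    "B": [1.0, np.nan, 3.0, np.nan, 5.0],   # same nullity pattern as A
    "C": [1.0, 2.0, np.nan, np.nan, 5.0],   # differs from A in two rows
})

# One row per column of the frame, holding its 0/1 missingness pattern.
x = np.transpose(toy.isnull().astype(int).values)
z = hierarchy.linkage(x, method="average")

# Each row of z is one merge: [cluster_i, cluster_j, distance, size].
first_merge_distance = z[0, 2]
second_merge_distance = z[1, 2]
```

A and B merge first at distance zero because their missingness is identical, while C joins later at a positive height.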

Cluster leaves which split close to zero, but not at it, predict one another very well, but still imperfectly. If your own interpretation of the dataset is that these columns actually do, or ought to, match each other in nullity (for example, as CONTRIBUTING FACTOR VEHICLE 2 and VEHICLE TYPE CODE 2 ought to), then the height of the cluster leaf tells you, in absolute terms, how often the records are "mismatched" or incorrectly filed, that is, how many values you would have to fill in or drop, if you are so inclined.

As with matrix, only up to 50 labelled columns will comfortably display in this configuration. However, the dendrogram handles extremely large datasets more elegantly by simply flipping to a horizontal configuration.

configuration

For more advanced configuration details for your plots, refer to the CONFIGURATION.md file in this repository.

contributing

For thoughts on features or bug reports see Issues. If you're interested in contributing to this library, see details on doing so in the CONTRIBUTING.md file in this repository. If doing so, keep in mind that missingno is currently in a maintenance state, so while bugfixes are welcome, I am unlikely to review or land any new major library features.
