Tidy data structures, summaries, and visualisations for missing data

naniar

naniar provides principled, tidy ways to summarise, visualise, and manipulate missing data with minimal deviations from the workflows in ggplot2 and tidy data. It does this by providing:

  • Shadow matrices, a tidy data structure for missing data:
    • bind_shadow() and nabular()
  • Shorthand summaries for missing data:
    • n_miss() and n_complete()
    • pct_miss() and pct_complete()
  • Numerical summaries of missing data in variables and cases:
    • miss_var_summary() and miss_var_table()
    • miss_case_summary() and miss_case_table()
  • Statistical tests of missingness:
    • mcar_test()
  • Visualisation for missing data:
    • geom_miss_point()
    • gg_miss_var()
    • gg_miss_case()
    • gg_miss_fct()

For more details on the workflow and theory underpinning naniar, read the vignette Getting started with naniar.

For a short primer on the data visualisation available in naniar, read the vignette Gallery of Missing Data Visualisations.

For full details of the package, including

Installation

You can install naniar from CRAN:

install.packages("naniar")

Or you can install the development version from GitHub using remotes:

# install.packages("remotes")
remotes::install_github("njtierney/naniar")

A short overview of naniar

Visualising missing data might sound a little strange - how do you visualise something that is not there? One approach to visualising missing data comes from ggobi and manet, which replace NA values with values 10% lower than the minimum value in that variable. This visualisation is provided with the geom_miss_point() ggplot2 geom, which we illustrate by exploring the relationship between Ozone and Solar radiation from the airquality dataset.

library(ggplot2)

ggplot(data = airquality,
       aes(x = Ozone,
           y = Solar.R)) +
  geom_point()
#> Warning: Removed 42 rows containing missing values or values outside the scale range
#> (`geom_point()`).

ggplot2 drops the rows with missing values from the plot and warns us about them.

We can instead use geom_miss_point() to display the missing data:

library(naniar)

ggplot(data = airquality,
       aes(x = Ozone,
           y = Solar.R)) +
  geom_miss_point()

geom_miss_point() has shifted the missing values so that they now sit 10% below the minimum value. The missing values are drawn in a different colour so that missingness becomes pre-attentive. As it is a ggplot2 geom, it supports faceting and other ggplot2 features.

p1 <-
ggplot(data = airquality,
       aes(x = Ozone,
           y = Solar.R)) + 
  geom_miss_point() + 
  facet_wrap(~Month, ncol = 2) + 
  theme(legend.position = "bottom")

p1
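
The shift that geom_miss_point() applies can be sketched by hand in base R. This is a minimal illustration, not naniar's implementation: the helper name shift_missing() and its prop_below argument are hypothetical, and the sketch assumes that "10% below" means 10% of the variable's observed range below its minimum.

```r
# Hypothetical helper: replace NA with a value below the observed minimum.
# Assumption: the offset is prop_below times the observed range.
shift_missing <- function(x, prop_below = 0.1) {
  rng <- range(x, na.rm = TRUE)
  below <- rng[1] - prop_below * diff(rng)
  ifelse(is.na(x), below, x)
}

ozone_shifted <- shift_missing(airquality$Ozone)

# Every formerly missing value now sits below the smallest observed value
all(ozone_shifted[is.na(airquality$Ozone)] <
      min(airquality$Ozone, na.rm = TRUE))
```

Because the substituted values fall outside the observed range, they form a visible band along the axis instead of being dropped.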

Data Structures

naniar provides a data structure for working with missing data, the shadow matrix (Swayne and Buja, 1998). The shadow matrix has the same dimensions as the data and consists of binary indicators of missingness, where missing is represented as “NA” and not missing is represented as “!NA”. Variable names are kept the same, with the suffix “_NA” added.

head(airquality)
#>   Ozone Solar.R Wind Temp Month Day
#> 1    41     190  7.4   67     5   1
#> 2    36     118  8.0   72     5   2
#> 3    12     149 12.6   74     5   3
#> 4    18     313 11.5   62     5   4
#> 5    NA      NA 14.3   56     5   5
#> 6    28      NA 14.9   66     5   6

as_shadow(airquality)
#> # A tibble: 153 × 6
#>    Ozone_NA Solar.R_NA Wind_NA Temp_NA Month_NA Day_NA
#>    <fct>    <fct>      <fct>   <fct>   <fct>    <fct> 
#>  1 !NA      !NA        !NA     !NA     !NA      !NA   
#>  2 !NA      !NA        !NA     !NA     !NA      !NA   
#>  3 !NA      !NA        !NA     !NA     !NA      !NA   
#>  4 !NA      !NA        !NA     !NA     !NA      !NA   
#>  5 NA       NA         !NA     !NA     !NA      !NA   
#>  6 !NA      NA         !NA     !NA     !NA      !NA   
#>  7 !NA      !NA        !NA     !NA     !NA      !NA   
#>  8 !NA      !NA        !NA     !NA     !NA      !NA   
#>  9 !NA      !NA        !NA     !NA     !NA      !NA   
#> 10 NA       !NA        !NA     !NA     !NA      !NA   
#> # ℹ 143 more rows
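
To make the structure concrete, a shadow matrix can be approximated by hand in base R. This is only a sketch of the idea described above, not how as_shadow() is implemented; the object name shadow is chosen for illustration.

```r
# Build a shadow matrix by hand: one factor column per variable,
# with levels "!NA" (present) and "NA" (missing).
shadow <- as.data.frame(
  lapply(airquality, function(x) {
    factor(ifelse(is.na(x), "NA", "!NA"), levels = c("!NA", "NA"))
  })
)
names(shadow) <- paste0(names(airquality), "_NA")

dim(shadow)             # same dimensions as airquality
table(shadow$Ozone_NA)  # counts of present and missing Ozone values
```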

Binding the shadow to the data helps you keep better track of the missing values. This format is called “nabular”, a portmanteau of NA and tabular. You can bind the shadow to the data using bind_shadow() or nabular():

bind_shadow(airquality)
#> # A tibble: 153 × 12
#>    Ozone Solar.R  Wind  Temp Month   Day Ozone_NA Solar.R_NA Wind_NA Temp_NA
#>    <int>   <int> <dbl> <int> <int> <int> <fct>    <fct>      <fct>   <fct>  
#>  1    41     190   7.4    67     5     1 !NA      !NA        !NA     !NA    
#>  2    36     118   8      72     5     2 !NA      !NA        !NA     !NA    
#>  3    12     149  12.6    74     5     3 !NA      !NA        !NA     !NA    
#>  4    18     313  11.5    62     5     4 !NA      !NA        !NA     !NA    
#>  5    NA      NA  14.3    56     5     5 NA       NA         !NA     !NA    
#>  6    28      NA  14.9    66     5     6 !NA      NA         !NA     !NA    
#>  7    23     299   8.6    65     5     7 !NA      !NA        !NA     !NA    
#>  8    19      99  13.8    59     5     8 !NA      !NA        !NA     !NA    
#>  9     8      19  20.1    61     5     9 !NA      !NA        !NA     !NA    
#> 10    NA     194   8.6    69     5    10 NA       !NA        !NA     !NA    
#> # ℹ 143 more rows
#> # ℹ 2 more variables: Month_NA <fct>, Day_NA <fct>
nabular(airquality)
#> # A tibble: 153 × 12
#>    Ozone Solar.R  Wind  Temp Month   Day Ozone_NA Solar.R_NA Wind_NA Temp_NA
#>    <int>   <int> <dbl> <int> <int> <int> <fct>    <fct>      <fct>   <fct>  
#>  1    41     190   7.4    67     5     1 !NA      !NA        !NA     !NA    
#>  2    36     118   8      72     5     2 !NA      !NA        !NA     !NA    
#>  3    12     149  12.6    74     5     3 !NA      !NA        !NA     !NA    
#>  4    18     313  11.5    62     5     4 !NA      !NA        !NA     !NA    
#>  5    NA      NA  14.3    56     5     5 NA       NA         !NA     !NA    
#>  6    28      NA  14.9    66     5     6 !NA      NA         !NA     !NA    
#>  7    23     299   8.6    65     5     7 !NA      !NA        !NA     !NA    
#>  8    19      99  13.8    59     5     8 !NA      !NA        !NA     !NA    
#>  9     8      19  20.1    61     5     9 !NA      !NA        !NA     !NA    
#> 10    NA     194   8.6    69     5    10 NA       !NA        !NA     !NA    
#> # ℹ 143 more rows
#> # ℹ 2 more variables: Month_NA <fct>, Day_NA <fct>

Using the nabular format helps you manage where missing values are in your dataset and makes it easy to create visualisations that are split by missingness:

airquality %>%
  bind_shadow() %>%
  ggplot(aes(x = Temp,
             fill = Ozone_NA)) + 
  geom_density(alpha = 0.5)
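
The same split can be summarised numerically. The sketch below assumes dplyr is installed and uses the Ozone_NA shadow column as a grouping variable; the object name temp_by_ozone_missing is chosen for illustration.

```r
library(naniar)
library(dplyr)

# Compare Temp between rows where Ozone is present and where it is missing
temp_by_ozone_missing <- airquality %>%
  bind_shadow() %>%
  group_by(Ozone_NA) %>%
  summarise(n = n(), mean_temp = mean(Temp))

temp_by_ozone_missing
```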

And even visualise imputations:

airquality %>%
  bind_shadow() %>%
  as.data.frame() %>% 
  simputation::impute_lm(Ozone ~ Temp + Solar.R) %>%
  ggplot(aes(x = Solar.R,
             y = Ozone,
             colour = Ozone_NA)) + 
  geom_point()
#> Warning: Removed 7 rows containing missing values or values outside the scale range
#> (`geom_point()`).

Or create an upset plot, which shows the combinations of missingness across cases, using the gg_miss_upset() function:

gg_miss_upset(airquality)

naniar does this while following consistent principles that are easy to read, thanks to the tools of the tidyverse.

naniar also provides handy visualisations for each variable:

gg_miss_var(airquality)

Or the number of missing values in a given variable at a repeating span:

gg_miss_span(pedestrian,
             var = hourly_counts,
             span_every = 1500)
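
A numerical counterpart to this plot is miss_var_span(), which counts the missing values within each span. The sketch below reuses the pedestrian data and span width from the call above; the object name spans is chosen for illustration.

```r
library(naniar)

# Count missing hourly_counts within consecutive spans of 1500 rows
spans <- miss_var_span(pedestrian,
                       var = hourly_counts,
                       span_every = 1500)
head(spans)
```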

You can read about all of the visualisations in naniar in the vignette Gallery of missing data visualisations using naniar.

naniar also provides handy helpers for calculating the number, proportion, and percentage of missing and complete observations:

n_miss(airquality)
#> [1] 44
n_complete(airquality)
#> [1] 874
prop_miss(airquality)
#> [1] 0.04793028
prop_complete(airquality)
#> [1] 0.9520697
pct_miss(airquality)
#> [1] 4.793028
pct_complete(airquality)
#> [1] 95.20697
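
For a data frame, these helpers count individual cells. As a rough cross-check, the same figures can be reproduced in base R; the variable names below are chosen for illustration.

```r
# Base R equivalents, counting individual cells of the data frame
cells_missing  <- sum(is.na(airquality))   # same count as n_miss()
cells_complete <- sum(!is.na(airquality))  # same count as n_complete()
share_missing  <- mean(is.na(airquality))  # same proportion as prop_miss()

c(cells_missing, cells_complete)
100 * share_missing                        # same percentage as pct_miss()
```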

Numerical summaries for missing data

naniar provides numerical summaries of missing data that follow a consistent naming rule: every function begins with miss_. Summaries focussing on variables, or on a single selected variable, start with miss_var_, and summaries for cases (the rows of the data, in their original order) start with miss_case_. All of these functions that return data frames also work with dplyr’s group_by().

For example, we can look at the number and percentage of missing values in each variable and case with miss_var_summary() and miss_case_summary(), which both return output ordered by the number of missing values.

miss_var_summary(airquality)
#> # A tibble: 6 × 3
#>   variable n_miss pct_miss
#>   <chr>     <int>    <num>
#> 1 Ozone        37    24.2 
#> 2 Solar.R       7     4.58
#> 3 Wind          0     0   
#> 4 Temp          0     0   
#> 5 Month         0     0   
#> 6 Day           0     0
miss_case_summary(airquality)
#> # A tibble: 153 × 3
#>     case n_miss pct_miss
#>    <int>  <int>    <dbl>
#>  1     5      2     33.3
#>  2    27      2     33.3
#>  3     6      1     16.7
#>  4    10      1     16.7
#>  5    11      1     16.7
#>  6    25      1     16.7
#>  7    26      1     16.7
#>  8    32      1     16.7
#>  9    33      1     16.7
#> 10    34      1     16.7
#> # ℹ 143 more rows

You can also use group_by() to work out the number of missing values in each variable within each level of a grouping variable.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
airquality %>%
  group_by(Month) %>%
  miss_var_summary()
#> # A tibble: 25 × 4
#> # Groups:   Month [5]
#>    Month variable n_miss pct_miss
#>    <int> <chr>     <int>    <num>
#>  1     5 Ozone         5     16.1
#>  2     5 Solar.R       4     12.9
#>  3     5 Wind          0      0  
#>  4     5 Temp          0      0  
#>  5     5 Day           0      0  
#>  6     6 Ozone        21     70  
#>  7     6 Solar.R       0      0  
#>  8     6 Wind          0      0  
#>  9     6 Temp          0      0  
#> 10     6 Day           0      0  
#> # ℹ 15 more rows
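
The tabulating counterparts, miss_var_table() and miss_case_table(), follow the same naming rule. As a short sketch, they count how many variables or cases share a given number of missing values; the object names var_table and case_table are chosen for illustration.

```r
library(naniar)

# How many variables have 0, 1, 2, ... missing values?
var_table <- miss_var_table(airquality)
var_table

# How many cases (rows) have 0, 1, 2, ... missing values?
case_table <- miss_case_table(airquality)
case_table
```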

You can read more about all of these functions in the vignette “Getting Started with naniar”.

Statistical tests of missingness

naniar provides mcar_test() for Little’s (1988) statistical test for missing completely at random (MCAR) data. The null hypothesis in this test is that the data are MCAR, and the test statistic is a chi-squared value. Given the high statistic value and low p-value, we reject the null hypothesis and conclude that the airquality data are not missing completely at random:

mcar_test(airquality)
#> # A tibble: 1 × 4
#>   statistic    df p.value missing.patterns
#>       <dbl> <dbl>   <dbl>            <int>
#> 1      35.1    14 0.00142                4
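
Because the result is a one-row tibble, its columns can be used directly in code. The sketch below recomputes the p-value as the upper tail of a chi-squared distribution from the reported statistic and degrees of freedom, then applies a 5% significance threshold; the threshold and the object names are assumptions made for illustration.

```r
library(naniar)

res <- mcar_test(airquality)

# Upper-tail probability of the chi-squared statistic at the reported df
p_from_stat <- pchisq(res$statistic, df = res$df, lower.tail = FALSE)
p_from_stat

# Reject the MCAR null hypothesis at the (assumed) 5% level?
res$p.value < 0.05
```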

Contributions

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

Future Work

  • Extend the geom_miss_* family to include categorical variables, Bivariate plots: scatterplots, density overlays
  • SQL translation for databases
  • Big Data tools (sparklyr, sparklingwater)
  • Work well with other imputation engines / processes
  • Provide tools for assessing goodness of fit for classical approaches of MCAR, MAR, and MNAR (graphical inference from nullabor package)

Acknowledgements

Firstly, thanks to Di Cook for giving the initial inspiration for the package and laying down the rich theory and literature that the work in naniar is built upon. Naming credit (once again!) goes to Miles McBain. Among various other things, Miles also worked out how to overload the missing data and make it work as a geom. Thanks also to Colin Fay for helping me understand tidy evaluation and for features such as replace_to_na, miss_*_cumsum, and more.

A note on the name

naniar was previously named ggmissing and initially provided a ggplot geom and some other visualisations. ggmissing was changed to naniar to reflect the fact that this package is going to be bigger in scope, and is not just related to ggplot2. Specifically, the package is designed to provide a suite of tools for visualising missing values and imputations, and for manipulating and summarising missing data.

…But why naniar?

Well, I think it is useful to think of missing values in data being like this other dimension, perhaps like C.S. Lewis’s Narnia - a different world, hidden away. You go inside, and sometimes it seems like you’ve spent no time in there but time has passed very quickly, or the opposite. Also, NAniar = na in r, and if you so desire, naniar may sound like “noneoya” in an nz/aussie accent. Full credit to @MilesMcbain for the name, and @Hadley for the rearranged spelling.
