
Data table backend for dplyr

dtplyr


Overview

dtplyr provides a data.table backend for dplyr. The goal of dtplyr is to allow you to write dplyr code that is automatically translated to the equivalent, but usually much faster, data.table code.

See vignette("translation") for details of the current translations, and table.express and rqdatatable for related work.

Installation

You can install from CRAN with:

install.packages("dtplyr")

Or try the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("tidyverse/dtplyr")

Usage

To use dtplyr, you must at least load dtplyr and dplyr. You may also want to load data.table so you can access the other goodies that it provides:

library(data.table)
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

Then use lazy_dt() to create a “lazy” data table that tracks the operations performed on it.

mtcars2 <- lazy_dt(mtcars)

You can preview the transformation (including the generated data.table code) by printing the result:

mtcars2 %>% 
  filter(wt < 5) %>% 
  mutate(l100k = 235.21 / mpg) %>% # liters / 100 km
  group_by(cyl) %>% 
  summarise(l100k = mean(l100k))
#> Source: local data table [3 x 2]
#> Call:   `_DT1`[wt < 5][, `:=`(l100k = 235.21/mpg)][, .(l100k = mean(l100k)), 
#>     keyby = .(cyl)]
#> 
#>     cyl l100k
#>   <dbl> <dbl>
#> 1     4  9.05
#> 2     6 12.0 
#> 3     8 14.9 
#> 
#> # Use as.data.table()/as.data.frame()/as_tibble() to access results
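
If you only want to inspect the generated data.table code without printing a preview of the data, dplyr's show_query() generic can be applied to a lazy data table. The following is a minimal sketch; the variable names `pipeline` and `query` are illustrative and not part of dtplyr.

```r
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

# Build a lazy pipeline; nothing is computed at this point
pipeline <- lazy_dt(mtcars) %>%
  filter(wt < 5) %>%
  mutate(l100k = 235.21 / mpg) # liters / 100 km

# show_query() gives the data.table translation of the pipeline
query <- show_query(pipeline)
print(query)
```

As with printing, this is mainly a debugging aid for checking how a chain of dplyr verbs is translated.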

But generally you should reserve printing for debugging, and use as.data.table(), as.data.frame(), or as_tibble() to indicate that you're done with the transformation and want to access the results:

mtcars2 %>% 
  filter(wt < 5) %>% 
  mutate(l100k = 235.21 / mpg) %>% # liters / 100 km
  group_by(cyl) %>% 
  summarise(l100k = mean(l100k)) %>% 
  as_tibble()
#> # A tibble: 3 × 2
#>     cyl l100k
#>   <dbl> <dbl>
#> 1     4  9.05
#> 2     6 12.0 
#> 3     8 14.9
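
The same pipeline can instead end in as.data.table() if you want to continue with native data.table syntax. This is a small sketch under that assumption; the name `result` and the final ordering step are illustrative only.

```r
library(data.table)
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

result <- lazy_dt(mtcars) %>%
  filter(wt < 5) %>%
  mutate(l100k = 235.21 / mpg) %>% # liters / 100 km
  group_by(cyl) %>%
  summarise(l100k = mean(l100k)) %>%
  as.data.table()

# `result` is an ordinary data.table, so data.table syntax applies:
# sort the groups by descending fuel consumption
result[order(-l100k)]
```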

Why is dtplyr slower than data.table?

There are two primary reasons that dtplyr will always be somewhat slower than data.table:

  • Each dplyr verb must do some work to convert dplyr syntax to data.table syntax. This takes time proportional to the complexity of the input code, not the input data, so the overhead should be negligible for large datasets. Initial benchmarks suggest that the overhead should be under 1ms per dplyr call.

  • To match dplyr semantics, mutate() does not modify in place by default. This means that most expressions involving mutate() must make a copy that would not be necessary if you were using data.table directly. (You can opt out of this behaviour in lazy_dt() with immutable = FALSE.)
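
The copy-versus-modify difference described in the second point can be sketched as follows. The object names (`dt_default`, `dt_inplace`) are illustrative, and the example assumes the input is already a data.table, since a plain data frame is converted to a data.table first.

```r
library(data.table)
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

# Default (immutable = TRUE): mutate() works on a copy,
# so the original data.table is left untouched
dt_default <- as.data.table(mtcars)
res_default <- lazy_dt(dt_default) %>%
  mutate(l100k = 235.21 / mpg) %>%
  as.data.table()
"l100k" %in% names(dt_default)

# immutable = FALSE: dtplyr is allowed to modify the input in place,
# which avoids the copy but changes the original object
dt_inplace <- as.data.table(mtcars)
res_inplace <- lazy_dt(dt_inplace, immutable = FALSE) %>%
  mutate(l100k = 235.21 / mpg) %>%
  as.data.table()
"l100k" %in% names(dt_inplace)
```

Only use immutable = FALSE when you are sure the original data.table is not needed elsewhere in its unmodified form.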

Code of Conduct

Please note that the dtplyr project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.
