  • Stars: 649
  • Rank: 65,365 (Top 2%)
  • Language: R
  • License: Other
  • Created almost 8 years ago
  • Updated 2 months ago

Repository Details

Data table backend for dplyr

dtplyr

Overview

dtplyr provides a data.table backend for dplyr. The goal of dtplyr is to allow you to write dplyr code that is automatically translated to the equivalent, but usually much faster, data.table code.

See vignette("translation") for details of the current translations, and table.express and rqdatatable for related work.

Installation

You can install from CRAN with:

install.packages("dtplyr")

Or try the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("tidyverse/dtplyr")

Usage

To use dtplyr, you must at least load dtplyr and dplyr. You may also want to load data.table so you can access the other goodies that it provides:

library(data.table)
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

Then use lazy_dt() to create a “lazy” data table that tracks the operations performed on it.

mtcars2 <- lazy_dt(mtcars)

You can preview the transformation (including the generated data.table code) by printing the result:

mtcars2 %>% 
  filter(wt < 5) %>% 
  mutate(l100k = 235.21 / mpg) %>% # liters / 100 km
  group_by(cyl) %>% 
  summarise(l100k = mean(l100k))
#> Source: local data table [3 x 2]
#> Call:   `_DT1`[wt < 5][, `:=`(l100k = 235.21/mpg)][, .(l100k = mean(l100k)), 
#>     keyby = .(cyl)]
#> 
#>     cyl l100k
#>   <dbl> <dbl>
#> 1     4  9.05
#> 2     6 12.0 
#> 3     8 14.9 
#> 
#> # Use as.data.table()/as.data.frame()/as_tibble() to access results

But generally you should reserve this only for debugging, and use as.data.table(), as.data.frame(), or as_tibble() to indicate that you’re done with the transformation and want to access the results:

mtcars2 %>% 
  filter(wt < 5) %>% 
  mutate(l100k = 235.21 / mpg) %>% # liters / 100 km
  group_by(cyl) %>% 
  summarise(l100k = mean(l100k)) %>% 
  as_tibble()
#> # A tibble: 3 × 2
#>     cyl l100k
#>   <dbl> <dbl>
#> 1     4  9.05
#> 2     6 12.0 
#> 3     8 14.9
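
If you only want to inspect the generated data.table code, without the preview rows, show_query() is one way to do it. The following is a minimal sketch reusing the mtcars example; the variable names `query` and `translation` are illustrative, not part of dtplyr.

```r
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

mtcars2 <- lazy_dt(mtcars)

# Build the pipeline lazily; the data.table code is not run at this point.
query <- mtcars2 %>%
  filter(wt < 5) %>%
  group_by(cyl) %>%
  summarise(mpg = mean(mpg))

# Retrieve the translated data.table call without evaluating it.
translation <- show_query(query)
print(translation)
```

The result is an ordinary R language object, so it can be printed, logged, or compared against hand-written data.table code while debugging.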

Why is dtplyr slower than data.table?

There are two primary reasons that dtplyr will always be somewhat slower than data.table:

  • Each dplyr verb must do some work to convert dplyr syntax to data.table syntax. This takes time proportional to the complexity of the input code, not the size of the input data, so the overhead should be negligible for large datasets. Initial benchmarks suggest that the overhead should be under 1 ms per dplyr call.

  • To match dplyr semantics, mutate() does not modify in place by default. This means that most expressions involving mutate() must make a copy that would not be necessary if you were using data.table directly. (You can opt out of this behaviour in lazy_dt() with immutable = FALSE.)
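
The copy-versus-in-place trade-off described in the second point can be sketched as follows. This is an illustrative example under the assumption that the input is already a data.table (immutable = FALSE is not intended for plain data frames); the names `dt_safe` and `dt_inplace` are hypothetical.

```r
library(data.table)
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

# Default (immutable = TRUE): mutate() works on a copy,
# so the original data.table keeps its columns unchanged.
dt_safe <- as.data.table(mtcars)
res_safe <- lazy_dt(dt_safe) %>%
  mutate(l100k = 235.21 / mpg) %>%
  as.data.table()

# immutable = FALSE: mutate() is allowed to modify the input by reference,
# which avoids the copy but changes the original data.table.
dt_inplace <- as.data.table(mtcars)
res_inplace <- lazy_dt(dt_inplace, immutable = FALSE) %>%
  mutate(l100k = 235.21 / mpg) %>%
  as.data.table()

# Compare whether the new column leaked into each original table.
print("l100k" %in% names(dt_safe))
print("l100k" %in% names(dt_inplace))
```

Opting out of immutability saves the copy on large tables, at the cost of dplyr's usual guarantee that the input is left untouched, so it is best kept to pipelines where the original table is no longer needed.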

Code of Conduct

Please note that the dtplyr project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.
