Fast reading of delimited files

🏎💨vroom

The fastest delimited reader for R, 1.23 GB/sec.

But that's impossible! How can it be so fast?

vroom doesn't stop to actually read all of your data; it simply indexes where each record is located so it can be read later. The vectors returned use the Altrep framework to lazily load the data on demand when it is accessed, so you only pay for what you use. This lazy access is done automatically, so no changes to your R data-manipulation code are needed.
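
As a rough illustration of the lazy behaviour described above, the sketch below writes a small throwaway file and reads it back. The temporary path, the data frame `df`, and its column names are made up for this example, and `altrep = TRUE` is passed explicitly only to make the setting visible.

```r
library(vroom)

# Hypothetical example data written to a temporary file.
path <- tempfile(fileext = ".tsv")
df <- data.frame(
  id = 1:5,
  name = letters[1:5],
  score = c(1.5, 2.5, 3.5, 4.5, 5.5)
)
vroom_write(df, path, delim = "\t")

# Reading indexes the file; Altrep-backed columns are parsed when accessed.
x <- vroom(path, delim = "\t", altrep = TRUE, show_col_types = FALSE)

# Ordinary code works unchanged: touching a column triggers its parsing.
avg_score <- mean(x$score)
avg_score
```

Nothing in the calling code refers to Altrep directly, which is the point: the result behaves like any other tibble.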

vroom also uses multiple threads for indexing, for materializing non-character columns, and for writing, which further improves performance.
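
The thread count can also be set per call through the `num_threads` argument. The following is a minimal sketch on made-up data; the file and column names are illustrative only, and any speed difference on a file this small would be negligible.

```r
library(vroom)

# Hypothetical data written to a temporary CSV file.
path <- tempfile(fileext = ".csv")
vroom_write(data.frame(x = 1:1000, y = rnorm(1000)), path, delim = ",")

# Limit vroom to two threads for this call only.
two <- vroom(path, delim = ",", num_threads = 2, show_col_types = FALSE)

# A single thread; the feature list notes this is required for
# embedded newlines in headers and fields.
one <- vroom(path, delim = ",", num_threads = 1, show_col_types = FALSE)

# Both reads should produce the same shape.
identical(dim(one), dim(two))
```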

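| package    | version | time (sec) | speedup | throughput    |
|------------|---------|------------|---------|---------------|
| vroom      | 1.5.1   | 1.36       | 53.30   | 1.23 GB/sec   |
| data.table | 1.14.0  | 5.83       | 12.40   | 281.65 MB/sec |
| readr      | 1.4.0   | 37.30      | 1.94    | 44.02 MB/sec  |
| read.delim | 4.1.0   | 72.31      | 1.00    | 22.71 MB/sec  |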

Features

vroom has nearly all of the parsing features of readr for delimited and fixed width files, including

  • delimiter guessing*
  • custom delimiters (including multi-byte* and Unicode* delimiters)
  • specification of column types (including type guessing)
    • numeric types (double, integer, big integer*, number)
    • logical types
    • datetime types (datetime, date, time)
    • categorical types (characters, factors)
  • column selection, like dplyr::select()*
  • skipping headers, comments and blank lines
  • quoted fields
  • double and backslashed escapes
  • whitespace trimming
  • Windows newlines
  • reading from multiple files or connections*
  • embedded newlines in headers and fields**
  • writing delimited files with as-needed quoting.
  • robust to invalid inputs (vroom has been extensively tested with the afl fuzz tester)*.

* these are additional features not in readr.

** requires num_threads = 1.
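
A couple of the features listed above, delimiter guessing and dplyr-style column selection, can be tried with a short sketch like the one below. The file contents and column names are invented for illustration.

```r
library(vroom)

# Hypothetical pipe-delimited file.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "name|height|city",
  "Ada|1.70|London",
  "Linus|1.80|Helsinki"
), path)

# delim is left unset so vroom guesses it; col_select keeps two columns.
people <- vroom(path, col_select = c(name, height), show_col_types = FALSE)
people
```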

Installation

Install vroom from CRAN with:

install.packages("vroom")

Alternatively, if you need the development version from GitHub install it with:

# install.packages("pak")
pak::pak("tidyverse/vroom")

Usage

See getting started to jump start your use of vroom!

vroom uses the same interface as readr to specify column types.

vroom::vroom("mtcars.tsv",
  col_types = list(cyl = "i", gear = "f", hp = "i", disp = "_",
                   drat = "_", vs = "l", am = "l", carb = "i")
)
#> # A tibble: 32 × 10
#>   model           mpg   cyl    hp    wt  qsec vs    am    gear   carb
#>   <chr>         <dbl> <int> <int> <dbl> <dbl> <lgl> <lgl> <fct> <int>
#> 1 Mazda RX4      21       6   110  2.62  16.5 FALSE TRUE  4         4
#> 2 Mazda RX4 Wag  21       6   110  2.88  17.0 FALSE TRUE  4         4
#> 3 Datsun 710     22.8     4    93  2.32  18.6 TRUE  TRUE  4         1
#> # ℹ 29 more rows
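
The compact one-letter codes above can also be spelled out with the readr-style `cols()` helpers that vroom exports. The sketch below assumes a file laid out like `mtcars.tsv` and recreates one from the built-in `mtcars` data, so the exact file contents are an assumption rather than the file used above.

```r
library(vroom)

# Recreate a tab-separated file from the built-in mtcars data (assumed layout).
path <- tempfile(fileext = ".tsv")
cars <- cbind(model = rownames(mtcars), mtcars)
vroom_write(cars, path, delim = "\t")

# Long-form equivalent of the one-letter specification.
spec <- cols(
  cyl = col_integer(),
  gear = col_factor(),
  hp = col_integer(),
  disp = col_skip(),
  drat = col_skip(),
  vs = col_logical(),
  am = col_logical(),
  carb = col_integer()
)

# Columns not named in spec (model, mpg, wt, qsec) are still guessed.
out <- vroom(path, col_types = spec, show_col_types = FALSE)
out
```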

Reading multiple files

vroom natively supports reading from multiple files (or even multiple connections!).

First we generate some files to read by splitting the flights dataset from nycflights13 by airline.

library(nycflights13)
# Write one tab-separated file per carrier, e.g. flights_AA.tsv.
purrr::iwalk(
  split(flights, flights$carrier),
  ~ vroom::vroom_write(.x, glue::glue("flights_{.y}.tsv"), delim = "\t")
)

Then we can efficiently read them into one tibble by passing the filenames directly to vroom.

files <- fs::dir_ls(glob = "flights*tsv")
files
#> flights_9E.tsv flights_AA.tsv flights_AS.tsv flights_B6.tsv flights_DL.tsv 
#> flights_EV.tsv flights_F9.tsv flights_FL.tsv flights_HA.tsv flights_MQ.tsv 
#> flights_OO.tsv flights_UA.tsv flights_US.tsv flights_VX.tsv flights_WN.tsv 
#> flights_YV.tsv
vroom::vroom(files)
#> Rows: 336776 Columns: 19
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr   (4): carrier, tailnum, origin, dest
#> dbl  (14): year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, ...
#> dttm  (1): time_hour
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 336,776 × 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <dbl> <dbl> <dbl>    <dbl>          <dbl>     <dbl>    <dbl>          <dbl>
#> 1  2013     1     1      810            810         0     1048           1037
#> 2  2013     1     1     1451           1500        -9     1634           1636
#> 3  2013     1     1     1452           1455        -3     1637           1639
#> # ℹ 336,773 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <dbl>,
#> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> #   hour <dbl>, minute <dbl>, time_hour <dttm>
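
When combining files it is often useful to know which file each row came from. A minimal sketch, using the `id` argument of `vroom()` on two tiny made-up files (the directory, file contents, and the column name `source_file` are illustrative):

```r
library(vroom)

# Two hypothetical per-carrier files in a scratch directory.
dir <- tempfile("flights_demo_")
dir.create(dir)
vroom_write(data.frame(carrier = "AA", flight = 1:3),
            file.path(dir, "flights_AA.tsv"), delim = "\t")
vroom_write(data.frame(carrier = "UA", flight = 4:5),
            file.path(dir, "flights_UA.tsv"), delim = "\t")

files <- list.files(dir, pattern = "^flights_.*\\.tsv$", full.names = TRUE)

# id adds a column holding the path each row was read from.
combined <- vroom(files, id = "source_file", show_col_types = FALSE)
table(basename(combined$source_file))
```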

Learning more

Benchmarks

The speed quoted above is from a real 1.53 GB dataset with 14,388,451 rows and 11 columns. See the benchmark article for full details of the dataset, and bench/ for the code used to retrieve the data and perform the benchmarks.

Environment variables

In addition to the arguments to the vroom() function, you can control the behavior of vroom with a few environment variables. Generally these will not need to be set by most users.

  • VROOM_TEMP_PATH - Path to the directory used to store temporary files when reading from an R connection. If unset, defaults to the R session's temporary directory (tempdir()).
  • VROOM_THREADS - The number of processor threads to use when indexing and parsing. If unset, defaults to parallel::detectCores().
  • VROOM_SHOW_PROGRESS - Whether to show the progress bar when indexing. Regardless of this setting, the progress bar is disabled in non-interactive settings, R notebooks, when running tests with testthat, and when knitting documents.
  • VROOM_CONNECTION_SIZE - The size (in bytes) of the connection buffer when reading from connections (default is 128 KiB).
  • VROOM_WRITE_BUFFER_LINES - The number of lines to use for each buffer when writing files (default: 1000).
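
These variables can be set from within R using base `Sys.setenv()`, as in the sketch below. The specific values are arbitrary examples, and it is an assumption here that `VROOM_SHOW_PROGRESS` accepts the same true/false spellings as the Altrep variables described further down.

```r
# Set before calling vroom(); environment variable values are always strings.
Sys.setenv(
  VROOM_THREADS = "2",
  VROOM_SHOW_PROGRESS = "false",               # assumed spelling
  VROOM_CONNECTION_SIZE = as.character(2^20)   # 1 MiB buffer, arbitrary example
)

# Confirm what the current session will see.
Sys.getenv(c("VROOM_THREADS", "VROOM_SHOW_PROGRESS", "VROOM_CONNECTION_SIZE"))
```

To make a setting persist across sessions, the same name and value pairs can be placed in an `.Renviron` file instead.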

There is also a family of variables that control use of the Altrep framework. For versions of R where the Altrep framework is unavailable (R < 3.5.0), they are automatically turned off and the variables have no effect. Each variable can take one of true, false, TRUE, FALSE, 1, or 0.

  • VROOM_USE_ALTREP_NUMERICS - If set, use Altrep for all numeric types (default false).

There are also individual variables for each type. Currently only VROOM_USE_ALTREP_CHR defaults to true.

  • VROOM_USE_ALTREP_CHR
  • VROOM_USE_ALTREP_FCT
  • VROOM_USE_ALTREP_INT
  • VROOM_USE_ALTREP_BIG_INT
  • VROOM_USE_ALTREP_DBL
  • VROOM_USE_ALTREP_NUM
  • VROOM_USE_ALTREP_LGL
  • VROOM_USE_ALTREP_DTTM
  • VROOM_USE_ALTREP_DATE
  • VROOM_USE_ALTREP_TIME
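
As a small sketch of how the Altrep variables might be toggled for a session, again using only base R (the chosen values are examples, not recommendations):

```r
# Opt in to Altrep for all numeric types, and opt out for one specific type.
Sys.setenv(VROOM_USE_ALTREP_NUMERICS = "true")
Sys.setenv(VROOM_USE_ALTREP_DTTM = "false")

altrep_settings <- Sys.getenv(c("VROOM_USE_ALTREP_NUMERICS", "VROOM_USE_ALTREP_DTTM"))
altrep_settings

# Remove the overrides again to return to the package defaults.
Sys.unsetenv(c("VROOM_USE_ALTREP_NUMERICS", "VROOM_USE_ALTREP_DTTM"))
```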

RStudio caveats

RStudio's environment pane calls object.size() when it refreshes the pane, which for Altrep objects can be extremely slow. RStudio 1.2.1335+ includes the fixes (RStudio#4210, RStudio#4292) for this issue, so it is recommended you use at least that version.

Thanks
