
Bring polars to R

polars


The polars package for R gives users access to a lightning fast Data Frame library written in Rust. Polars’ embarrassingly parallel execution, cache efficient algorithms and expressive API make it perfect for efficient data wrangling, data pipelines, snappy APIs, and much more besides. Polars also supports “streaming mode” for out-of-memory operations. This allows users to analyze datasets many times larger than RAM.

Documentation can be found on the r-polars homepage.

The primary developer of the upstream Polars project is Ritchie Vink (@ritchie46). This R port is maintained by Søren Welling (@sorhawell) and contributors. Consider joining our Discord (subchannel) for additional help and discussion.

Install

The package can be installed from R-universe, or GitHub.

Some platforms can install pre-compiled binaries, and others will need to build from source.

R-universe

R-universe provides pre-compiled polars binaries for Windows (x86_64), macOS (x86_64) and Ubuntu 22.04 (x86_64) with source builds for other platforms.

Binary packages on R-universe are compiled by stable Rust, with nightly features disabled.

install.packages("polars", repos = "https://rpolars.r-universe.dev")
# For Ubuntu binary installation
install.packages("polars", repos = "https://rpolars.r-universe.dev/bin/linux/jammy/4.3")

Special thanks to Jeroen Ooms (@jeroen) for the excellent R-universe support.

GitHub releases

We also provide pre-compiled binaries for various operating systems on our GitHub releases page. You can download and install these files manually, or install directly from R. Simply match the URL for your operating system and the desired release. For example, to install the latest release of polars, one can use:

Linux (x86_64)

install.packages(
  "https://github.com/pola-rs/r-polars/releases/latest/download/polars__x86_64-pc-linux-gnu.gz",
  repos = NULL
)

Windows

install.packages(
  "https://github.com/pola-rs/r-polars/releases/latest/download/polars.zip",
  repos = NULL
)

macOS (x86_64)

install.packages(
  "https://github.com/pola-rs/r-polars/releases/latest/download/polars__x86_64-apple-darwin20.tgz",
  repos = NULL
)

Just remember to invoke the repos = NULL argument if you are installing these binary builds directly from within R.

Binary packages on GitHub releases are compiled by nightly Rust, with nightly features enabled.
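
The URL matching described above can be sketched in base R. This is only an illustration: polars_release_url() is a hypothetical helper, not part of polars, and it assumes an x86_64 machine because those are the only binaries listed here.

```r
# Hypothetical helper (not part of polars): map an operating system name,
# as reported by Sys.info(), to the release URL listed above.
polars_release_url <- function(sysname = Sys.info()[["sysname"]]) {
  base <- "https://github.com/pola-rs/r-polars/releases/latest/download/"
  file <- switch(
    sysname,
    Linux   = "polars__x86_64-pc-linux-gnu.gz",
    Windows = "polars.zip",
    Darwin  = "polars__x86_64-apple-darwin20.tgz",
    stop("No pre-compiled binary listed for platform: ", sysname)
  )
  paste0(base, file)
}

# Inspect the URL for a given platform without installing anything.
polars_release_url("Windows")

# Not run: install the binary for the current operating system.
# install.packages(polars_release_url(), repos = NULL)
```

On any other platform the helper stops with an error, which is the cue to build from source instead.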

Build from source

For source installation, the Rust toolchain must be configured. Currently you should install either stable Rust 1.70 or later, or nightly-2023-07-27 if you want the full feature set (including SIMD).

Please see the https://github.com/r-rust/hellorust repository for more about using Rust code in R packages.

During source installation, some environment variables can be set to enable Rust features and profile changes.

  • RPOLARS_FULL_FEATURES="true" (Build with nightly feature enabled, requires Rust toolchain nightly-2023-07-27)
  • RPOLARS_PROFILE="release-optimized" (Build with more optimization)
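
One possible way to set these variables is from within the R session that runs the installation, as sketched below. The use of type = "source" with the R-universe repository is an assumption about how you trigger the source build; setting the variables in your shell before starting R should work equally well.

```r
# Set the build options described above for the current R session only.
Sys.setenv(
  RPOLARS_FULL_FEATURES = "true",        # needs the nightly-2023-07-27 toolchain
  RPOLARS_PROFILE = "release-optimized"  # build with more optimization
)

# Confirm that both variables are set before starting the build.
Sys.getenv(c("RPOLARS_FULL_FEATURES", "RPOLARS_PROFILE"))

# Not run here: a source build compiles the Rust code and can take a while.
if (FALSE) {
  install.packages(
    "polars",
    repos = "https://rpolars.r-universe.dev",
    type = "source"
  )
}
```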

Quickstart example

The Get Started vignette (vignette("polars")) contains a series of detailed examples, but here is a quick illustration.

polars is a very powerful package with many functions. To avoid conflicts with other packages and base R function names, its top-level functions are hosted in the pl namespace and accessible via the pl$ prefix. To convert an R data frame to a Polars DataFrame, we call:

library(polars)

dat = pl$DataFrame(mtcars)
dat
#> shape: (32, 11)
#> ┌──────┬─────┬───────┬───────┬───┬─────┬─────┬──────┬──────┐
#> │ mpg  ┆ cyl ┆ disp  ┆ hp    ┆ … ┆ vs  ┆ am  ┆ gear ┆ carb │
#> │ ---  ┆ --- ┆ ---   ┆ ---   ┆   ┆ --- ┆ --- ┆ ---  ┆ ---  │
#> │ f64  ┆ f64 ┆ f64   ┆ f64   ┆   ┆ f64 ┆ f64 ┆ f64  ┆ f64  │
#> ╞══════╪═════╪═══════╪═══════╪═══╪═════╪═════╪══════╪══════╡
#> │ 21.0 ┆ 6.0 ┆ 160.0 ┆ 110.0 ┆ … ┆ 0.0 ┆ 1.0 ┆ 4.0  ┆ 4.0  │
#> │ 21.0 ┆ 6.0 ┆ 160.0 ┆ 110.0 ┆ … ┆ 0.0 ┆ 1.0 ┆ 4.0  ┆ 4.0  │
#> │ 22.8 ┆ 4.0 ┆ 108.0 ┆ 93.0  ┆ … ┆ 1.0 ┆ 1.0 ┆ 4.0  ┆ 1.0  │
#> │ 21.4 ┆ 6.0 ┆ 258.0 ┆ 110.0 ┆ … ┆ 1.0 ┆ 0.0 ┆ 3.0  ┆ 1.0  │
#> │ …    ┆ …   ┆ …     ┆ …     ┆ … ┆ …   ┆ …   ┆ …    ┆ …    │
#> │ 15.8 ┆ 8.0 ┆ 351.0 ┆ 264.0 ┆ … ┆ 0.0 ┆ 1.0 ┆ 5.0  ┆ 4.0  │
#> │ 19.7 ┆ 6.0 ┆ 145.0 ┆ 175.0 ┆ … ┆ 0.0 ┆ 1.0 ┆ 5.0  ┆ 6.0  │
#> │ 15.0 ┆ 8.0 ┆ 301.0 ┆ 335.0 ┆ … ┆ 0.0 ┆ 1.0 ┆ 5.0  ┆ 8.0  │
#> │ 21.4 ┆ 4.0 ┆ 121.0 ┆ 109.0 ┆ … ┆ 1.0 ┆ 1.0 ┆ 4.0  ┆ 2.0  │
#> └──────┴─────┴───────┴───────┴───┴─────┴─────┴──────┴──────┘

This DataFrame object can be manipulated using many of the usual R functions and accessors, e.g.:

dat[1:4, c("mpg", "qsec", "hp")]
#> shape: (4, 3)
#> ┌──────┬───────┬───────┐
#> │ mpg  ┆ qsec  ┆ hp    │
#> │ ---  ┆ ---   ┆ ---   │
#> │ f64  ┆ f64   ┆ f64   │
#> ╞══════╪═══════╪═══════╡
#> │ 21.0 ┆ 16.46 ┆ 110.0 │
#> │ 21.0 ┆ 17.02 ┆ 110.0 │
#> │ 22.8 ┆ 18.61 ┆ 93.0  │
#> │ 21.4 ┆ 19.44 ┆ 110.0 │
#> └──────┴───────┴───────┘

However, the true power of Polars is unlocked by using methods, which are encapsulated in the DataFrame object itself. For example, we can chain the $groupby() and the $mean() methods to compute group-wise means for each column of the dataset:

dat$groupby("cyl", maintain_order = TRUE)$mean()
#> shape: (3, 11)
#> ┌─────┬───────────┬────────────┬────────────┬───┬──────────┬──────────┬──────────┬──────────┐
#> │ cyl ┆ mpg       ┆ disp       ┆ hp         ┆ … ┆ vs       ┆ am       ┆ gear     ┆ carb     │
#> │ --- ┆ ---       ┆ ---        ┆ ---        ┆   ┆ ---      ┆ ---      ┆ ---      ┆ ---      │
#> │ f64 ┆ f64       ┆ f64        ┆ f64        ┆   ┆ f64      ┆ f64      ┆ f64      ┆ f64      │
#> ╞═════╪═══════════╪════════════╪════════════╪═══╪══════════╪══════════╪══════════╪══════════╡
#> │ 6.0 ┆ 19.742857 ┆ 183.314286 ┆ 122.285714 ┆ … ┆ 0.571429 ┆ 0.428571 ┆ 3.857143 ┆ 3.428571 │
#> │ 4.0 ┆ 26.663636 ┆ 105.136364 ┆ 82.636364  ┆ … ┆ 0.909091 ┆ 0.727273 ┆ 4.090909 ┆ 1.545455 │
#> │ 8.0 ┆ 15.1      ┆ 353.1      ┆ 209.214286 ┆ … ┆ 0.0      ┆ 0.142857 ┆ 3.285714 ┆ 3.5      │
#> └─────┴───────────┴────────────┴────────────┴───┴──────────┴──────────┴──────────┴──────────┘

Note that we use maintain_order = TRUE so that polars always keeps the groups in the same order as they are in the original data.
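
For readers who want to cross-check the result, the same group-wise means can be reproduced in base R. The sketch below does not use polars at all; it only illustrates what "same order as the original data" means, since aggregate() sorts the groups by key and the rows have to be reordered by first appearance to match.

```r
# Base R cross-check (no polars needed): mean of every column within each cyl group.
means <- aggregate(. ~ cyl, data = mtcars, FUN = mean)

# aggregate() returns the groups sorted by key (4, 6, 8). Reorder them by
# first appearance in mtcars (6, 4, 8) to mirror maintain_order = TRUE.
means <- means[match(unique(mtcars$cyl), means$cyl), ]

means[, c("cyl", "mpg", "disp", "hp")]
```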

The polars vignette contains many more examples of how to use the package to:

  • Read CSV, JSON, Parquet, and other file formats.
  • Filter rows and select columns.
  • Modify and create new columns.
  • Group by and aggregate.
  • Reshape data.
  • Join and concatenate different datasets.
  • Sort data.
  • Work with dates and times.
  • Handle missing values.
  • Use the lazy execution engine for maximum performance and memory-efficient operations.
  • Etc.

Development and Contributions

Contributions are very welcome!

As of March 2023, polars has reached nearly 100% coverage of the underlying “lazy” Expr syntax. While translation of the “eager” syntax is still a little further behind, you should be able to do just about everything using $select() + $with_columns(). Most of the methods associated with the DataFrame and LazyFrame classes have been implemented, but not all. There is still much to do, and your help would be much appreciated!
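
As a rough illustration of the $select() + $with_columns() pattern, the sketch below derives a new column from two existing ones and then keeps only the columns of interest. It assumes polars is installed; the column name hp_per_wt is made up for the example, and the exact set of supported methods may differ between polars versions.

```r
library(polars)

dat <- pl$DataFrame(mtcars)

# $with_columns() adds a derived column; $select() then keeps only what we need.
res <- dat$with_columns(
  (pl$col("hp") / pl$col("wt"))$alias("hp_per_wt")  # illustrative column name
)$select(
  pl$col("mpg"),
  pl$col("hp_per_wt")
)

# Convert back to an R data frame and compare against the base R calculation.
out <- as.data.frame(res)
all.equal(out$hp_per_wt, mtcars$hp / mtcars$wt)
```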

If you spot missing functionality (implemented in Python but not R), please let us know on GitHub.

System dependencies

To install the development version of Polars or develop new features, you will need to install the Rust toolchain:

  • Install rustup, the cross-platform Rust installer. Then:

    rustup toolchain install nightly-2023-07-27
    rustup default nightly-2023-07-27
  • Windows: Make sure the latest version of Rtools is installed and on your PATH.

  • macOS: Make sure Xcode is installed.

  • Install CMake and add it to your PATH.

Implementing new features

Here are the steps required for an example contribution, where we are implementing the cosine expression:

  • Look up the polars.Expr.cos method in py-polars documentation.
  • Press the [source] button to see the Python implementation
  • Find the cos py-polars rust implementation (likely just a simple call to the Rust-Polars API)
  • Adapt the Rust part and place it here.
  • Adapt the Python frontend syntax to R and place it here. Add the roxygen docs + examples above.
  • Notice that we use Expr_cos = "use_extendr_wrapper". This means we’re using the extendr auto-generated wrapper unmodified.
  • Write a test here.
  • Run renv::restore() and resolve all R packages
  • Run rextendr::document() to recompile and confirm the added method functions as intended, e.g. pl$DataFrame(a=c(0,pi/2,pi,NA_real_))$select(pl$col("a")$cos())
  • Run devtools::test(). See below for how to set up your development environment correctly.

Note that PRs to polars will automatically be built and tested on all platforms as part of our GitHub Actions workflow. A more detailed description of the development environment and workflow for local builds is provided below.
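
For the "write a test" step, a unit test for the cosine expression might look roughly like the sketch below. It is not the test from the polars repository; it assumes testthat and a build of polars that already includes the new $cos() method, and it reuses the input vector from the confirmation step above.

```r
library(testthat)
library(polars)

test_that("Expr$cos() agrees with base R cos()", {
  x <- c(0, pi / 2, pi, NA_real_)

  # Evaluate the expression in polars and bring the result back to R.
  out <- as.data.frame(
    pl$DataFrame(a = x)$select(pl$col("a")$cos())
  )

  # Missing values should stay missing; other values should match cos().
  expect_equal(out$a, cos(x))
})
```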

Development workflow

Assuming the system dependencies have been met (above), the typical polars development workflow is as follows:

Step 1: Fork the polars repo on GitHub and then clone it locally.

git clone git@github.com:<YOUR-GITHUB-ACCOUNT>/r-polars.git
cd r-polars

Step 2: Build the package and install the suggested package dependencies.

  • Option A: Using devtools.

    Rscript -e 'devtools::install(pkg = ".", dependencies = TRUE)' 
  • Option B: Using renv.

    # Rscript -e 'install.packages("renv")'
    Rscript -e 'renv::activate(); renv::restore()'

Step 3: Make your proposed changes to the R and/or Rust code. Don’t forget to run:

rextendr::document() # compile Rust code + update wrappers & docs
devtools::test()     # run all unit tests

Step 4 (optional): Build the package locally.

R CMD INSTALL --no-multiarch --with-keep.source .

Step 5: Commit your changes and submit a PR to the main polars repo.

  • As an aside, notice that ./renv.lock sets the versions of all R packages during the server build.

Tip: To speed up the local rextendr::document() or R CMD check, run the following:

source("inst/misc/develop_polars.R")

# runs rextendr::document() + not_cran + load packages + all_features
load_polars()

# checks the package, reusing the previous compilation and protecting it against deletion
check_polars() # assumes rust target at `paste0(getwd(),"/src/rust")`
  • The RPOLARS_RUST_SOURCE environment variable allows polars to recover the Cargo cache even if source files have been moved. Replace with your own absolute path to your local clone!
  • filter_rcmdcheck.R removes known warnings from final check report.
  • unlink("check") cleans up.

Misc

If you experience unexpectedly sluggish performance when using polars in a given IDE, we’d like to hear about it. You can try activating pl$set_polars_options(debug_polars = TRUE) to profile which methods are being touched (not necessarily run) and how fast they are. Below is an example of good behavior.

#run e.g. an eager query after setting debug_polars = TRUE
pl$DataFrame(iris)$select("Species")

[TIME? ms]
pl$DataFrame() -> [0.73ms]
   .pr$DataFrame$new_with_capacity() -> [0.56ms]
   .pr$DataFrame$set_column_from_robj() -> [11.04ms]
   .pr$DataFrame$set_column_from_robj() -> [0.3309ms]
   .pr$DataFrame$set_column_from_robj() -> [0.283ms]
   .pr$DataFrame$set_column_from_robj() -> [0.2761ms]
   .pr$DataFrame$set_column_from_robj() -> [12.54ms]
DataFrame$select() -> [0.3681ms]
ProtoExprArray$push_back_rexpr() -> [0.21ms]
pl$col() -> [0.1669ms]
   .pr$Expr$col() -> [0.212ms]
   .pr$DataFrame$select() -> [1.229ms]
DataFrame$print() -> [0.1781ms]
   .pr$DataFrame$print() -> shape: (150, 1)
┌───────────┐
│ Species   │
│ ---       │
│ cat       │
╞═══════════╡
│ setosa    │
│ setosa    │
│ setosa    │
│ setosa    │
│ …         │
│ virginica │
│ virginica │
│ virginica │
│ virginica │
└───────────┘