• Stars
    star
    128
  • Rank 265,519 (Top 6 %)
  • Language
    C++
  • License
    Other
  • Created almost 5 years ago
  • Updated 4 months ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Fast map implementation for R

fastmap

CRAN status R build status

fastmap implements the following data structures for R:

  • fastmap: maps (key-value store)
  • faststack: stacks
  • fastqueue: queues

The usual way of implementing maps in R is to use environments. However, this method is problematic when using a large set of keys or randomly-generated keys, because each time you use a key or even check for the existence of a key using exists(), that key is interned as a symbol and stored in the R symbol table, which is never garbage-collected. This means that every time you use a new key -- whether it is to store an object or just check whether the key exists in the environment, R leaks a little memory. If you have a relatively small, fixed set of keys, or if your R process is a short-running process, this may not be a problem. But if, for example, you have a long-running R process that uses random keys, then the memory leakage can cause a noticeable increase in memory usage. Also, when R's symbol table is large, garbage collection events, which occur regularly, take more time, reducing R's performance in general. (See the Memory leak examples section of this document for more information.)

fastmap solves this problem by storing the keys as C++ std::string objects, and so it does not use the R symbol table at all. The values are stored in a list so that R knows not to garbage-collect them. In C++, fastmap uses a tsl::hopscotch_map (which is similar to std::unordered_map) to map from the keys to indices in the list of values.

Installation

install.packages("fastmap")

Usage

fastmap()

library(fastmap)

# Create a map
m <- fastmap()

# Set some key-value pairs
m$set("x", 100)
m$set("letters", c("a", "b", "c"))
m$mset(numbers = c(10, 20, 30), nothing = NULL)

# Get values using keys
m$get("x")
#> [1] 100
m$get("numbers")
#> [1] 10 20 30
m$mget(c("letters", "numbers"))
#> $letters
#> [1] "a" "b" "c"
#>
#> $numbers
#> [1] 10 20 30

# Missing keys return NULL by default, but this can be customized
m$get("xyz")
#> NULL

# Check for existence of keys
m$has("x")
#> [1] TRUE
m$has("nothing")
#> [1] TRUE
m$has("xyz")
#> [1] FALSE

# Remove one or more items
m$remove(c("letters", "x"))

# Return number of items
m$size()
#> [1] 2

# Get all keys
m$keys()
#> [1] "nothing" "numbers"

# Return named list that represents all key-value pairs
str(m$as_list())
#> List of 3
#>  $ nothing: NULL
#>  $ numbers: num [1:3] 10 20 30

# Clear the map
m$reset()

# Create a copy of the fastmap
m1 <- m$clone()

By default, get() returns NULL for keys that aren't present. You can instead specify a sentinel value to return for missing keys, either when the fastmap is created, or when get() is called. For example, you can return a key_missing() object to represent missing values:

# Specify missing value when get() is called
m <- fastmap()
m$get("x", missing = key_missing())
#> <Key Missing>

# Specify the default missing value
m <- fastmap(missing_default = key_missing())
m$get("x")
#> <Key Missing>

faststack()

s <- faststack()
s$push(10)
s$mpush(11, 12, 13)
s$mpush(.list = list(14, 15))

s$pop()
#> [1] 15

str(s$mpop(3))
#> List of 3
#>  $ : num 14
#>  $ : num 13
#>  $ : num 12

s$peek()
#> [1] 11

s$size()
#> [1] 2

# Get the stack in list form. Note that the order is the opposite of $mpop()
str(s$as_list())
#> List of 2
#>  $ : num 10
#>  $ : num 11

s$reset()

By default, popping from an empty stack returns NULL, but you can specify other values.

s$pop()
#> NULL

# Can specify the default missing value at creation.
s <- faststack(missing_default = key_missing())
s$pop()
#> <Key Missing>

# Can specify a missing value when $pop is called
s$pop(missing = "nope")
#> [1] "nope"

fastqueue()

q <- fastqueue()
q$add(10)
q$madd(11, 12, 13)
q$madd(.list = list(14, 15))

q$remove()
#> [1] 10

str(q$mremove(3))
#> List of 3
#>  $ : num 11
#>  $ : num 12
#>  $ : num 13

q$peek()
#> [1] 14

q$size()
#> [1] 2

# Get the queue in list form.
str(q$as_list())
#> List of 2
#>  $ : num 14
#>  $ : num 15

q$reset()

By default, removing from an empty queue returns NULL, but you can specify other values.

q$remove()
#> NULL

# Can specify the default missing value at creation.
q <- fastqueue(missing_default = key_missing())
q$remove()
#> <Key Missing>

# Can specify a missing value when $pop is called
q$remove(missing = "nope")
#> [1] "nope"

Notes on fastmap objects

Key ordering

When you call m$keys() or m$as_list(), the items are returned in an arbitrary order. Keep in mind that there is no guarantee that the order will be the same across platforms, or across different builds of fastmap.

If you want to guarantee a particular order, you can call m$keys(sort=TRUE) or m$as_list(sort=TRUE). The result will be a locale-independent sorting of the keys by their Unicode code point values. For example, Γ© (Unicode code point 233) comes after z (122). If you want the keys to be sorted a different way, you will need to sort them yourself.

Serialization

A fastmap object can be serialized (or saved) in one R session and deserialized (or loaded) in another. For performance, the data structure that tracks the mapping between keys and values is implemented in C++, and this data structure will not be serialized, but fastmap also keeps a copy of the same information in an ordinary R vector, which will be serialized. After a fastmap object is deserialized, the C++ data structure will not exist, but the first time any method on the fastmap is called, the C++ data structure will be rebuilt using information from the R vector.

The vector is much slower for lookups, and so it is used only for restoring the C++ data structure after a fastmap object is deserialized or loaded.

Key encoding

Unlike with environments, the keys in a fastmap are always encoded as UTF-8, so if you call m$set() with two different strings that have the same Unicode values but have different encodings, the second call will overwrite the first value. If you call m$keys(), it will return UTF-8 encoded strings, and similarly, m$mget() and m$as_list() will return lists with names that have UTF-8 encoding.

Testing for equality

The base R functions identical() and all.equal() are commonly used to test two objects for equality, but they will not work correctly for fastmap objects. identical() will always report FALSE for two distinct fastmap objects, even if they have the same contents, while all.equal() will always report TRUE for two fastmap objects.

To test whether two fastmap objects have the same contents, compare the results of $as_list(sort=TRUE) for both of the objects. For example:

identical(a$as_list(sort = TRUE), b$as_list(sort = TRUE))
# or
all.equal(a$as_list(sort = TRUE), b$as_list(sort = TRUE))

These comparisons are subject to the technical details of how identical() and all.equal() treat named lists.

Memory leak examples

This example shows how using a regular R environment leaks memory, even when simply checking for the existence of a key.

library(pryr)
gc()
start_mem <- mem_used()
start_time <- as.numeric(Sys.time())
for (i in 1:8) {
  cat(i, ": ", sep = "")
  print(mem_used())
  e <- new.env(parent = emptyenv())
  for (j in 1:10000) {
    # Generate random key
    x <- as.character(runif(1))
    exists(x, envir = e, inherits = FALSE)
  }
  rm(e, x)
}
end_time <- as.numeric(Sys.time())
gc()
end_mem <- mem_used()
cat("Elapsed time:", round(end_time - start_time, 1), "seconds\n")
cat("Memory leaked:", end_mem - start_mem, "bytes\n")

The output looks something like this:

1: 57.9 MB
2: 59.9 MB
3: 61.9 MB
4: 64.4 MB
5: 66.4 MB
6: 68.4 MB
7: 70.4 MB
8: 72.4 MB
Elapsed time: 1.1 seconds
Memory leaked: 16243656 bytes

The elapsed time gets progressively slower as the R symbol table gets larger and larger. After running the above code repeatedly, the elapsed time for the fifth run is 3.1 seconds. If you profile the code with profvis, you can see that most of the slowdown is not with environment operations themselves, but with garbage collection events. This slowdown appears to affect all GC events, even when no environment-related operations are performed between one GC and the next.

For comparison, this example with fastmap does the same thing.

library(fastmap)
library(pryr)
gc()
start_mem <- mem_used()
start_time <- as.numeric(Sys.time())
for (i in 1:8) {
  cat(i, ": ", sep = "")
  print(mem_used())
  m <- fastmap()
  for (j in 1:10000) {
    x <- as.character(runif(1))
    m$has(x)
  }
  rm(m, x)
}
end_time <- as.numeric(Sys.time())
gc()
end_mem <- mem_used()
cat("Elapsed time:", round(end_time - start_time, 1), "seconds\n")
cat("Memory leaked:", end_mem - start_mem, "bytes\n")

The output in a new R session looks something like this (note that this is from the second run of the code above -- for the first run, there is an increase in memory used, but it is probably related to code being run for the first time in the R session):

1: 42.3 MB
2: 42.3 MB
3: 42.3 MB
4: 42.3 MB
5: 42.3 MB
6: 42.3 MB
7: 42.3 MB
8: 42.3 MB
Elapsed time: 0.9 seconds
Memory leaked: 0 bytes

It does not leak memory, and it does not slow down if you run it repeatedly. After running it ten times, it still takes 0.9 seconds, and leaks no memory.

The simple tests above simply check for the existence of keys, but with setting values, the results are similar.

Note that the environment operations are themselves slightly faster than the fastmap operations, but the penalty is in slower garbage collection when many keys have been used. Also keep in mind that these tests are very artificial and use tens of thousands of random keys; if your application does not do this, then fastmap may have no practical benefit. In general, these operations are so fast that performance bottlenecks almost always lie elsewhere.

Testing your code for symbol leakage

If you want to test your code directly for symbol leakage, you can use the code below. (Note: This only works on Mac.)

The get_symbols() function returns all symbols that are registered in R's symbol table.

new_symbols() returns all symbols that have been added since the last time new_symbols() was run. If you want to test whether your code causes the symbol table to grow, run new_symbols(), then run your code, then run new_symbols() again.

# Note: this will only compile on a Mac. `R_SymbolTable` is not an exported
# symbol from Defn.h, but the a Mac, the linker exports all C symbols.

get_symbols <- inline::cfunction(
  includes = "
    #define HSIZE   49157 /* The size of the hash table for symbols, from Defn.h */
    extern SEXP* R_SymbolTable;
  ",
  body = "
    int symbol_count = 0;
    SEXP s;
    int j;
    for (j = 0; j < HSIZE; j++) {
      for (s = R_SymbolTable[j]; s != R_NilValue; s = CDR(s)) {
        if (CAR(s) != R_NilValue) {
          symbol_count++;
        }
      }
    }


    SEXP result = PROTECT(Rf_allocVector(STRSXP, symbol_count));
    symbol_count = 0;
    for (j = 0; j < HSIZE; j++) {
      for (s = R_SymbolTable[j]; s != R_NilValue; s = CDR(s)) {
        if (CAR(s) != R_NilValue) {
          SET_STRING_ELT(result, symbol_count, PRINTNAME(CAR(s)));
          symbol_count++;
        }
      }
    }

    UNPROTECT(1);
    return result;
  "
)

# Test it out
get_symbols()


# new_symbols() returns a character vector of symbols that have been added since
# the last time it was run.
last_symbols <- get_symbols()
new_symbols <- function() {
  cur_symbols <- get_symbols()
  res <- setdiff(cur_symbols, last_symbols)
  last_symbols <<- cur_symbols
  res
}

# Example

# The first couple times it's run, R might do something that adds symbols, like
# load the compiler package. Run it a bunch of times until it returns
# character(0).
new_symbols()
new_symbols()
new_symbols()
# character(0)


# After R stops loading things, run our code and see which new symbols have
# been added.
abcdefg <- 1
exists("xyz")
new_symbols()
#> [1] "abcdefg" "xyz"

More Repositories

1

devtools

Tools to make an R developer's life easier
R
2,336
star
2

lintr

Static Code Analysis for R
R
1,135
star
3

httr

httr: a friendly http package for R
R
975
star
4

actions

GitHub Actions for the R community
JavaScript
868
star
5

testthat

An R πŸ“¦ to make testing πŸ˜€
R
849
star
6

usethis

Set up commonly used πŸ“¦ components
R
798
star
7

pkgdown

Generate static html documentation for an R package
R
686
star
8

styler

Non-invasive pretty printing of R code
R
657
star
9

pak

A fresh approach to package installation
C
575
star
10

cli

Tools for making beautiful & useful command line interfaces
R
571
star
11

roxygen2

Generate R package documentation from inline R comments
R
554
star
12

rig

The R Installation Manager
Rust
460
star
13

rlang

Low-level API for programming with R
R
454
star
14

progress

Progress bar in your R terminal
R
447
star
15

R6

Encapsulated object-oriented programming for R
R
393
star
16

here

A simpler way to find your files
R
387
star
17

scales

Tools for ggplot2 scales
R
373
star
18

fs

Provide cross platform file operations based on libuv.
C
353
star
19

covr

Test coverage reports for R
R
328
star
20

rex

Friendly regular expressions for R.
R
325
star
21

crayon

πŸ–οΈ R package for colored terminal output β€” now superseded by cli
R
321
star
22

memoise

Easy memoisation for R
R
310
star
23

remotes

Install R packages from GitHub, GitLab, Bitbucket, git, svn repositories, URLs
R
309
star
24

lobstr

Understanding complex R objects with tools similar to str()
R
294
star
25

callr

Call R from R
R
281
star
26

vctrs

Generic programming with typed R vectors
C
272
star
27

waldo

Find differences between R objects
R
272
star
28

slider

Sliding Window Functions
R
267
star
29

zeallot

Variable assignment with zeal! (or multiple, unpacking, and destructuring assignment in R)
R
245
star
30

conflicted

An alternative conflict resolution strategy for R
R
242
star
31

bench

High Precision Timing of R Expressions
R
237
star
32

gmailr

Access the Gmail RESTful API from R.
R
234
star
33

processx

Execute and Control Subprocesses from R
R
225
star
34

xml2

Bindings to libxml2
R
212
star
35

asciicast

Turn R scripts into terminal screencasts
R
211
star
36

gh

Minimalistic GitHub API client in R
R
210
star
37

httr2

Make HTTP requests and process their responses. A modern reimagining of httr.
R
206
star
38

cpp11

cpp11 helps you to interact with R objects using C++ code.
C++
187
star
39

keyring

πŸ” Access the system credential store from R
R
185
star
40

vdiffr

Visual regression testing and graphical diffing with testthat
C++
177
star
41

svglite

A lightweight svg graphics device for R
C++
177
star
42

pillar

Format columns with colour
R
173
star
43

ragg

Graphic Devices Based on AGG
C++
169
star
44

ymlthis

write YAML for R Markdown, bookdown, blogdown, and more
R
163
star
45

hugodown

Make websites with hugo and RMarkdown
R
163
star
46

withr

Methods For Temporarily Modifying Global State
R
162
star
47

coro

Coroutines for R
R
146
star
48

rprojroot

Finding files in project subdirectories
R
146
star
49

debugme

Easy and efficient debugging for R packages
R
144
star
50

available

Check if a package name is available to use
R
141
star
51

ellipsis

Tools for Working with ...
R
138
star
52

archive

R bindings to libarchive, supporting a large variety of archive formats
C++
138
star
53

gert

Simple git client for R
C
136
star
54

later

Schedule an R function or formula to run after a specified period of time.
C++
132
star
55

rray

Simple Arrays
R
130
star
56

isoband

isoband: An R package to generate contour lines and polygons.
C++
130
star
57

prettyunits

Pretty, human readable formatting of quantities
JavaScript
126
star
58

tidyselect

A backend for functions taking tidyverse selections
R
122
star
59

desc

Manipulate DESCRIPTION files
R
120
star
60

gargle

Infrastructure for calling Google APIs from R, including auth
R
112
star
61

rcmdcheck

Run R CMD check from R and collect the results
R
110
star
62

evaluate

A version of eval for R that returns more information about what happened
R
107
star
63

prettycode

Syntax highlight R code in the terminal
R
100
star
64

mockery

A mocking library for R.
R
100
star
65

sloop

S language OOP ⛡️
R
98
star
66

pkgdepends

R Package Dependency Resolution
R
93
star
67

revdepcheck

R package reverse dependency checking
R
93
star
68

clock

A Date-Time Library for R
R
93
star
69

lifecycle

Manage the life cycle of your exported functions and arguments
R
91
star
70

systemfonts

System Native Font Handling in R
C++
90
star
71

gtable

The layout packages that powers ggplot2
R
85
star
72

askpass

Password Entry for R, Git, and SSH
R
83
star
73

rappdirs

Find OS-specific directories to store data, caches, and logs. A port of python's AppDirs
R
81
star
74

zip

Platform independent zip compression via miniz
C
81
star
75

commonmark

High Performance CommonMark and Github Markdown Rendering in R
C
81
star
76

downlit

Syntax Highlighting and Automatic Linking
R
80
star
77

clisymbols

Unicode symbols for CLI applications, with fallbacks
R
74
star
78

tree-sitter-r

C
74
star
79

ps

R package to query, list, manipulate system processes
C
72
star
80

sessioninfo

Print Session Information
R
72
star
81

pkgapi

Create a map of functions for an R package - WORK IN PROGRESS!
R
69
star
82

credentials

Tools for Managing SSH and Git Credentials
R
69
star
83

roxygen2md

Convert elements of roxygen documentation to markdown
R
69
star
84

sodium

R bindings to libsodium
R
68
star
85

backports

Reimplementations of Functions Introduced Since R-3.0.0
R
65
star
86

pkgbuild

Find tools needed to build R packages
R
65
star
87

cliapp

Rich Command Line Applications
R
62
star
88

webfakes

Fake web apps for HTTP testing R packages
C
61
star
89

generics

Common generic methods
R
60
star
90

diffviewer

HTML widget to visually compare files
JavaScript
57
star
91

liteq

Serverless R message queue using SQLite
R
55
star
92

pkgload

Simulate installing and loading a package
R
55
star
93

cachem

Key-value caches for R
R
53
star
94

carrier

Create standalone functions for remote execution
R
49
star
95

brio

Basic R Input Output
R
49
star
96

jose

Javascript Object Signing and Encryption for R
R
47
star
97

urlchecker

Run CRAN URL checks from older versions of R
R
46
star
98

pkgconfig

Private configuration for R packages
R
40
star
99

filelock

Cross platform file locking in R
R
39
star
100

pkginstall

Provides a replacement for `utils::install.packages()`
R
35
star