R package: parallelly - Enhancing the 'parallel' Package

The parallelly package provides functions that enhance the parallel package. For example, availableCores() gives the number of CPU cores available to your R process as given by R options and environment variables, including those set by job schedulers on high-performance compute (HPC) clusters. If R runs under 'cgroups' or in a Linux container, then their settings are acknowledged too. If nothing else is set, then it will fall back to parallel::detectCores(). Another example is makeClusterPSOCK(), which is backward compatible with parallel::makePSOCKcluster() while doing a better job of setting up remote cluster workers, without you having to know your local public IP address or configure the firewall to do port-forwarding to your local computer. The functions and features added to this package are written to be backward compatible with the parallel package, such that they may be incorporated there later. The parallelly package comes with an open invitation for the R Core Team to adopt all or parts of its code into the parallel package.

Feature Comparison 'parallelly' vs 'parallel'

| Feature | parallelly | parallel |
|---|---|---|
| remote clusters without knowing local public IP | ✓ | N/A |
| remote clusters without firewall configuration | ✓ | N/A |
| remote username in ~/.ssh/config | ✓ | R (>= 4.1.0) with user = NULL |
| set workers' library package path on startup | ✓ | N/A |
| set workers' environment variables on startup | ✓ | N/A |
| custom workers startup code | ✓ | N/A |
| fallback to RStudio's SSH and PuTTY's plink | ✓ | N/A |
| faster, parallel setup of local workers (R >= 4.0.0) | ✓ | ✓ |
| faster, little-endian protocol by default | ✓ | N/A |
| faster, low-latency socket connections by default | ✓ | N/A |
| validation of cluster at setup | ✓ | ✓ |
| attempt to launch failed workers multiple times | ✓ | N/A |
| collect worker details at cluster setup | ✓ | N/A |
| termination of workers if cluster setup fails | ✓ | R (>= 4.0.0) |
| shutdown of cluster by garbage collector | ✓ | N/A |
| combining multiple, existing clusters | ✓ | N/A |
| more informative printing of cluster objects | ✓ | N/A |
| check if local and remote workers are alive | ✓ | N/A |
| restart local and remote workers | ✓ | N/A |
| defaults via options & environment variables | ✓ | N/A |
| respecting CPU resources allocated by cgroups, Linux containers, and HPC schedulers | ✓ | N/A |
| early error if requesting more workers than possible | ✓ | N/A |
| informative error messages | ✓ | N/A |

Compatibility with the parallel package

Any cluster created by the parallelly package is fully compatible with the clusters created by the parallel package and can be used by all of parallel's functions for cluster processing, e.g. parallel::clusterEvalQ() and parallel::parLapply(). The parallelly::makeClusterPSOCK() function can be used as a drop-in replacement for parallel::makePSOCKcluster(), or equivalently, parallel::makeCluster(..., type = "PSOCK").

Most parallelly functions also apply to clusters created by the parallel package. For example,

cl <- parallel::makeCluster(2)
cl <- parallelly::autoStopCluster(cl)

makes the cluster created by parallel shut down automatically when R's garbage collector removes the cluster object. This lowers the risk of leaving stray R worker processes running in the background by mistake. Another way to achieve the above in a single call is to use:

cl <- parallelly::makeClusterPSOCK(2, autoStop = TRUE)
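As a minimal sketch of the compatibility described in this section, the following creates a two-worker local cluster and passes it to parallel::parLapply(). The helper name make_cluster() is hypothetical and exists only so that the sketch still runs, via parallel::makeCluster(), on a machine where parallelly is not installed.

```r
# Hypothetical helper: prefer parallelly, fall back to parallel if it is
# not installed (the fallback is an assumption made for this sketch only).
make_cluster <- if (requireNamespace("parallelly", quietly = TRUE)) {
  function(n) parallelly::makeClusterPSOCK(n)
} else {
  function(n) parallel::makeCluster(n, type = "PSOCK")
}

cl <- make_cluster(2)

# Either kind of cluster can be handed to parallel's own functions.
squares <- parallel::parLapply(cl, 1:4, function(x) x^2)

# Shut the workers down explicitly when done.
parallel::stopCluster(cl)

print(unlist(squares))
```

Because the cluster object has the same structure either way, code written against parallel does not need to change when the cluster is created by parallelly.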

availableCores() vs parallel::detectCores()

The availableCores() function is designed as a better, safer alternative to detectCores() of the parallel package. It is designed to be a worry-free solution for developers and end-users to query the number of available cores - a solution that plays nicely on multi-tenant systems, in Linux containers, on high-performance compute (HPC) clusters, on CRAN and Bioconductor check servers, and elsewhere.

Did you know that parallel::detectCores() might return NA on some systems, or that parallel::detectCores() - 1 might return 0 on some systems, e.g. old hardware and virtual machines? Because of this, you have to use max(1, parallel::detectCores() - 1, na.rm = TRUE) to get it correct. In contrast, parallelly::availableCores() is guaranteed to return a positive integer, and you can use parallelly::availableCores(omit = 1) to return all but one core and always at least one.
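To make the pitfall concrete, here is a small base-R sketch of the max(1, ..., na.rm = TRUE) workaround. The helper all_but_one() is a hypothetical name introduced for illustration; it takes the detected core count as an argument so that the NA and single-core cases can be tried without special hardware.

```r
# Hypothetical helper wrapping the workaround from the text:
# never return NA, and never return less than one.
all_but_one <- function(n_detected) {
  as.integer(max(1L, n_detected - 1L, na.rm = TRUE))
}

na_case     <- all_but_one(NA_integer_)  # detectCores() returned NA
single_case <- all_but_one(1L)           # single-core machine or VM
multi_case  <- all_but_one(8L)           # typical multi-core machine

# On the current machine:
here <- all_but_one(parallel::detectCores())

# With parallelly installed, the text's one-liner does the same job:
if (requireNamespace("parallelly", quietly = TRUE)) {
  here <- parallelly::availableCores(omit = 1)
}
```

The first three calls give 1, 1, and 7, respectively, whereas a plain detectCores() - 1 would have given NA, 0, and 7.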

Just like other software tools that "hijack" all cores by default, R scripts and packages that default to detectCores() parallel workers cause lots of suffering for fellow end-users and system administrators. For instance, a shared server with 48 cores will come to a halt after only a few users run parallel processing with detectCores() workers each. This problem gets worse on machines with many cores because they can host even more concurrent users. If these R users had used availableCores() instead, then the system administrator could limit the number of cores each user gets to, say, two (2), by setting the environment variable R_PARALLELLY_AVAILABLECORES_FALLBACK=2. In contrast, it is not possible to override what parallel::detectCores() returns, cf. PR#17641 - WISH: Make parallel::detectCores() agile to new env var R_DEFAULT_CORES.

Similarly, availableCores() is also agile to CPU limitations set by Unix control groups (cgroups), which are often used by Linux containers (e.g. Docker, Apptainer / Singularity, and Podman) and Kubernetes (K8s) environments. For example, docker run --cpuset-cpus=0-2,8 ... sets the CPU affinity so that the processes can only run on CPUs 0, 1, 2, and 8 on the host system. In this case availableCores() detects this and returns four (4). Another example is docker run --cpus=3.4 ..., which throttles the CPU quota to on average 3.4 CPUs on the host system. In this case availableCores() detects this and returns three (3), because it rounds to the nearest integer. In contrast, parallel::detectCores() completely ignores such cgroups settings and returns the number of CPUs on the host system, which results in CPU overuse and degraded performance. Continuous Integration (CI) services (e.g. GitHub Actions, Travis CI, and AppVeyor CI) and cloud services (e.g. RStudio Cloud) use these types of cgroups settings under the hood, which means availableCores() respects their CPU allocations.
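The arithmetic behind the two examples above can be sketched in base R. This is not parallelly's internal code; count_cpuset() and quota_to_cores() are hypothetical helpers, and the floor of one core for very small quotas is an assumption made here to match the "always a positive integer" guarantee.

```r
# Hypothetical helper: count the CPUs named by a cpuset string such as "0-2,8".
count_cpuset <- function(spec) {
  parts <- strsplit(spec, ",", fixed = TRUE)[[1]]
  sizes <- vapply(parts, function(part) {
    bounds <- as.integer(strsplit(part, "-", fixed = TRUE)[[1]])
    # A single index counts as one CPU; a range "a-b" counts b - a + 1 CPUs.
    if (length(bounds) == 1L) 1L else bounds[2] - bounds[1] + 1L
  }, integer(1))
  sum(sizes)
}

# Hypothetical helper: round a fractional CPU quota to the nearest integer,
# but never below one (assumption, see lead-in).
quota_to_cores <- function(quota) {
  max(1L, as.integer(round(quota)))
}

cpuset_cores <- count_cpuset("0-2,8")  # CPUs 0, 1, 2, and 8
quota_cores  <- quota_to_cores(3.4)    # quota of 3.4 CPUs
tiny_quota   <- quota_to_cores(0.3)    # small quota still yields one core
```

Here cpuset_cores is 4 and quota_cores is 3, matching the values the text says availableCores() reports for these two container settings.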

If running on an HPC cluster with a job scheduler, a script that uses availableCores() will run the number of parallel workers that the job scheduler has assigned to the job. For example, if we submit a Slurm job as sbatch --cpus-per-task=16 ..., then availableCores() returns 16, because it respects the SLURM_* environment variables set by the scheduler. On Son of Grid Engine (SGE), the scheduler sets NSLOTS when submitting using qsub -pe smp 8 ... and availableCores() returns eight (8). See help("availableCores", package = "parallelly") for currently supported job schedulers, which include 'Fujitsu Technical Computing Suite', 'Load Sharing Facility' (LSF), Simple Linux Utility for Resource Management (Slurm), Sun Grid Engine/Oracle Grid Engine/Son of Grid Engine (SGE), Univa Grid Engine (UGE), and TORQUE/PBS.
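The idea of reading the scheduler's allocation from the environment can be sketched as follows. This is a simplified illustration, not parallelly's implementation: scheduler_cores() is a hypothetical helper, it checks only two variables (Slurm's SLURM_CPUS_PER_TASK and SGE's NSLOTS), and the order in which they are checked is an assumption.

```r
# Hypothetical helper: return the core count assigned by a job scheduler,
# or NA if none of the checked variables is set to a positive integer.
# 'env' is a named character vector; by default the process environment.
scheduler_cores <- function(env = Sys.getenv()) {
  for (name in c("SLURM_CPUS_PER_TASK", "NSLOTS")) {
    value <- suppressWarnings(as.integer(unname(env[name])))
    if (!is.na(value) && value > 0L) return(value)
  }
  NA_integer_
}

# Simulated environments, so the sketch runs outside a cluster:
slurm_job <- scheduler_cores(c(SLURM_CPUS_PER_TASK = "16"))  # sbatch --cpus-per-task=16
sge_job   <- scheduler_cores(c(NSLOTS = "8"))                # qsub -pe smp 8
no_job    <- scheduler_cores(c(HOME = "/home/alice"))        # no scheduler
```

In practice, availableCores() covers many more schedulers and variables than this sketch; see its help page for the full list.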

Of course, availableCores() also respects R options and environment variables commonly used to specify the number of parallel workers, e.g. R option mc.cores and Bioconductor environment variable BIOCPARALLEL_WORKER_NUMBER. It will also detect when running R CMD check and limit the number of workers to two (2), which is the maximum number of parallel workers allowed by the CRAN Policies. This way you, as a package developer, know that your package will always play by the rules on CRAN and Bioconductor.

If nothing is set that limits the number of cores, then availableCores() falls back to parallel::detectCores() and if that returns NA_integer_ then one (1) is returned.

The table below summarizes the benefits:

| | availableCores() | parallel::detectCores() |
|---|---|---|
| Guaranteed to return a positive integer | ✓ | no (may return NA_integer_) |
| Safely use all but some cores | ✓ | no (may return zero or less) |
| Can be overridden, e.g. by a sysadm | ✓ | no |
| Respects cgroups and Linux containers | ✓ | no |
| Respects job scheduler allocations | ✓ | no |
| Respects CRAN policies | ✓ | no |
| Respects Bioconductor policies | ✓ | no |

Backward compatibility with the future package

The functions in this package originate from the future package where we have used and validated them for several years. I moved these functions to this separate package, because they are also useful outside of the future framework. For backward-compatibility reasons of the future framework, the R options and environment variables that are prefixed with parallelly.* and R_PARALLELLY_* can for the time being also be set with future.* and R_FUTURE_* prefixes.
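The prefix fallback described above amounts to a two-step lookup, which can be sketched in base R. The helper get_setting() and the option name demo.setting are hypothetical and used only for illustration; parallelly's own lookup may differ in its details.

```r
# Hypothetical helper: prefer the 'parallelly.*' option, and fall back to
# the legacy 'future.*' option of the same name if the former is not set.
get_setting <- function(name, default = NULL) {
  new_key <- paste0("parallelly.", name)
  old_key <- paste0("future.", name)
  getOption(new_key, getOption(old_key, default))
}

# Only the legacy option is set: the legacy value is used.
options(future.demo.setting = "legacy")
from_legacy <- get_setting("demo.setting")

# Both are set: the 'parallelly.*' option takes precedence.
options(parallelly.demo.setting = "current")
from_current <- get_setting("demo.setting")

# Clean up the demo options.
options(future.demo.setting = NULL, parallelly.demo.setting = NULL)
unset <- get_setting("demo.setting", default = "none")
```

With this lookup order, existing setups that still use the future.* names keep working, while new setups can move to the parallelly.* names.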

Roadmap

  • Submit parallelly to CRAN, with minimal changes compared to the corresponding functions in the future package (on CRAN as of 2020-10-20)

  • Update the future package to import and re-export the functions from the parallelly package to maximize backward compatibility in the future framework (future 1.20.1 on CRAN as of 2020-11-03)

  • Switch to use 10-15% faster useXDR=FALSE

  • Implement same fast parallel setup of parallel PSOCK workers as in parallel (>= 4.0.0)

  • After having validated that there is no negative impact on the future framework, allow for changes in the parallelly package, e.g. renaming the R options and environment variable to be parallelly.* and R_PARALLELLY_* while falling back to future.* and R_FUTURE_*

  • Migrate the currently internal UUID functions and export them, e.g. uuid(), connectionUuid(), and sessionUuid() (HenrikBengtsson/Wishlist-for-R#96). Because R does not have a built-in md5 checksum function that operates on objects, these functions require us to add a dependency on the digest package.

  • Add vignettes on how to set up clusters running on local or remote machines, including in Linux containers and on popular cloud services, and vignettes on common problems and how to troubleshoot them

Initially, backward compatibility for the future package is of top priority.

Installation

R package parallelly is available on CRAN and can be installed in R as:

install.packages("parallelly")

Pre-release version

To install the pre-release version that is available in Git branch develop on GitHub, use:

remotes::install_github("HenrikBengtsson/parallelly", ref="develop")

This will install the package from source.
