Standard API for Distributed Data Structures in R
Title: ddR README
Authors: Edward Ma, Indrajit Roy, Michael Lawrence
Date: 2015-10-22

The 'ddR' package aims to provide a unified R interface for writing parallel and distributed applications. Our goal is to ensure that R programs written using the 'ddR' API work across different distributed backends, thereby reducing the effort users need to understand and program on each backend. Currently, 'ddR' programs can be executed on R's default 'parallel' package as well as on the open source HP Distributed R. We plan to add support for SparkR. This package is an outcome of feedback and collaboration across different companies and R-core members!

Through funding provided by the R Consortium, this package is under active development for the summer of 2016. Check out the mailing list to see the latest discussions.

'ddR' is an API, and includes a default execution engine, to express and execute distributed applications. Users can declare distributed objects (i.e., dlist, dframe, darray) and execute parallel operations on these data structures using R-style apply functions. It also allows different backends (those that support ddR and have ddR "drivers" written for them) to be dynamically activated in the R user's environment to execute applications.

Please refer to the user guide under vignettes/ for a detailed description of how to use the package.

Some quick examples

library(ddR)

By default, the parallel backend is used with all the cores present on the machine. You can switch backends or specify the number of cores to use with the useBackend function. For example, you can specify that the parallel backend should be used with only 4 cores by executing useBackend(parallel, executors=4).

Initializing a distributed list (dlist):

a <- dmapply(function(x) { x }, rep(3,5))
collect(a)
## [[1]]
## [1] 3
## 
## [[2]]
## [1] 3
## 
## [[3]]
## [1] 3
## 
## [[4]]
## [1] 3
## 
## [[5]]
## [1] 3

Printing a:

a
## 
## ddR Distributed Object
## Type: dlist
## # of partitions: 5
## Partitions per dimension: 5x1
## Partition sizes: [1], [1], [1], [1], [1]
## Length: 5
## Backend: parallel

a is a distributed object in ddR. Note that we did not specify the number of partitions of the output; by default it is equal to the length of the inputs (5). Use the nparts parameter to specify how the output should be partitioned.

Below is the code to add 1 to the first element of a, 2 to the second, etc. The syntax of dmapply is similar to R's standard mapply function.

b <- dmapply(function(x,y) { x + y }, a, 1:5,nparts=1)
b
## 
## ddR Distributed Object
## Type: dlist
## # of partitions: 1
## Partitions per dimension: 1x1
## Partition sizes: [5]
## Length: 5
## Backend: parallel

Since we specified nparts=1 in dmapply, b has only one partition of 5 elements. Note that the nparts argument is optional and can always be omitted.

collect(b)
## [[1]]
## [1] 4
## 
## [[2]]
## [1] 5
## 
## [[3]]
## [1] 6
## 
## [[4]]
## [1] 7
## 
## [[5]]
## [1] 8
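
For readers more familiar with base R, the two dmapply calls above can be mirrored with mapply on ordinary in-memory lists. This is only a local sketch of the element-wise semantics, not ddR itself: the names a_local and b_local are illustrative, nothing is partitioned or distributed, and SIMPLIFY = FALSE is set so that mapply returns a list, matching the shape of the collect() output shown above.

```r
# Base R analogue of the dmapply examples (no ddR required).
# a_local and b_local are illustrative names, not part of the ddR API.

# Identity function applied to five 3s, mirroring `a`.
a_local <- mapply(function(x) x, rep(3, 5), SIMPLIFY = FALSE)

# Element-wise sum of a_local and 1:5, mirroring `b`.
b_local <- mapply(function(x, y) x + y, a_local, 1:5, SIMPLIFY = FALSE)

# Inspect the resulting list of five numbers.
str(b_local)
```

The values agree with collect(b) above; what base mapply has no counterpart for is the nparts argument, which only affects how ddR lays out the result.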

Some other operations:

Adding a to b, and then subtracting a constant value:

addThenSubtract <- function(x,y,z) {
  x + y - z
}
c <- dmapply(addThenSubtract,a,b,MoreArgs=list(z=5))
collect(c)
## [[1]]
## [1] 2
## 
## [[2]]
## [1] 3
## 
## [[3]]
## [1] 4
## 
## [[4]]
## [1] 5
## 
## [[5]]
## [1] 6
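
The role of MoreArgs can be seen in isolation with base R's mapply, assuming (as the output above suggests) that dmapply follows the same convention: the iterated arguments are consumed element by element, while everything in MoreArgs is passed unchanged to every call. The sketch below uses plain lists with the same values as a and b; a_local, b_local and c_local are illustrative names only.

```r
# Base R sketch of iterated arguments vs. MoreArgs (no ddR required).
addThenSubtract <- function(x, y, z) {
  x + y - z
}

a_local <- as.list(rep(3, 5))  # same values as collect(a)
b_local <- as.list(4:8)        # same values as collect(b)

# x and y vary per element; z = 5 is shared by all five calls.
c_local <- mapply(addThenSubtract, a_local, b_local,
                  MoreArgs = list(z = 5), SIMPLIFY = FALSE)

# Flatten the list result into a numeric vector for inspection.
unlist(c_local)
```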

We can also process distributed objects partitionwise. Below is an example where we calculate the length of each partition:

d <- dmapply(function(x) length(x),parts(a))
collect(d)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 1
## 
## [[3]]
## [1] 1
## 
## [[4]]
## [1] 1
## 
## [[5]]
## [1] 1

We partitioned a with 5 parts and it had 5 elements, so the length of each partition is 1.

However, b only had one partition, so that one partition should be of length 5:

e <- dmapply(function(x) length(x),parts(b))
collect(e)
## [[1]]
## [1] 5

Note that parts() and non-parts arguments can be passed to dmapply in any combination. parts(dobj) returns a list of the partitions of the distributed object dobj, which can be passed into dmapply like any other list. parts(dobj,index), where index is a list, vector, or scalar, returns a specific partition or range of partitions of dobj.
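
One way to picture the difference between element-wise and partition-wise processing is to model a distributed list as a plain list of partitions, each of which is itself a list of elements. The sketch below is a toy model in base R under that assumption; it says nothing about how ddR stores data internally, and a_parts, b_parts and first_two are hypothetical names.

```r
# Toy model of partitions using nested base R lists (no ddR required).
a_parts <- lapply(1:5, function(i) list(3))  # 5 partitions, 1 element each
b_parts <- list(as.list(4:8))                # 1 partition, 5 elements

# Partition-wise operation: the function receives a whole partition.
lengths_a <- vapply(a_parts, length, integer(1))
lengths_b <- vapply(b_parts, length, integer(1))

# Selecting a range of partitions, loosely analogous to parts(dobj, index).
first_two <- a_parts[1:2]
```

In this model lengths_a holds five 1s and lengths_b holds a single 5, mirroring the collect(d) and collect(e) results above.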

We also support darrays and dframes. See vignettes/ for how to use them.

For more interesting parallel machine learning algorithms, you may view (and run) the example scripts under /examples.

Using the Distributed R backend

To use the Distributed R library for ddR, first install distributedR.ddR and then load it:

library(distributedR.ddR)
## Loading required package: distributedR
## Loading required package: Rcpp
## Loading required package: RInside
## Loading required package: XML
## Loading required package: ddR
## 
## Attaching package: 'ddR'
## 
## The following objects are masked from 'package:distributedR':
## 
##     darray, dframe, dlist, is.dlist
useBackend(distributedR)

Now you can try the different list examples which were used with the 'parallel' backend.

How to Contribute

You can help us in different ways:

  1. Reporting issues.
  2. Contributing code and sending a Pull Request.

In order to contribute to the code base of this project, you must agree to the Developer Certificate of Origin (DCO) 1.1 for this project under GPLv2+:

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I have the 
    right to submit it under the open source license indicated in the file; or
(b) The contribution is based upon previous work that, to the best of my 
    knowledge, is covered under an appropriate open source license and I 
    have the right under that license to submit that work with modifications, 
    whether created in whole or in part by me, under the same open source 
    license (unless I am permitted to submit under a different license), 
    as indicated in the file; or
(c) The contribution was provided directly to me by some other person who 
    certified (a), (b) or (c) and I have not modified it.
(d) I understand and agree that this project and the contribution are public and
    that a record of the contribution (including all personal information I submit 
    with it, including my sign-off) is maintained indefinitely and may be 
    redistributed consistent with this project or the open source license(s) involved.

To indicate acceptance of the DCO you need to add a Signed-off-by line to every commit. E.g.:

Signed-off-by: John Doe <[email protected]>

To automatically add that line use the -s switch when running git commit:

$ git commit -s
