RPresto

RPresto is a DBI-based adapter for the open source distributed SQL query engine Presto for running interactive analytic queries.

Installation

RPresto is available both on CRAN and GitHub.

For the CRAN version, you can use

install.packages("RPresto")

You can install the development version of RPresto from GitHub with:

# install.packages("devtools")
devtools::install_github("prestodb/RPresto")

Usage

The following examples assume that you have an in-memory Presto server set up locally. It’s the simplest setup: all data and metadata are stored in RAM on the workers, and both are discarded when Presto restarts. If you don’t have one set up, please refer to the memory connector documentation.

# Load libraries and connect to Presto
library(RPresto)
library(DBI)

con <- DBI::dbConnect(
  drv = RPresto::Presto(),
  host = "http://localhost",
  port = 8080,
  user = Sys.getenv("USER"),
  catalog = "memory",
  schema = "default"
)
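
As a quick check that the connection to the local in-memory server works, you can list the tables visible through it. This is a minimal sketch using the standard DBI generic dbListTables(); on a freshly started memory catalog it should return an empty character vector.

# List the tables in the connection’s current catalog and schema
DBI::dbListTables(con)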

There are two levels of APIs: DBI and dplyr.

DBI APIs

The easiest and most flexible way of executing a SELECT query is using a dbGetQuery() call. It returns the query result in a tibble.

DBI::dbGetQuery(con, "SELECT CAST(3.14 AS DOUBLE) AS pi")
#> # A tibble: 1 × 1
#>      pi
#>   <dbl>
#> 1  3.14

dbWriteTable() can be used to write a small data frame into a Presto table.

# Writing iris data frame into Presto
DBI::dbWriteTable(con, "iris", iris)

dbExistsTable() checks if a table exists.

DBI::dbExistsTable(con, "iris")
#> [1] TRUE

dbReadTable() reads the entire table into R. It’s essentially a SELECT * query on the table.

DBI::dbReadTable(con, "iris")
#> # A tibble: 150 × 5
#>    sepal.length sepal.width petal.length petal.width species
#>           <dbl>       <dbl>        <dbl>       <dbl> <chr>  
#>  1          5.1         3.5          1.4         0.2 setosa 
#>  2          4.9         3            1.4         0.2 setosa 
#>  3          4.7         3.2          1.3         0.2 setosa 
#>  4          4.6         3.1          1.5         0.2 setosa 
#>  5          5           3.6          1.4         0.2 setosa 
#>  6          5.4         3.9          1.7         0.4 setosa 
#>  7          4.6         3.4          1.4         0.3 setosa 
#>  8          5           3.4          1.5         0.2 setosa 
#>  9          4.4         2.9          1.4         0.2 setosa 
#> 10          4.9         3.1          1.5         0.1 setosa 
#> # … with 140 more rows

dbRemoveTable() drops the table from Presto.

DBI::dbRemoveTable(con, "iris")

You can execute a statement and return the number of rows affected using dbExecute().

# Create an empty table using CREATE TABLE
DBI::dbExecute(
  con, "CREATE TABLE testing_table (field1 BIGINT, field2 VARCHAR)"
)
#> [1] 0

dbExecute() returns the number of rows affected by the statement. Since a CREATE TABLE statement creates an empty table, it returns 0.

DBI::dbExecute(
  con,
  "INSERT INTO testing_table VALUES (1, 'abc'), (2, 'xyz')"
)
#> [1] 2

Since 2 rows are inserted into the table, it returns 2.

# Check that the previous INSERT statement worked
DBI::dbReadTable(con, "testing_table")
#> # A tibble: 2 × 2
#>   field1 field2
#>    <int> <chr> 
#> 1      1 abc   
#> 2      2 xyz

dplyr APIs

We also include dplyr database backend integration (which is mainly implemented using the dbplyr package).

# Load packages
library(dplyr)
library(dbplyr)

We can use dplyr::copy_to() to write a local data frame to a PrestoConnection and immediately create a remote table on it.

# Add mtcars to Presto
if (DBI::dbExistsTable(con, "mtcars")) {
  DBI::dbRemoveTable(con, "mtcars")
}
tbl.mtcars <- dplyr::copy_to(dest = con, df = mtcars, name = "mtcars")
# colnames() gives the column names
tbl.mtcars %>% colnames()
#>  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
#> [11] "carb"

dplyr::tbl() also works directly on the PrestoConnection.

# Treat "iris" in Presto as a remote data source that dplyr can now manipulate
if (!DBI::dbExistsTable(con, "iris")) {
  DBI::dbWriteTable(con, "iris", iris)
}
tbl.iris <- dplyr::tbl(con, "iris")

tbl.iris %>% colnames()
#> [1] "sepal.length" "sepal.width"  "petal.length" "petal.width"  "species"

# dplyr verbs can be applied onto the remote data source
tbl.iris %>%
  group_by(species) %>%
  summarize(
    mean_sepal_length = mean(sepal.length, na.rm = TRUE)
  ) %>%
  arrange(species) %>%
  collect()
#> # A tibble: 3 × 2
#>   species    mean_sepal_length
#>   <chr>                  <dbl>
#> 1 setosa                  5.01
#> 2 versicolor              5.94
#> 3 virginica               6.59

BIGINT handling

RPresto’s handling of BIGINT (i.e. 64-bit integers) is similar to that of other DBI packages (e.g. bigrquery, RPostgres). We provide a bigint argument that users can use in multiple interfaces to specify how they want BIGINT-typed data to be translated into R.

The bigint argument takes one of four possible values: "integer", "numeric", "integer64", or "character".

  1. bigint = "integer" is the default setting. It translates BIGINT to R’s native integer type (i.e. 32-bit integers). The range of a 32-bit integer is [-2,147,483,648, 2,147,483,647], which should cover most integer use cases.

  2. If you need to represent integer values outside of the 32-bit integer range, you have 2 options: bigint = "numeric", which translates the number into a double floating-point type in R; and bigint = "integer64", which packages the number using the bit64::integer64 class. Note that both approaches have the same precision-preservation range of +/-(2^53-1) = +/-9,007,199,254,740,991, because the Presto REST API encodes the number in JSON, whose number type preserves only 53 bits of integer precision (rather than 64 bits).

  3. bigint = "character" casts the number into a string. This is most useful when BIGINT is used to represent an ID rather than a real arithmetic number.

Where to use bigint

The DBI interface function dbGetQuery() is the most fundamental interface whereby bigint can be specified. All other interfaces are either built on top of dbGetQuery() or only take effect when used with dbGetQuery().

# BIGINT within the 32-bit integer range is simply translated into integer
DBI::dbGetQuery(con, "SELECT CAST(1 AS BIGINT) AS small_bigint")
#> # A tibble: 1 × 1
#>   small_bigint
#>          <int>
#> 1            1

# BIGINT outside of the 32-bit integer range generates a warning and returns NA
# when bigint is not specified
DBI::dbGetQuery(con, "SELECT CAST(POW(2, 31) AS BIGINT) AS overflow_bigint")
#> Warning in as.integer.integer64(x): NAs produced by integer overflow
#> # A tibble: 1 × 1
#>   overflow_bigint
#>             <int>
#> 1              NA

# Using bigint to specify numeric or integer64 translations
DBI::dbGetQuery(
  con, "SELECT CAST(POW(2, 31) AS BIGINT) AS bigint_numeric",
  bigint = "numeric"
)
#> # A tibble: 1 × 1
#>   bigint_numeric
#>            <dbl>
#> 1     2147483648
DBI::dbGetQuery(
  con, "SELECT CAST(POW(2, 31) AS BIGINT) AS bigint_integer64",
  bigint = "integer64"
)
#> # A tibble: 1 × 1
#>   bigint_integer64
#>            <int64>
#> 1       2147483648
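
The bigint = "character" option is not demonstrated above, so here is a minimal sketch of it against the same connection. The value comes back as a string, which is handy when a BIGINT column holds IDs rather than numbers you compute with.

# Using bigint = "character" returns the value as a string ("2147483648")
DBI::dbGetQuery(
  con, "SELECT CAST(POW(2, 31) AS BIGINT) AS bigint_character",
  bigint = "character"
)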

When used with the dplyr interface, bigint can be specified in two places.

  1. Users can pass the bigint argument to dbConnect() when creating the PrestoConnection. All queries that use the connection later will use the specified bigint setting.
con.bigint <- DBI::dbConnect(
  drv = RPresto::Presto(),
  host = "http://localhost",
  port = 8080,
  user = Sys.getenv("USER"),
  catalog = "memory",
  schema = "default",
  # bigint can be specified in dbConnect
  bigint = "integer64"
)

# BIGINT outside of the 32-bit integer range is automatically translated to
# integer64, per the connection setting earlier
DBI::dbGetQuery(
  con.bigint, "SELECT CAST(POW(2, 31) AS BIGINT) AS bigint_integer64"
)
#> # A tibble: 1 × 1
#>   bigint_integer64
#>            <int64>
#> 1       2147483648
  2. If you only want to specify bigint for a particular query when using the dplyr interface without affecting other queries, you can pass bigint to the collect() call.
tbl.bigint <- dplyr::tbl(
  con, sql("SELECT CAST(POW(2, 31) AS BIGINT) AS bigint")
)

# Default collect() generates a warning and returns NA
dplyr::collect(tbl.bigint)
#> Warning in as.integer.integer64(x): NAs produced by integer overflow
#> # A tibble: 1 × 1
#>   bigint
#>    <int>
#> 1     NA

# Passing bigint to collect() specifies BIGINT treatment
dplyr::collect(tbl.bigint, bigint = "integer64")
#> # A tibble: 1 × 1
#>       bigint
#>      <int64>
#> 1 2147483648

Connecting to Trino

To connect to Trino, you must set the use.trino.headers parameter so that RPresto knows to send the correct headers to the server. Otherwise, all the same functionality is supported.

con.trino <- DBI::dbConnect(
  RPresto::Presto(),
  use.trino.headers = TRUE,
  host = "http://localhost",
  port = 8080,
  user = Sys.getenv("USER"),
  schema = "<schema>",
  catalog = "<catalog>",
  source = "<source>"
)

Passing extra credentials to the connector

To pass extra credentials that get added to the X-Presto-Extra-Credential header, use the extra.credentials parameter; RPresto will add the value to the header while creating the PrestoConnection.

Set use.trino.headers if you want to pass extraCredentials through the X-Trino-Extra-Credential header.

con <- DBI::dbConnect(
  RPresto::Presto(),
  host = "http://localhost",
  port = 7777,
  user = Sys.getenv("USER"),
  schema = "<schema>",
  catalog = "<catalog>",
  source = "<source>",
  extra.credentials = "test.token.foo=bar"
)

Use RPresto with Kerberos

Assuming that you have Kerberos already set up with the Presto coordinator, you can use RPresto with Kerberos by passing a Kerberos config header to the request.config argument of dbConnect(). We provide a convenient wrapper kerberos_configs() to further simplify the workflow.

con <- DBI::dbConnect(
  RPresto::Presto(),
  host = "http://localhost",
  port = 7777,
  user = Sys.getenv("USER"),
  schema = "<schema>",
  catalog = "<catalog>",
  source = "<source>",
  request.config = kerberos_configs()
)

How RPresto works

Presto exposes its interface via a REST-based API [1]. We utilize the httr package to make the API calls and use jsonlite to reshape the data into a tibble.
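
For intuition, here is a minimal, illustrative sketch of that protocol using httr and jsonlite directly; it is not RPresto’s actual implementation, and it assumes the same local memory-connector server used in the examples above. A query is POSTed to /v1/statement, and the client then follows the nextUri in each JSON response until the query finishes.

# Illustrative only: drive the Presto REST protocol by hand
library(httr)
library(jsonlite)

# POST the query text to /v1/statement; user, catalog, and schema are passed
# via X-Presto-* headers
resp <- httr::POST(
  "http://localhost:8080/v1/statement",
  body = "SELECT 1 AS x",
  httr::add_headers(
    "X-Presto-User" = Sys.getenv("USER"),
    "X-Presto-Catalog" = "memory",
    "X-Presto-Schema" = "default"
  )
)
chunk <- jsonlite::fromJSON(httr::content(resp, as = "text", encoding = "UTF-8"))

# Each JSON chunk may carry data rows and, until the query finishes, a nextUri
# pointing at the next chunk to poll
rows <- list()
repeat {
  if (!is.null(chunk$data)) rows[[length(rows) + 1]] <- chunk$data
  if (is.null(chunk$nextUri)) break
  resp <- httr::GET(chunk$nextUri)
  chunk <- jsonlite::fromJSON(httr::content(resp, as = "text", encoding = "UTF-8"))
}
# RPresto assembles these chunks, using the columns metadata, into a tibble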

RPresto has been tested on Presto 0.100.

License

RPresto is BSD-licensed.

Footnotes

  1. See https://github.com/prestodb/presto/wiki/HTTP-Protocol for a description of the API.
