dbx
🔥 A fast, easy-to-use database library for R
- Intuitive functions
- High performance batch operations
- Safe inserts, updates, and deletes without writing SQL
- Upserts!
- Great date and time support
- Works well with auto-incrementing primary keys
- Built on top of DBI
Designed for both research and production environments
Supports Postgres, MySQL, MariaDB, SQLite, and more
Installation
Install dbx
install.packages("dbx")
And follow the instructions for your database below
To install with Jetpack, use:
jetpack::add("dbx")
Postgres
Install the R package
install.packages("RPostgres")
And use:
library(dbx)
db <- dbxConnect(adapter="postgres", dbname="mydb")
You can also pass user, password, host, port, and url.
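For example, a connection with explicit credentials might look like this (the host, port, and credentials below are placeholders for your own settings):
db <- dbxConnect(
  adapter="postgres",
  dbname="mydb",
  host="localhost",
  port=5432,
  user="myuser",
  password="secret"
)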
Works with RPostgreSQL as well
MySQL & MariaDB
Install the R package
install.packages("RMySQL")
And use:
library(dbx)
db <- dbxConnect(adapter="mysql", dbname="mydb")
You can also pass user, password, host, port, and url.
Works with RMariaDB as well
SQLite
Install the R package
install.packages("RSQLite")
And use:
library(dbx)
db <- dbxConnect(adapter="sqlite", dbname=":memory:")
SQL Server
Install the R package
install.packages("odbc")
And use:
library(dbx)
db <- dbxConnect(adapter=odbc::odbc(), database="mydb")
You can also pass uid, pwd, server, and port.
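For example (server, port, and credentials are placeholders; depending on your ODBC setup you may also need a driver or DSN):
db <- dbxConnect(
  adapter=odbc::odbc(),
  database="mydb",
  server="localhost",
  port=1433,
  uid="myuser",
  pwd="secret"
)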
Redshift
For Redshift, follow the Postgres instructions.
Others
Install the appropriate R package and use:
db <- dbxConnect(adapter=odbc::odbc(), database="mydb")
Operations
Select
Create a data frame of records from a SQL query
records <- dbxSelect(db, "SELECT * FROM forecasts")
Pass parameters
dbxSelect(db, "SELECT * FROM forecasts WHERE period = ? AND temperature > ?", params=list("hour", 27))
Parameters can also be vectors
dbxSelect(db, "SELECT * FROM forecasts WHERE id IN (?)", params=list(1:3))
Insert
Insert records
table <- "forecasts"
records <- data.frame(temperature=c(32, 25))
dbxInsert(db, table, records)
If you use auto-incrementing ids in Postgres, you can get the ids of newly inserted rows by passing the column name:
dbxInsert(db, table, records, returning=c("id"))
Update
Update records
records <- data.frame(id=c(1, 2), temperature=c(16, 13))
dbxUpdate(db, table, records, where_cols=c("id"))
Use where_cols to specify the columns used for lookup. Other columns are written to the table.
Updates are batched when possible, but often need to be run as multiple queries. We recommend upsert when possible for better performance, as it can always be run as a single query. Turn on logging to see the difference.
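For example, with the records from above you can enable logging and compare the queries generated by an update and an upsert (the upsert form requires one of the databases listed in the next section):
options(dbx_logging=TRUE)
dbxUpdate(db, table, records, where_cols=c("id"))  # may run as multiple UPDATE queries
dbxUpsert(db, table, records, where_cols=c("id"))  # runs as a single query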
Upsert
Only available for PostgreSQL 9.5+, MySQL 5.5+, and SQLite 3.24+
Atomically insert records if they don't exist, otherwise update them
records <- data.frame(id=c(2, 3), temperature=c(20, 25))
dbxUpsert(db, table, records, where_cols=c("id"))
Use where_cols to specify the columns used for lookup. There must be a unique index on them, or an error will be thrown.
To skip existing rows instead of updating them, use:
dbxUpsert(db, table, records, where_cols=c("id"), skip_existing=TRUE)
If you use auto-incrementing ids in Postgres, you can get the ids of newly upserted rows by passing the column name:
dbxUpsert(db, table, records, where_cols=c("id"), returning=c("id"))
Delete
Delete specific records
bad_records <- data.frame(id=c(1, 2))
dbxDelete(db, table, where=bad_records)
Delete all records (uses TRUNCATE when possible for performance)
dbxDelete(db, table)
Execute
Execute a statement
dbxExecute(db, "UPDATE forecasts SET temperature = temperature + 1")
Pass parameters
dbxExecute(db, "UPDATE forecasts SET temperature = ? WHERE id IN (?)", params=list(27, 1:3))
Logging
Log all SQL queries with:
options(dbx_logging=TRUE)
Customize logging by passing a function
logQuery <- function(sql) {
# your logging code
}
options(dbx_logging=logQuery)
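For example, a sketch of a logging function that appends each query to a file (the filename is just illustrative):
logQuery <- function(sql) {
  # example only: append each statement to a file of your choosing
  cat(sql, "\n", file="queries.log", append=TRUE)
}
options(dbx_logging=logQuery)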
Database Credentials
Environment variables are a convenient way to store database credentials. This keeps them outside your source control. It's also how platforms like Heroku store them.
Create an .Renviron file in your home directory with:
DATABASE_URL=postgres://user:pass@host/dbname
Install urltools:
install.packages("urltools")
And use:
db <- dbxConnect()
If you have multiple databases, use a different variable name, and:
db <- dbxConnect(url=Sys.getenv("OTHER_DATABASE_URL"))
You can also use a package like keyring.
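For example, a sketch using keyring to read the password from the operating system's credential store (the service name "mydb_password" is a placeholder you would set up yourself):
db <- dbxConnect(
  adapter="postgres",
  dbname="mydb",
  user="myuser",
  password=keyring::key_get("mydb_password")  # service name is a placeholder
)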
Batching
By default, operations are performed in a single statement or transaction. This is better for performance and prevents partial writes on failures. However, when working with large data frames on production systems, it can be better to break writes into batches. Use the batch_size option to do this.
dbxInsert(db, table, records, batch_size=1000)
dbxUpdate(db, table, records, where_cols, batch_size=1000)
dbxUpsert(db, table, records, where_cols, batch_size=1000)
dbxDelete(db, table, where, batch_size=1000)
Query Comments
Add comments to queries to make it easier to see where time-consuming queries are coming from.
options(dbx_comment=TRUE)
The comment will be appended to queries, like:
SELECT * FROM users /*script:forecast.R*/
Set a custom comment with:
options(dbx_comment="hi")
Transactions
To perform multiple operations in a single transaction, use:
DBI::dbWithTransaction(db, {
dbxInsert(db, ...)
dbxDelete(db, ...)
})
For updates inside a transaction, use:
dbxUpdate(db, transaction=FALSE)
Schemas
To specify a schema, use:
table <- DBI::Id(schema="schema", table="table")
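The resulting identifier can be passed anywhere a table name is expected, for example (schema and table names are illustrative):
table <- DBI::Id(schema="archive", table="forecasts")
dbxInsert(db, table, records)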
Data Type Notes
Dates & Times
Dates are returned as Date objects and times as POSIXct objects. Times are stored in the database in UTC and converted to your local time zone when retrieved.
Times without dates are returned as character vectors since R has no built-in support for this type. If you use hms, you can convert columns with:
records$column <- hms::as_hms(records$column)
SQLite does not have support for TIME columns, so we recommend storing as VARCHAR.
JSON
JSON and JSONB columns are returned as character vectors. You can use jsonlite to parse them with:
records$column <- lapply(records$column, jsonlite::fromJSON)
SQLite does not have support for JSON columns, so we recommend storing as TEXT.
Binary Data
BLOB and BYTEA columns are returned as raw vectors.
Data Type Limitations
Dates & Times
RSQLite does not currently provide enough info to automatically typecast dates and times. You can manually typecast date columns with:
records$column <- as.Date(records$column)
And time columns with:
records$column <- as.POSIXct(records$column, tz="Etc/UTC")
attr(records$column, "tzone") <- Sys.timezone()
Booleans
RMariaDB and RSQLite do not currently provide enough info to automatically typecast booleans. You can manually typecast with:
records$column <- records$column != 0
JSON
RMariaDB does not currently support JSON.
Binary Data
RMySQL can write BLOB columns, but can't retrieve them directly. To work around this, use:
records <- dbxSelect(db, "SELECT HEX(column) AS column FROM table")
hexToRaw <- function(x) {
y <- strsplit(x, "")[[1]]
z <- paste0(y[c(TRUE, FALSE)], y[c(FALSE, TRUE)])
as.raw(as.hexmode(z))
}
records$column <- lapply(records$column, hexToRaw)
Bigint
BIGINT columns are returned as numeric vectors. The numeric type in R loses precision above 2^53. Some libraries (RPostgres, RMariaDB, RSQLite, ODBC) support returning bit64::integer64 vectors instead.
dbxConnect(bigint="integer64")
Connection Pooling
Install the pool package
install.packages("pool")
Create a pool
library(pool)
factory <- function() {
dbxConnect(adapter="postgres", ...)
}
pool <- poolCreate(factory, maxSize=5)
Run queries
conn <- poolCheckout(pool)
tryCatch({
dbxSelect(conn, "SELECT * FROM forecasts")
}, finally={
poolReturn(conn)
})
In the future, dbx commands may work directly with pools.
Security
When connecting to a database over a network you don't fully trust, make sure your connection is secure.
With Postgres, use:
db <- dbxConnect(adapter="postgres", sslmode="verify-full", sslrootcert="ca.pem")
With RMariaDB, use:
db <- dbxConnect(adapter="mysql", ssl.ca="ca.pem")
Please let us know if you have a way that works with RMySQL.
Variables
Set session variables with:
db <- dbxConnect(variables=list(search_path="archive"))
Timeouts
Set a statement timeout with:
# Postgres
db <- dbxConnect(variables=list(statement_timeout=1000)) # ms
# MySQL 5.7.8+
db <- dbxConnect(variables=list(max_execution_time=1000)) # ms
# MariaDB 10.1.1+
db <- dbxConnect(variables=list(max_statement_time=1)) # sec
With Postgres, set a connect timeout with:
db <- dbxConnect(connect_timeout=3) # sec
Compatibility
All connections are simply DBI connections, so you can use them anywhere you use DBI.
dbCreateTable(db, ...)
Install dbplyr to use data with dplyr.
forecasts <- tbl(db, "forecasts")
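Once dbplyr is installed, standard dplyr verbs work against the table reference; a minimal sketch (the column name is illustrative):
library(dplyr)
forecasts <- tbl(db, "forecasts")
forecasts %>%
  filter(temperature > 20) %>%
  collect()  # run the query and return a local data frame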
Reference
To close a connection, use:
dbxDisconnect(db)
Upgrading
0.2.0
Version 0.2.0 brings a number of fixes and improvements to data types.
However, there are a few breaking changes to be aware of:
- The dbxInsert and dbxUpsert functions no longer return a data frame by default. For MySQL and SQLite, the data frame was just the records argument. For Postgres, if you use auto-incrementing primary keys, the data frame contained ids of the newly inserted/upserted records. To get the ids, pass the name of the column as the returning argument:
  dbxInsert(db, table, records, returning=c("id"))
- timestamp without time zone columns in Postgres are now stored in UTC instead of local time by default. This does not affect timestamp with time zone columns. To keep the previous behavior, use:
  dbxConnect(adapter="postgres", storage_tz=Sys.timezone(), ...)
History
View the changelog
Contributing
Everyone is encouraged to help improve this project. Here are a few ways you can help:
- Report bugs
- Fix bugs and submit pull requests
- Write, clarify, or fix documentation
- Suggest or add new features
To get started with development:
git clone https://github.com/ankane/dbx.git
cd dbx
# create Postgres database
createdb dbx_test
# create MySQL database
mysqladmin create dbx_test
In R, do:
install.packages("devtools")
devtools::install_deps(dependencies=TRUE)
devtools::test()