Data quality assessment and metadata reporting for data frames and database tables


With the pointblank package it's really easy to methodically validate your data whether in the form of data frames or as database tables. On top of the validation toolset, the package gives you the means to provide and keep up to date with the information that defines your tables.

For table validation, the agent object works with a large collection of simple (yet powerful!) validation functions. We can enable much more sophisticated validation checks by using custom expressions, segmenting the data, and by selective mutations of the target table. The suite of validation functions ensures that everything just works no matter whether your table is a data frame or a database table.

Sometimes, we want to maintain table information and update it when the table goes through changes. For that, we can use an informant object plus associated functions to help define the metadata entries and present them as a data dictionary. Just like we can with validation, pointblank offers easy ways to have the metadata updated so that this important documentation doesn't become stale.


TABLE VALIDATIONS WITH AN AGENT AND DATA QUALITY REPORTING

Data validation can be carried out in the Data Quality Reporting workflow, ultimately resulting in the production of a data quality analysis report. This is most useful in a non-interactive mode where data quality for database tables and on-disk data files must be periodically checked. The pointblank agent is given a collection of validation functions to define validation steps. We can get extracts of data rows that failed validation, set up custom functions that are invoked by exceeding set threshold failure rates, etc. Want to email the report regularly (or, only if certain conditions are met)? Yep, you can do all that.

Here is an example of how to use pointblank to validate a local table with an agent.

# Generate a simple `action_levels` object to
# set the `warn` state if a validation step
# has a single 'fail' test unit
al <- action_levels(warn_at = 1)

# Create a pointblank `agent` object, with the
# tibble as the target table. Use three validation
# functions, then, `interrogate()`. The agent will
# then have some useful intel.
agent <- 
  dplyr::tibble(
    a = c(5, 7, 6, 5, NA, 7),
    b = c(6, 1, 0, 6,  0, 7)
  ) %>%
  create_agent(
    label = "A very *simple* example.",
    actions = al
  ) %>%
  col_vals_between(
    vars(a), 1, 9,
    na_pass = TRUE
  ) %>%
  col_vals_lt(
    vars(c), 12,
    preconditions = ~ . %>% dplyr::mutate(c = a + b)
  ) %>%
  col_is_numeric(vars(a, b)) %>%
  interrogate()

The reporting's pretty sweet. We can get a gt-based report by printing an agent.
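Beyond the printed report, the agent can also be queried in code. The following is a minimal sketch, rebuilding the same agent so that it runs on its own; it assumes that `all_passed()` and `get_data_extracts()` behave as described in the pointblank reference, and that step 2 is the `c < 12` check because steps are numbered in the order they were added.

```r
library(pointblank)

# Rebuild the agent from the example above
agent <-
  dplyr::tibble(
    a = c(5, 7, 6, 5, NA, 7),
    b = c(6, 1, 0, 6,  0, 7)
  ) %>%
  create_agent(
    label = "A very *simple* example.",
    actions = action_levels(warn_at = 1)
  ) %>%
  col_vals_between(
    vars(a), 1, 9,
    na_pass = TRUE
  ) %>%
  col_vals_lt(
    vars(c), 12,
    preconditions = ~ . %>% dplyr::mutate(c = a + b)
  ) %>%
  col_is_numeric(vars(a, b)) %>%
  interrogate()

# TRUE only if every test unit in every validation step passed
all_ok <- all_passed(agent)

# Rows that failed step 2 (the `c < 12` check); the extract reflects
# the table after the precondition was applied, so it includes `c`
failed_rows <- get_data_extracts(agent, i = 2)
```

Here `all_ok` should be `FALSE`, since two rows (one where `c` is `NA` and one where `c` is 14) do not satisfy `c < 12`.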

The pointblank package is designed to be both straightforward and powerful. And fast! Local data frames don't take very long to validate extensively and all validation checks on remote tables are done entirely in-database. So we can add dozens or even hundreds of validation steps without any long waits for reporting.

Should you want to perform validation checks on database or Spark tables, provide a tbl_dbi or tbl_spark object to create_agent(). The pointblank package currently supports PostgreSQL, MySQL, MariaDB, Microsoft SQL Server, Google BigQuery, DuckDB, SQLite, and Spark DataFrames (through the sparklyr package).
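As a rough illustration of the database case, the sketch below validates a table held in an in-memory SQLite database. It assumes the DBI, RSQLite, and dbplyr packages are installed; the table name `example_tbl` and the two validation steps are chosen only for this example.

```r
library(pointblank)

# Write the small example table to an in-memory SQLite database
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(
  con, "example_tbl",
  data.frame(
    a = c(5, 7, 6, 5, NA, 7),
    b = c(6, 1, 0, 6,  0, 7)
  )
)

# `dplyr::tbl()` on a DBI connection gives a `tbl_dbi` object,
# which can be handed to `create_agent()` like a data frame
remote_tbl <- dplyr::tbl(con, "example_tbl")

agent <-
  create_agent(
    tbl = remote_tbl,
    label = "Validation of a SQLite table."
  ) %>%
  col_vals_between(
    vars(a), 1, 9,
    na_pass = TRUE
  ) %>%
  col_vals_gte(vars(b), 0) %>%
  interrogate()

# Both steps are satisfied by every row of this table
db_ok <- all_passed(agent)

DBI::dbDisconnect(con)
```

The validation code is the same as for a local tibble; only the construction of the target table differs.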

Here are some validation reports for the considerably larger intendo::intendo_revenue table.

postgres, mysql, duckdb


VALIDATIONS DIRECTLY ON DATA

The Pipeline Data Validation workflow uses the same collection of validation functions but without need of an agent. This is useful for an ETL process where we want to periodically check data and trigger warnings, raise errors, or write out logs when exceeding specified failure thresholds. It's a cinch to perform checks on import of the data and at key points during the transformation process, perhaps stopping data flow if things are unacceptable with regard to data quality.

The following example uses the same three validation functions as before but, this time, we use them directly on the data. The validation functions act as a filter, passing data through unless execution is stopped by failing validations beyond the set threshold. In this workflow, by default, an error will occur if there is a single 'fail' test unit in any validation step:

dplyr::tibble(
    a = c(5, 7, 6, 5, NA, 7),
    b = c(6, 1, 0, 6,  0, 7)
  ) %>%
  col_vals_between(
    a, 1, 9,
    na_pass = TRUE
  ) %>%
  col_vals_lt(
    c, 12,
    preconditions = ~ . %>% dplyr::mutate(c = a + b)
  ) %>%
  col_is_numeric(c(a, b))
Error: Exceedance of failed test units where values in `c` should have been < `12`.
The `col_vals_lt()` validation failed beyond the absolute threshold level (1).
* failure level (2) >= failure threshold (1) 

We can downgrade this error to a warning with the warn_on_fail() helper function (assigning it to actions). In this way, the data will always be returned, but warnings will appear.

# The `warn_on_fail()` function is a nice
# shortcut for `action_levels(warn_at = 1)`;
# it works great in this data checking workflow
# (and the threshold can still be adjusted)
dplyr::tibble(
    a = c(5, 7, 6, 5, NA, 7),
    b = c(6, 1, 0, 6,  0, 7)
  ) %>%
  col_vals_between(
    a, 1, 9,
    na_pass = TRUE,
    actions = warn_on_fail()
  ) %>%
  col_vals_lt(
    c, 12,
    preconditions = ~ . %>% dplyr::mutate(c = a + b),
    actions = warn_on_fail()
  ) %>%
  col_is_numeric(
    c(a, b),
    actions = warn_on_fail()
  )
#> # A tibble: 6 x 2
#>       a     b
#>   <dbl> <dbl>
#> 1     5     6
#> 2     7     1
#> 3     6     0
#> 4     5     6
#> 5    NA     0
#> 6     7     7

Warning message:
Exceedance of failed test units where values in `c` should have been < `12`.
The `col_vals_lt()` validation failed beyond the absolute threshold level (1).
* failure level (2) >= failure threshold (1) 

Should you need more fine-grained thresholds and resultant actions, the action_levels() function can be used to specify multiple failure thresholds and side effects for each failure state. However, with warn_on_fail() and stop_on_fail() (applied by default, with stop_at = 1), you should have good enough options for this validation workflow.
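To make the idea of multiple thresholds concrete, here is a small sketch with fractional `warn_at` and `stop_at` values. It uses an agent rather than the direct-on-data workflow so that the resulting states can be inspected afterwards; the threshold values are arbitrary, and the x-list field names (`n_failed`, `warn`, `stop`) are used as I understand them from the `get_agent_x_list()` documentation.

```r
library(pointblank)

# Fractional thresholds: enter the `warn` state when at least 10% of
# test units fail, and the `stop` state when at least 50% fail
al <- action_levels(warn_at = 0.1, stop_at = 0.5)

agent <-
  dplyr::tibble(
    a = c(5, 7, 6, 5, NA, 7),
    b = c(6, 1, 0, 6,  0, 7)
  ) %>%
  create_agent(actions = al) %>%
  col_vals_lt(
    vars(c), 12,
    preconditions = ~ . %>% dplyr::mutate(c = a + b)
  ) %>%
  interrogate()

# The x-list exposes per-step results as plain vectors
x <- get_agent_x_list(agent)

n_failed <- x$n_failed   # failing test units in each step
warn_state <- x$warn     # was the `warn` threshold met?
stop_state <- x$stop     # was the `stop` threshold met?
```

Two of the six test units fail here (about 33%), so the single step should enter the `warn` state but stay below the `stop` threshold.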


VALIDATIONS IN R MARKDOWN DOCUMENTS

Using pointblank in an R Markdown workflow is enabled by default once the pointblank library is loaded. The framework allows for validation testing within specialized validation code chunks where the validate = TRUE option is set. Using pointblank validation functions on data in these marked code chunks will flag overall failure if the stop threshold is exceeded anywhere. All errors are reported in the validation code chunk after rendering the document to HTML, where green or red status buttons indicate whether all validations succeeded or failures occurred. Click them to reveal the otherwise hidden validation statements and any associated error messages.

An R Markdown document set up this way is available as a template in the RStudio IDE; it's called Pointblank Validation.


TABLE INFORMATION

Table information can be synthesized in an information management workflow, giving us a snapshot of a data table we care to collect information on. The pointblank informant is fed a series of info_*() functions to define bits of information about a table. This info text can pertain to individual columns, the table as a whole, and whatever additional information makes sense for your organization. We can even glean little snippets of information (like column stats or sample values) from the target table with info_snippet() and the snip_*() functions and mix them into the data dictionary wherever they're needed.

Here is an example of how to use pointblank to incorporate pieces of info text into an informant object.

# Create a pointblank `informant` object, with the
# tibble as the target table. Use a few information
# functions and end with `incorporate()`. The informant
# will then show you information about the tibble.
informant <- 
  dplyr::tibble(
    a = c(5, 7, 6, 5, NA, 7),
    b = c(6, 1, 0, 6,  0, 7)
  ) %>%
  create_informant(
    label = "A very *simple* example.",
    tbl_name = "example_tbl"
  ) %>%
  info_tabular(
    description = "This two-column table is nothing all that
    interesting, but, it's fine for examples on **GitHub**
    `README` pages. Column names are `a` and `b`. ((Cool stuff))"
  ) %>%
  info_columns(
    columns = a,
    info = "This column has an `NA` value. [[Watch out!]]<<color: red;>>"
  ) %>%
  info_columns(
    columns = a,
    info = "Mean value is `{a_mean}`."
  ) %>%
  info_columns(
    columns = b,
    info = "Like column `a`. The lowest value is `{b_lowest}`."
  ) %>%
  info_columns(
    columns = b,
    info = "The highest value is `{b_highest}`."
  ) %>%
  info_snippet(
    snippet_name = "a_mean",
    fn = ~ . %>% .$a %>% mean(na.rm = TRUE) %>% round(2)
  ) %>%
  info_snippet(snippet_name = "b_lowest", fn = snip_lowest("b")) %>%
  info_snippet(snippet_name = "b_highest", fn = snip_highest("b")) %>%
  info_section(
    section_name = "further information", 
    `examples and documentation` = "Examples for how to use the
    `info_*()` functions (and many more) are available at the
    [**pointblank** site](https://rstudio.github.io/pointblank/)."
  ) %>%
  incorporate()

By printing the informant we get the table information report.

Here is a link to a hosted information report for the intendo::intendo_revenue table:

Information Report for intendo::intendo_revenue


TABLE SCANS

We can use the scan_data() function to generate a comprehensive summary of a tabular dataset. This allows us to quickly understand what's in the dataset and it helps us determine if there are any peculiarities within the data. Scanning the dplyr::storms dataset with scan_data(tbl = dplyr::storms) gives us an interactive HTML report. Here are a few of them, published in RPubs:

Table Scan of dplyr::storms

Table Scan of pointblank::game_revenue

Database tables can be used with scan_data() as well. Here are two examples using (1) the full_region table of the Rfam database (hosted publicly at mysql-rfam-public.ebi.ac.uk) and (2) the assembly table of the Ensembl database (hosted publicly at ensembldb.ensembl.org).

Rfam: full_region

Ensembl: assembly


OVERVIEW OF PACKAGE FUNCTIONS

There are many functions available in pointblank for understanding data quality and creating data documentation. Here is an overview of all of them, grouped by family. For much more information on these, visit the documentation website or take a Test Drive in the Posit Cloud project.


DISCUSSIONS

Let's talk about data validation and data documentation in pointblank Discussions! It's a great place to ask questions about how to use the package, discuss some ideas, engage with others, and much more!

INSTALLATION

Want to try this out? The pointblank package is available on CRAN:

install.packages("pointblank")

You can also install the development version of pointblank from GitHub:

devtools::install_github("rstudio/pointblank")

If you encounter a bug, have usage questions, or want to share ideas to make this package better, feel free to file an issue.


Code of Conduct

Please note that the pointblank project is released with a contributor code of conduct.
By participating in this project you agree to abide by its terms.

📄 License

pointblank is licensed under the MIT license. See the LICENSE.md file for more details.

© Posit Software, PBC.

🏛️ Governance

This project is primarily maintained by Rich Iannone. Other authors may occasionally assist with some of these duties.

