data.validator

Validate your data and create nice reports straight from R.

Description

data.validator is a package for scalable and reproducible data validation. It provides:

  • Functions for validating datasets in %>% pipelines: validate_if, validate_cols and validate_rows
  • Predicate functions from the assertr package, such as in_set and within_bounds.
  • Functions for creating user-friendly reports that you can send by email, store in a logs folder, or generate automatically with RStudio Connect.

Installation

Install from CRAN:

install.packages("data.validator")

or the latest development version:

remotes::install_github("Appsilon/data.validator")

Data validation

The validation cycle is simple:

  1. Create a report object.
  2. Prepare your dataset. You can load and preprocess it before running the validate() pipeline.
  3. Validate your dataset.
    • Start a validation block with the validate() function. It adds a new section to the report.
    • Use validate_* functions and predicates to validate the data. You can also create custom predicates; see the between() example below.
    • Add the assertion results to the report with add_results().
  4. Print the results or generate an HTML report.

library(assertr)
library(magrittr)
library(data.validator)

report <- data_validation_report()

validate(mtcars, name = "Verifying cars dataset") %>%
  validate_if(drat > 0, description = "Column drat has only positive values") %>%
  validate_cols(in_set(c(0, 2)), vs, am, description = "vs and am values equal 0 or 2 only") %>%
  validate_cols(within_n_sds(1), mpg, description = "mpg within 1 sds") %>%
  validate_rows(num_row_NAs, within_bounds(0, 2), vs, am, mpg, description = "not too many NAs in rows") %>%
  validate_rows(maha_dist, within_n_mads(10), everything(), description = "maha dist within 10 mads") %>%
  add_results(report)

between <- function(a, b) {
  function(x) { a <= x & x <= b }
}

validate(iris, name = "Verifying flower dataset") %>%
  validate_if(Sepal.Length > 0, description = "Sepal length is greater than 0") %>%
  validate_cols(between(0, 4), Sepal.Width, description = "Sepal width is between 0 and 4") %>%
  add_results(report)

print(report)
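
A custom predicate such as between() is an ordinary R function that takes a vector and returns a logical vector of the same length. The minimal sketch below uses only base R to show this; not_blank is a hypothetical extra predicate added for illustration and is not part of data.validator or assertr.

```r
# Predicate factory from the example above: returns a function that
# checks whether each value lies in the closed interval [a, b].
between <- function(a, b) {
  function(x) { a <= x & x <= b }
}

sepal_width_ok <- between(0, 4)
sepal_width_ok(c(-1, 2.5, 4, 4.4))
# FALSE TRUE TRUE FALSE

# Hypothetical predicate: TRUE for non-missing, non-blank strings.
not_blank <- function(x) {
  !is.na(x) & nzchar(trimws(x))
}

not_blank(c("a", " ", NA, ""))
# TRUE FALSE FALSE FALSE
```

Writing the predicate as a factory (a function that returns a function) lets the bounds be fixed once and the resulting check be reused across columns.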

Reporting

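Print results to the console. The sample output below comes from a report built with a different set of validation rules than the example above: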

print(report)

# Validation summary:
#  Number of successful validations: 1
#  Number of failed validations: 4
#  Number of validations with warnings: 1
#
# Advanced view:
#
# |table_name |description                                       |type    | total_violations|
# |:----------|:-------------------------------------------------|:-------|----------------:|
# |mtcars     |Column drat has only positive values              |success |               NA|
# |mtcars     |Column drat has only values larger than 3         |error   |                4|
# |mtcars     |Each row sum for am:vs columns is less or equal 1 |error   |                7|
# |mtcars     |For wt and qsec we have: abs(col) < 2 * sd(col)   |error   |                4|
# |mtcars     |vs and am values equal 0 or 2 only                |error   |               27|
# |mtcars     |vs and am values should equal 3 or 4              |warning |               24|

Save as an HTML report:

save_report(report)

Full examples

Checking key columns uniqueness

A common step in data validation is ensuring that key columns are unique and not empty.

A test dataset for preparing the validation schema can be created with the fixtuRes package.

library(fixtuRes)
library(magrittr)
library(assertr)
library(data.validator)

my_mock_generator <- fixtuRes::MockDataGenerator$new("path-to-my-configuration.yml")
my_data_frame <- my_mock_generator$get_data("my_data_frame", 10)

report <- data.validator::data_validation_report()

validate(my_data_frame, name = "Verifying data uniqueness") %>%
  validate_if(is_uniq(id), description = "ID column is unique") %>%
  validate_if(!is.na(id) & id != "", description = "ID column is not empty") %>%
  validate_if(is.character(code), description = "CODE column is string") %>%
  validate_rows(col_concat, is_uniq, code, type, description = "CODE and TYPE combination is unique") %>%
  add_results(report)

print(report)

# Validation summary:
#  Number of successful validations: 4
#  Number of failed validations: 0
#  Number of validations with warnings: 0
#
# Advanced view:
#
#
# |table_name                |description                         |type    | total_violations|
# |:-------------------------|:-----------------------------------|:-------|----------------:|
# |Verifying data uniqueness |CODE and TYPE combination is unique |success |               NA|
# |Verifying data uniqueness |CODE column is string               |success |               NA|
# |Verifying data uniqueness |ID column is not empty              |success |               NA|
# |Verifying data uniqueness |ID column is unique                 |success |               NA|
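
To make the four checks above concrete, the sketch below reproduces their logic with base R only, on a small hand-built data frame. The data values and the "|" separator used to combine code and type are assumptions made for illustration; this is not how data.validator or assertr implement the checks internally.

```r
# Hypothetical stand-in for the mock data generated with fixtuRes.
my_data_frame <- data.frame(
  id = c("a1", "a2", "a3"),
  code = c("X", "X", "Y"),
  type = c("t1", "t2", "t1"),
  stringsAsFactors = FALSE
)

# ID column is unique: no value occurs more than once.
id_is_unique <- !any(duplicated(my_data_frame$id))

# ID column is not empty: no missing values and no empty strings.
id_not_empty <- all(!is.na(my_data_frame$id) & my_data_frame$id != "")

# CODE column is string.
code_is_string <- is.character(my_data_frame$code)

# CODE and TYPE combination is unique: build a composite key per row.
code_type_key <- paste(my_data_frame$code, my_data_frame$type, sep = "|")
code_type_unique <- !any(duplicated(code_type_key))

c(id_is_unique, id_not_empty, code_is_string, code_type_unique)
# TRUE TRUE TRUE TRUE
```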

Custom reporting on leaflet map

Other examples

Using custom report templates

To generate an R Markdown report, data.validator uses a predefined report template. You can find it in inst/rmarkdown/templates/standard/skeleton/skeleton.Rmd.

The template contains the basic requirements that every report template used by the save_report function must meet:

  • defining params
params:
  generate_report_html: !expr function(...) {}
  extra_params: list()
  • calling content renderer chunk
```{r generate_report, echo = FALSE}
params$generate_report_html(params$extra_params)
```

If you want to use the template as a base, you can use RStudio. Load the package and choose File -> New File -> R Markdown -> From template -> Simple structure for HTML report summary. Then modify the template, for example by adding a custom title or graphics, while leaving the two elements above unchanged, and pass its path to the template parameter of save_report.
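
A minimal sketch of passing a custom template to save_report follows. The file name my_custom_template.Rmd is hypothetical; the sketch assumes the file keeps the params block and the generate_report chunk described above, and falls back to the predefined template when the file is missing.

```r
library(magrittr)
library(data.validator)

report <- data_validation_report()

validate(mtcars, name = "Verifying cars dataset") %>%
  validate_if(drat > 0, description = "Column drat has only positive values") %>%
  add_results(report)

# Hypothetical path to a customized copy of the standard template.
custom_template <- "my_custom_template.Rmd"

if (file.exists(custom_template)) {
  # Render the report with the custom template.
  save_report(report, template = custom_template)
} else {
  # Fall back to the predefined template shipped with the package.
  save_report(report)
}
```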

How can the package be used in production?

The package has been used successfully by Appsilon in a production environment to protect Shiny apps from being run on incorrect data.

The workflow was based on the following steps:

  1. The RStudio Connect scheduler runs daily.

  2. The scheduler reads the data from a PostgreSQL table and validates it against predefined rules.

  3. Based on the validation results, a new data.validator report is created.

4a. When the data violates the rules:

  • the data provider and the person responsible for data quality receive the report by email

  • thanks to assertr functionality, the report is easy to understand for both technical and non-technical readers

  • the data provider makes the required data fixes

4b. When the data is correct:

  • a specific trigger is sent to reload the Shiny data
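
The branching in steps 4a and 4b can be sketched as a small decision function. This is an illustration under stated assumptions, not Appsilon's production code: decide_next_step and the two action names are hypothetical, and the input is assumed to be a results table with a type column like the one shown by print(report) (in a real job it might be obtained with get_results(report), which is also an assumption here).

```r
# Decide what the scheduled job should do next, given a results table
# with a `type` column holding "success", "warning" or "error".
decide_next_step <- function(results) {
  if (any(results$type == "error")) {
    "notify_data_provider"   # step 4a: send the report by email
  } else {
    "reload_shiny_data"      # step 4b: trigger a data reload
  }
}

# Hypothetical results tables used only to exercise the function.
failing <- data.frame(
  description = c("ID column is unique", "ID column is not empty"),
  type = c("success", "error"),
  stringsAsFactors = FALSE
)
passing <- data.frame(
  description = "ID column is unique",
  type = "success",
  stringsAsFactors = FALSE
)

decide_next_step(failing)
# "notify_data_provider"
decide_next_step(passing)
# "reload_shiny_data"
```

In this sketch warnings do not block the reload; whether they should is a policy choice for the team that owns the data.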

Working example

See a simple example of scheduled validation that stores data as a pin here: connect_validation_workflow

[Figure: diagram of the validation workflow]

How to contribute?

If you want to contribute to this project, please submit a regular PR once you are done with a new feature or bug fix.

Reporting a bug is also helpful. Please use GitHub issues and describe your problem in as much detail as possible.

Appsilon

Appsilon is a Posit (formerly RStudio) Full Service Certified Partner.
Learn more at appsilon.com.

Explore the Rhinoverse - a family of R packages built around Rhino!

We are hiring!
