• Stars
    star
    295
  • Rank 140,902 (Top 3 %)
  • Language
    R
  • License
    Other
  • Created over 5 years ago
  • Updated over 1 year ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Tidy data frames and expressions with statistical summaries šŸ“œ

{statsExpressions}: Tidy dataframes and expressions with statistical details

Status Usage Miscellaneous
R build status Total downloads Codecov
lifecycle Daily downloads DOI

Introduction

The {statsExpressions} package has two key aims:

  • to provide a consistent syntax to do statistical analysis with tidy data (in pipe-friendly manner),
  • to provide statistical expressions (pre-formatted in-text statistical results) for plotting functions.

Statistical packages exhibit substantial diversity in terms of their syntax and expected input type. This can make it difficult to switch from one statistical approach to another. For example, some functions expect vectors as inputs, while others expect dataframes. Depending on whether it is a repeated measures design or not, different functions might expect data to be in wide or long format. Some functions can internally omit missing values, while other functions error in their presence. Furthermore, if someone wishes to utilize the objects returned by these packages downstream in their workflow, this is not straightforward either because even functions from the same package can return a list, a matrix, an array, a dataframe, etc., depending on the function.

This is where {statsExpressions} comes in: It can be thought of as a unified portal through which most of the functionality in these underlying packages can be accessed, with a simpler interface and no requirement to change data format.

This package forms the statistical processing backend for ggstatsplot package.

For more documentation, see the dedicated website.

Installation

Type Source Command
Release CRAN Status install.packages("statsExpressions")
Development Project Status pak::pak("IndrajeetPatil/statsExpressions")

Citation

The package can be cited as:

citation("statsExpressions")
To cite package 'statsExpressions' in publications use:

  Patil, I., (2021). statsExpressions: R Package for Tidy Dataframes
  and Expressions with Statistical Details. Journal of Open Source
  Software, 6(61), 3236, https://doi.org/10.21105/joss.03236

A BibTeX entry for LaTeX users is

  @Article{,
    doi = {10.21105/joss.03236},
    url = {https://doi.org/10.21105/joss.03236},
    year = {2021},
    publisher = {{The Open Journal}},
    volume = {6},
    number = {61},
    pages = {3236},
    author = {Indrajeet Patil},
    title = {{statsExpressions: {R} Package for Tidy Dataframes and Expressions with Statistical Details}},
    journal = {{Journal of Open Source Software}},
  }

General Workflow

Summary of functionality

Summary of available analyses

Test Function
one-sample t-test one_sample_test()
two-sample t-test two_sample_test()
one-way ANOVA oneway_anova()
correlation analysis corr_test()
contingency table analysis contingency_table()
meta-analysis meta_analysis()
pairwise comparisons pairwise_comparisons()

Summary of details available for analyses

Analysis Hypothesis testing Effect size estimation
(one/two-sample) t-test āœ… āœ…
one-way ANOVA āœ… āœ…
correlation āœ… āœ…
(one/two-way) contingency table āœ… āœ…
random-effects meta-analysis āœ… āœ…

Summary of supported statistical approaches

Description Parametric Non-parametric Robust Bayesian
Between group/condition comparisons āœ… āœ… āœ… āœ…
Within group/condition comparisons āœ… āœ… āœ… āœ…
Distribution of a numeric variable āœ… āœ… āœ… āœ…
Correlation between two variables āœ… āœ… āœ… āœ…
Association between categorical variables āœ… āœ… āŒ āœ…
Equal proportions for categorical variable levels āœ… āœ… āŒ āœ…
Random-effects meta-analysis āœ… āŒ āœ… āœ…

Tidy dataframes from statistical analysis

To illustrate the simplicity of this syntax, letā€™s say we want to run a one-way ANOVA. If we first run a non-parametric ANOVA and then decide to run a robust ANOVA instead, the syntax remains the same and the statistical approach can be modified by changing a single argument:

mtcars %>% oneway_anova(cyl, wt, type = "nonparametric")
#> # A tibble: 1 Ɨ 15
#>   parameter1 parameter2 statistic df.error   p.value
#>   <chr>      <chr>          <dbl>    <int>     <dbl>
#> 1 wt         cyl             22.8        2 0.0000112
#>   method                       effectsize      estimate conf.level conf.low
#>   <chr>                        <chr>              <dbl>      <dbl>    <dbl>
#> 1 Kruskal-Wallis rank sum test Epsilon2 (rank)    0.736       0.95    0.624
#>   conf.high conf.method          conf.iterations n.obs expression
#>       <dbl> <chr>                          <int> <int> <list>    
#> 1         1 percentile bootstrap             100    32 <language>

mtcars %>% oneway_anova(cyl, wt, type = "robust")
#> # A tibble: 1 Ɨ 12
#>   statistic    df df.error p.value
#>       <dbl> <dbl>    <dbl>   <dbl>
#> 1      12.7     2     12.2 0.00102
#>   method                                           
#>   <chr>                                            
#> 1 A heteroscedastic one-way ANOVA for trimmed means
#>   effectsize                         estimate conf.level conf.low conf.high
#>   <chr>                                 <dbl>      <dbl>    <dbl>     <dbl>
#> 1 Explanatory measure of effect size     1.05       0.95    0.843      1.50
#>   n.obs expression
#>   <int> <list>    
#> 1    32 <language>

All possible output dataframes from functions are tabulated here: https://indrajeetpatil.github.io/statsExpressions/articles/web_only/dataframe_outputs.html

Needless to say this will also work with the kable function to generate a table:

set.seed(123)

# one-sample robust t-test
# we will leave `expression` column out; it's not needed for using only the dataframe
mtcars %>%
  one_sample_test(wt, test.value = 3, type = "robust") %>%
  dplyr::select(-expression) %>%
  knitr::kable()
statistic p.value n.obs method effectsize estimate conf.level conf.low conf.high
1.179181 0.275 32 Bootstrap-t method for one-sample test Trimmed mean 3.197 0.95 2.854246 3.539754

These functions are also compatible with other popular data manipulation packages.

For example, letā€™s say we want to run a one-sample t-test for all levels of a certain grouping variable. We can use dplyr to do so:

# for reproducibility
set.seed(123)
library(dplyr)

# grouped operation
# running one-sample test for all levels of grouping variable `cyl`
mtcars %>%
  group_by(cyl) %>%
  group_modify(~ one_sample_test(.x, wt, test.value = 3), .keep = TRUE) %>%
  ungroup()
#> # A tibble: 3 Ɨ 16
#>     cyl    mu statistic df.error  p.value method            alternative
#>   <dbl> <dbl>     <dbl>    <dbl>    <dbl> <chr>             <chr>      
#> 1     4     3    -4.16        10 0.00195  One Sample t-test two.sided  
#> 2     6     3     0.870        6 0.418    One Sample t-test two.sided  
#> 3     8     3     4.92        13 0.000278 One Sample t-test two.sided  
#>   effectsize estimate conf.level conf.low conf.high conf.method
#>   <chr>         <dbl>      <dbl>    <dbl>     <dbl> <chr>      
#> 1 Hedges' g    -1.16        0.95   -1.88     -0.402 ncp        
#> 2 Hedges' g     0.286       0.95   -0.388     0.937 ncp        
#> 3 Hedges' g     1.24        0.95    0.544     1.91  ncp        
#>   conf.distribution n.obs expression
#>   <chr>             <int> <list>    
#> 1 t                    11 <language>
#> 2 t                     7 <language>
#> 3 t                    14 <language>

Using expressions in custom plots

Note that expression here means a pre-formatted in-text statistical result. In addition to other details contained in the dataframe, there is also a column titled expression, which contains expression with statistical details and can be displayed in a plot.

For all statistical test expressions, the default template attempt to follow the gold standard for statistical reporting.

For example, here are results from Welchā€™s t-test:

Letā€™s load the needed library for visualization:

library(ggplot2)

Expressions for centrality measure

Note that when used in a geometric layer, the expression need to be parsed.

# displaying mean for each level of `cyl`
centrality_description(mtcars, cyl, wt) |>
  ggplot(aes(cyl, wt)) +
  geom_point() +
  geom_label(aes(label = expression), parse = TRUE)

Here are a few examples for supported analyses.

Expressions for one-way ANOVAs

The returned data frame will always have a column called expression.

Assuming there is only a single result you need to display in a plot, to use it in a plot, you have two options:

  • extract the expression from the list column (results_data$expression[[1]]) without parsing
  • use the list column as is, in which case you will need to parse it (parse(text = results_data$expression))

If you want to display more than one expression in a plot, you will have to parse them.

Between-subjects design

set.seed(123)
library(ggridges)

results_data <- oneway_anova(iris, Species, Sepal.Length, type = "robust")

# create a ridgeplot
ggplot(iris, aes(x = Sepal.Length, y = Species)) +
  geom_density_ridges() +
  labs(
    title = "A heteroscedastic one-way ANOVA for trimmed means",
    subtitle = results_data$expression[[1]]
  )

Within-subjects design

set.seed(123)
library(WRS2)
library(ggbeeswarm)

results_data <- oneway_anova(
  WineTasting,
  Wine,
  Taste,
  paired = TRUE,
  subject.id = Taster,
  type = "np"
)

ggplot2::ggplot(WineTasting, aes(Wine, Taste, color = Wine)) +
  geom_quasirandom() +
  labs(
    title = "Friedman's rank sum test",
    subtitle = parse(text = results_data$expression)
  )

Expressions for two-sample tests

Between-subjects design

set.seed(123)
library(gghalves)

results_data <- two_sample_test(ToothGrowth, supp, len)

ggplot(ToothGrowth, aes(supp, len)) +
  geom_half_dotplot() +
  labs(
    title = "Two-Sample Welch's t-test",
    subtitle = parse(text = results_data$expression)
  )

Within-subjects design

set.seed(123)
library(tidyr)
library(PairedData)
data(PrisonStress)

# get data in tidy format
df <- pivot_longer(PrisonStress, starts_with("PSS"), names_to = "PSS", values_to = "stress")

results_data <- two_sample_test(
  data = df,
  x = PSS,
  y = stress,
  paired = TRUE,
  subject.id = Subject,
  type = "np"
)

# plot
paired.plotProfiles(PrisonStress, "PSSbefore", "PSSafter", subjects = "Subject") +
  labs(
    title = "Two-sample Wilcoxon paired test",
    subtitle = parse(text = results_data$expression)
  )

Expressions for one-sample tests

set.seed(123)

# dataframe with results
results_data <- one_sample_test(mtcars, wt, test.value = 3, type = "bayes")

# creating a histogram plot
ggplot(mtcars, aes(wt)) +
  geom_histogram(alpha = 0.5) +
  geom_vline(xintercept = mean(mtcars$wt), color = "red") +
  labs(subtitle = parse(text = results_data$expression))

Expressions for correlation analysis

Letā€™s look at another example where we want to run correlation analysis:

set.seed(123)

# dataframe with results
results_data <- corr_test(mtcars, mpg, wt, type = "nonparametric")

# create a scatter plot
ggplot(mtcars, aes(mpg, wt)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  labs(
    title = "Spearman's rank correlation coefficient",
    subtitle = parse(text = results_data$expression)
  )

Expressions for contingency table analysis

For categorical/nominal data - one-sample:

set.seed(123)

# dataframe with results
results_data <- contingency_table(
  as.data.frame(table(mpg$class)),
  Var1,
  counts = Freq,
  type = "bayes"
)

# create a pie chart
ggplot(as.data.frame(table(mpg$class)), aes(x = "", y = Freq, fill = factor(Var1))) +
  geom_bar(width = 1, stat = "identity") +
  theme(axis.line = element_blank()) +
  # cleaning up the chart and adding results from one-sample proportion test
  coord_polar(theta = "y", start = 0) +
  labs(
    fill = "Class",
    x = NULL,
    y = NULL,
    title = "Pie Chart of class (type of car)",
    caption = parse(text = results_data$expression)
  )

You can also use these function to get the expression in return without having to display them in plots:

set.seed(123)

# Pearson's chi-squared test of independence
contingency_table(mtcars, am, vs)$expression[[1]]
#> list(chi["Pearson"]^2 * "(" * 1 * ")" == "0.91", italic(p) == 
#>     "0.34", widehat(italic("V"))["Cramer"] == "0.00", CI["95%"] ~ 
#>     "[" * "0.00", "1.00" * "]", italic("n")["obs"] == "32")

Expressions for meta-analysis

set.seed(123)
library(metaviz)
library(metaplus)

# dataframe with results
results_data <- meta_analysis(dplyr::rename(mozart, estimate = d, std.error = se))

# meta-analysis forest plot with results random-effects meta-analysis
viz_forest(
  x = mozart[, c("d", "se")],
  study_labels = mozart[, "study_name"],
  xlab = "Cohen's d",
  variant = "thick",
  type = "cumulative"
) +
  labs(
    title = "Meta-analysis of Pietschnig, Voracek, and Formann (2010) on the Mozart effect",
    subtitle = parse(text = results_data$expression)
  ) +
  theme(text = element_text(size = 12))

Customizing details to your liking

Sometimes you may not wish include so many details in the subtitle. In that case, you can extract the expression and copy-paste only the part you wish to include. For example, here only statistic and p-values are included:

set.seed(123)

# extracting detailed expression
(res_expr <- oneway_anova(iris, Species, Sepal.Length, var.equal = TRUE)$expression[[1]])
#> list(italic("F")["Fisher"](2, 147) == "119.26", italic(p) == 
#>     "1.67e-31", widehat(omega["p"]^2) == "0.61", CI["95%"] ~ 
#>     "[" * "0.53", "1.00" * "]", italic("n")["obs"] == "150")

# adapting the details to your liking
ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot() +
  labs(subtitle = ggplot2::expr(paste(
    NULL, italic("F"), "(", "2", ",", "147", ") = ", "119.26", ", ",
    italic("p"), " = ", "1.67e-31"
  )))

Summary of tests and effect sizes

Here a go-to summary about statistical test carried out and the returned effect size for each function is provided. This should be useful if one needs to find out more information about how an argument is resolved in the underlying package or if one wishes to browse the source code. So, for example, if you want to know more about how one-way (between-subjects) ANOVA, you can run ?stats::oneway.test in your R console.

centrality_description

Type Measure Function used
Parametric mean datawizard::describe_distribution()
Non-parametric median datawizard::describe_distribution()
Robust trimmed mean datawizard::describe_distribution()
Bayesian MAP datawizard::describe_distribution()

oneway_anova

between-subjects

Hypothesis testing

Type No.Ā of groups Test Function used
Parametric > 2 Fisherā€™s or Welchā€™s one-way ANOVA stats::oneway.test()
Non-parametric > 2 Kruskal-Wallis one-way ANOVA stats::kruskal.test()
Robust > 2 Heteroscedastic one-way ANOVA for trimmed means WRS2::t1way()
Bayes Factor > 2 Fisherā€™s ANOVA BayesFactor::anovaBF()

Effect size estimation

Type No.Ā of groups Effect size CI available? Function used
Parametric > 2 partial eta-squared, partial omega-squared Yes effectsize::omega_squared(), effectsize::eta_squared()
Non-parametric > 2 rank epsilon squared Yes effectsize::rank_epsilon_squared()
Robust > 2 Explanatory measure of effect size Yes WRS2::t1way()
Bayes Factor > 2 Bayesian R-squared Yes performance::r2_bayes()

within-subjects

Hypothesis testing

Type No.Ā of groups Test Function used
Parametric > 2 One-way repeated measures ANOVA afex::aov_ez()
Non-parametric > 2 Friedman rank sum test stats::friedman.test()
Robust > 2 Heteroscedastic one-way repeated measures ANOVA for trimmed means WRS2::rmanova()
Bayes Factor > 2 One-way repeated measures ANOVA BayesFactor::anovaBF()

Effect size estimation

Type No.Ā of groups Effect size CI available? Function used
Parametric > 2 partial eta-squared, partial omega-squared Yes effectsize::omega_squared(), effectsize::eta_squared()
Non-parametric > 2 Kendallā€™s coefficient of concordance Yes effectsize::kendalls_w()
Robust > 2 Algina-Keselman-Penfield robust standardized difference average Yes WRS2::wmcpAKP()
Bayes Factor > 2 Bayesian R-squared Yes performance::r2_bayes()

two_sample_test

between-subjects

Hypothesis testing

Type No.Ā of groups Test Function used
Parametric 2 Studentā€™s or Welchā€™s t-test stats::t.test()
Non-parametric 2 Mann-Whitney U test stats::wilcox.test()
Robust 2 Yuenā€™s test for trimmed means WRS2::yuen()
Bayesian 2 Studentā€™s t-test BayesFactor::ttestBF()

Effect size estimation

Type No.Ā of groups Effect size CI available? Function used
Parametric 2 Cohenā€™s d, Hedgeā€™s g Yes effectsize::cohens_d(), effectsize::hedges_g()
Non-parametric 2 r (rank-biserial correlation) Yes effectsize::rank_biserial()
Robust 2 Algina-Keselman-Penfield robust standardized difference Yes WRS2::akp.effect()
Bayesian 2 difference Yes bayestestR::describe_posterior()

within-subjects

Hypothesis testing

Type No.Ā of groups Test Function used
Parametric 2 Studentā€™s t-test stats::t.test()
Non-parametric 2 Wilcoxon signed-rank test stats::wilcox.test()
Robust 2 Yuenā€™s test on trimmed means for dependent samples WRS2::yuend()
Bayesian 2 Studentā€™s t-test BayesFactor::ttestBF()

Effect size estimation

Type No.Ā of groups Effect size CI available? Function used
Parametric 2 Cohenā€™s d, Hedgeā€™s g Yes effectsize::cohens_d(), effectsize::hedges_g()
Non-parametric 2 r (rank-biserial correlation) Yes effectsize::rank_biserial()
Robust 2 Algina-Keselman-Penfield robust standardized difference Yes WRS2::wmcpAKP()
Bayesian 2 difference Yes bayestestR::describe_posterior()

one_sample_test

Hypothesis testing

Type Test Function used
Parametric One-sample Studentā€™s t-test stats::t.test()
Non-parametric One-sample Wilcoxon test stats::wilcox.test()
Robust Bootstrap-t method for one-sample test WRS2::trimcibt()
Bayesian One-sample Studentā€™s t-test BayesFactor::ttestBF()

Effect size estimation

Type Effect size CI available? Function used
Parametric Cohenā€™s d, Hedgeā€™s g Yes effectsize::cohens_d(), effectsize::hedges_g()
Non-parametric r (rank-biserial correlation) Yes effectsize::rank_biserial()
Robust trimmed mean Yes WRS2::trimcibt()
Bayes Factor difference Yes bayestestR::describe_posterior()

corr_test

Hypothesis testing and Effect size estimation

Type Test CI available? Function used
Parametric Pearsonā€™s correlation coefficient Yes correlation::correlation()
Non-parametric Spearmanā€™s rank correlation coefficient Yes correlation::correlation()
Robust Winsorized Pearsonā€™s correlation coefficient Yes correlation::correlation()
Bayesian Bayesian Pearsonā€™s correlation coefficient Yes correlation::correlation()

contingency_table

two-way table

Hypothesis testing

Type Design Test Function used
Parametric/Non-parametric Unpaired Pearsonā€™s chi-squared test stats::chisq.test()
Bayesian Unpaired Bayesian Pearsonā€™s chi-squared test BayesFactor::contingencyTableBF()
Parametric/Non-parametric Paired McNemarā€™s chi-squared test stats::mcnemar.test()
Bayesian Paired No No

Effect size estimation

Type Design Effect size CI available? Function used
Parametric/Non-parametric Unpaired Cramerā€™s V Yes effectsize::cramers_v()
Bayesian Unpaired Cramerā€™s V Yes effectsize::cramers_v()
Parametric/Non-parametric Paired Cohenā€™s g Yes effectsize::cohens_g()
Bayesian Paired No No No

one-way table

Hypothesis testing

Type Test Function used
Parametric/Non-parametric Goodness of fit chi-squared test stats::chisq.test()
Bayesian Bayesian Goodness of fit chi-squared test (custom)

Effect size estimation

Type Effect size CI available? Function used
Parametric/Non-parametric Pearsonā€™s C Yes effectsize::pearsons_c()
Bayesian No No No

meta_analysis

Hypothesis testing and Effect size estimation

Type Test Effect size CI available? Function used
Parametric Meta-analysis via random-effects models beta Yes metafor::metafor()
Robust Meta-analysis via robust random-effects models beta Yes metaplus::metaplus()
Bayes Meta-analysis via Bayesian random-effects models beta Yes metaBMA::meta_random()

Usage in ggstatsplot

Note that these functions were initially written to display results from statistical tests on ready-made {ggplot2} plots implemented in {ggstatsplot}.

For detailed documentation, see the package website: https://indrajeetpatil.github.io/ggstatsplot/

Here is an example from {ggstatsplot} of what the plots look like when the expressions are displayed in the subtitle-

Acknowledgments

The hexsticker and the schematic illustration of general workflow were generously designed by Sarah Otterstetter (Max Planck Institute for Human Development, Berlin).

Contributing

Bug reports, suggestions, questions, and (most of all) contributions are welcome.

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

More Repositories

1

ggstatsplot

Enhancing {ggplot2} plots with statistical analysis šŸ“ŠšŸ“£
R
1,730
star
2

awesome-r-pkgtools

A curated list of awesome tools for R package development
HTML
153
star
3

kittyR

Bring those kitties and their meows to R console šŸˆšŸˆšŸˆ
R
47
star
4

broomExtra

Helpers for regression analyses using `{broom}` & `{easystats}` packages šŸ“ˆ šŸ”
R
46
star
5

pairwiseComparisons

Pairwise comparison tests for one-way designs šŸ”¬šŸ“
R
45
star
6

groupedstats

Grouped statistical analysis in a tidy way šŸ”šŸ’Ŗā™»
R
45
star
7

dry-r-package-development

Tips on not repeating yourself during R package development
HTML
22
star
8

RmarkdownTips

Collection of my tweets with RMarkdown/Quarto tips and tricks
HTML
21
star
9

tidyBF

Tidy wrapper around `BayesFactor` R package (archived)
R
18
star
10

preventive-r-package-care

How to build automation infrastructure to improve user experience and development workflow of R packages
HTML
17
star
11

intro-to-snapshot-testing

Introduction to snapshot (aka golden) testing (in R)
HTML
15
star
12

Advanced-R-exercises

Solutions to "Advanced R" (2nd ed.) exercises (unofficial) šŸ“–āœļø
HTML
15
star
13

R-Function-A-Day-book

Collection of tweets from "R Function A Day" account
HTML
15
star
14

ggstatsplot_slides

An Introductory Tutorial on {ggstatsplot}: {ggplot2} Plots with Statistics šŸ‘Øā€šŸ«
14
star
15

second-hardest-cs-thing

Thoughts on how to name things for software development
HTML
13
star
16

Practical-SQL-exercises

Exercises from "Practical SQL" (1st Ed.)
PLpgSQL
13
star
17

codeswitchR

A list of helpful R packages to learn other programming languages or workflows
HTML
8
star
18

IndrajeetPatil

My GitHub profile README
8
star
19

dotInternals

Prepend Internal Package Function Names With a Dot
R
5
star
20

collectify

Collect your favourite things in one place!
JavaScript
4
star
21

refactoring-ggstatsplot

Slides on lessons learned while refactoring {ggstatsplot} package
4
star
22

famtrack

Track your family history in a fun way!
Handlebars
3
star
23

Repro-workshop

HTML
1
star
24

minimalReprex

Minimal reprexes for {pkgdown} issues
R
1
star
25

save-indy

Save Indiana Jones using JavaScript!
JavaScript
1
star