autoEDA-resources
A list of software and papers related to automated Exploratory Data Analysis, including
- fast data exploration and visualization,
- augmented analytics,
- visualization recommendation and other tools that speed up data exploration (visual exploration in particular).
Pull requests with software, papers and conference presentations are welcome.
Software
R packages
My summary of R packages is in the R Journal.
General Packages
- dataMaid (CRAN package) - automated checks of data validity.
- DataExplorer (CRAN package) - automated data exploration (including univariate and bivariate plots, PCA) and treatment.
- funModeling (CRAN package) - automated EDA, simple feature engineering and outlier detection.
- SmartEDA (CRAN package) - automated generation of descriptive statistics, uni- and bivariate plots, and parallel coordinate plots. Details can be found in a dedicated paper.
- autoEDA (GitHub package) - automated EDA with uni- and bivariate plots. An article with an introduction can be found on LinkedIn.
- visdat (CRAN package) - 6 exploratory/diagnostic plots for initial data analysis.
- dlookr (CRAN package) - tools for data quality diagnosis, basic exploration and feature transformations.
- xray (CRAN package) - a first look at the data: distributions and anomalies. More in the blog post.
- arsenal (CRAN package) - statistical summaries (models and exploration) and quick reporting.
- RtutoR (CRAN package) - learning material with an automatic reports module. More at R-Bloggers.
- exploreR (CRAN package) - exploration based on univariate linear regression.
- summarytools (CRAN package) - tables to summarise datasets and perform simple uni- and bivariate analyses.
- inspectdf (CRAN package) - tools for column-wise exploration and comparison of data frames. Examples are provided in the README of the GitHub repo.
- explore (CRAN package) - interactive Shiny app for comprehensive dataset exploration (including uni- and bivariate relationships, correlation analysis and simple modeling with decision trees) and a stand-alone function for quick exploration. Examples are given in a vignette.
- skimr (CRAN package) - well-formatted summaries of data frames, vectors and matrices. Examples are provided in a vignette.
- janitor (CRAN package) - tools for fast data cleaning. All functionalities are introduced in the vignette.
- autoplotly (CRAN package) - a library for fast visualization of statistical results supported by ggfortify. Details can be found in the vignette or the JOSS paper.
- brinton (CRAN package) - package for quick exploration and visualization. Details can be found in the documentation.
- AEDA (GitHub package) - summary statistics, correlation analysis, cluster analysis, PCA & other projections.
- automatic-data-explorer (GitHub package) - basic EDA and creating Markdown reports from multiple R scripts.
- xda (GitHub package) - basic data summaries.
- modeler (GitHub package) - tools for exploration and pre-processing.
- IEDA (GitHub package) - EDA simplified through interactive visualization.
- dfvis (GitHub package) - ggplot2-based implementation of tabplot.
Domain-specific packages
- RBioPlot (GitHub package) - automated data analysis and visualization for molecular biology. Details can be found in the paper at NCBI.
- ExPanDaR - package for interactive data visualization. Designed for longitudinal data, but can also be used with other types of data after setting an artificial time variable. Shiny apps with examples are provided on the GitHub website of the package.
- brolgar (GitHub package) - tools to assist in longitudinal data analysis.
- POMA (Bioconductor package) - structured, reproducible and easy-to-use workflow for the visualization, pre-processing, exploratory data analysis and statistical analysis of mass spectrometry data. A POMA R/Shiny version is available here.
Related packages
- featuretoolsR (CRAN package) - R port of a Python library for automated feature engineering.
- vtreat (CRAN package) - data treatment (pre-processing) that includes dealing with missing data and large categorical variables. Details can be found in the paper about vtreat.
- report - automated modeling report generation.
- FactoInvestigate (CRAN package) - has an automatic reporting module which selects the best plots that summarise different projection techniques.
- gtsummary (GitHub package) - presentation-ready tables summarizing data sets, regression models, and more.
- clean (CRAN package) - fast data cleaning.
- finalfit (CRAN package) - tables and plots to quickly visualize regression results.
- modelsummary (GitHub package) - summary tables for regression models.
Python libraries
General Packages
- DataPrep (pip library) - data preparation library with an EDA package. Paper about the package.
- Dora (pip library) - data cleaning, feature engineering and simple modeling tools.
- statsmodels (pip library) - collection of statistical tools, including EDA.
- TPOT (pip library) - autoML tool with a feature engineering module.
- HoloViews (pip library) - automated visualization based on short data annotations.
- lens (pip library) - fast calculation of summary statistics and correlations. Presentation about the library.
- pandas-profiling - popular library for quick data summaries and correlation analysis.
- speedML (pip library) - large library for ML with a module dedicated to fast EDA.
- edaviz - Python library for fast data exploration that provides functions for dataset overviews, bivariate plots and finding good predictors. The free version only works for small datasets.
- AutoViz - Python library for automated visualization.
- ExploriPy - Python library for various EDA tasks.
- pandas-summary - simple extension to pandas' DataFrame.describe.
- sweetviz - visualizations for automated EDA.
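Most of the libraries above share one core step: look at each column, decide whether it is numeric or categorical, and produce a matching summary without being asked. The sketch below illustrates that step in plain Python. It is not the API of any library listed here; the function name `summarize_columns`, the record format (a list of dicts with `None` for missing values) and the sample data are assumptions made for illustration.

```python
from collections import Counter
from statistics import mean, median


def summarize_columns(rows):
    """Summarize every column of a list of dict records.

    Assumption: missing values are encoded as None.
    Numeric columns get min/max/mean/median; all other
    columns get the number of unique values and the most
    frequent value.
    """
    # Pivot the records into column -> list of values.
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)

    summary = {}
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        info = {"count": len(present), "missing": len(values) - len(present)}
        # A column is numeric if it has data and every value is int/float
        # (bool is excluded because it is a subclass of int).
        numeric = bool(present) and all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in present
        )
        if numeric:
            info.update(
                kind="numeric",
                min=min(present),
                max=max(present),
                mean=mean(present),
                median=median(present),
            )
        else:
            top = Counter(present).most_common(1)
            info.update(
                kind="categorical",
                unique=len(set(present)),
                top=top[0][0] if top else None,
            )
        summary[name] = info
    return summary


if __name__ == "__main__":
    # Hypothetical sample data.
    records = [
        {"age": 30, "city": "Oslo"},
        {"age": None, "city": "Rome"},
        {"age": 50, "city": "Oslo"},
    ]
    for column, info in summarize_columns(records).items():
        print(column, info)
```

The libraries listed above go much further (plots, correlations, HTML reports), but the type-driven dispatch shown here is the common starting point.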
Related packages
- featuretools - library for automated feature engineering.
- pyvtreat - Python version of R's vtreat package.
- autoimpute - easier handling of missing values.
- Auto_TS - automated time series modeling.
Stata packages
- eda - a package that produces a PDF report with all permutations of univariate and bivariate visualizations and tables. Notably, three-dimensional displays are also possible.
Web services
- DIVE - MIT's tool for data exploration that tries to choose the best (most informative) visualizations.
- Automatic Statistician - tool for automated EDA and modeling.
- Several Shiny apps by R Squared Computing, including visulizer and descriptr.
Standalone software
- auto-eda - automatic EDA with SQL.
- elycite - tools for exploration and modelling available (locally) as a web application. Designed for NLP problems.
Papers and short articles
Methods and tools for autoEDA
- Interactive Data Exploration with “Big Data Tukey Plots” - automated visualization of big data.
- Extracting Top-K Insights from Multi-dimensional Data.
- Agency plus Automation: Designing Artificial Intelligence into Interactive Systems
- The Landscape of R Packages for Automated Exploratory Data Analysis
- Issues in Automating Exploratory Data Analysis
- Automating anomaly detection for exploratory data analytics
- Task-Oriented Optimal Sequencing of Visualization Charts
- A Rank-by-Feature Framework for Interactive Exploration of Multidimensional Data - a paper that describes many measures that can be used to sort 1D and 2D data displays.
- Towards a benefit-based optimizer for Interactive Data Analysis
- Spotfire: an information exploration environment
- AlphaClean: Automatic Generation of Data Cleaning Pipelines
- Testing MS Excel's autoEDA tool
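The rank-by-feature idea mentioned in the list above (score every 1D or 2D display with a measure, then sort the displays by that score) can be sketched in a few lines. The example below ranks all pairs of numeric columns by the absolute value of their Pearson correlation, which is one plausible 2D measure. It is an illustration only, not the implementation from the paper; the function names `pearson` and `rank_pairs` and the sample table are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.

    Returns 0.0 when either sequence is constant, so that such
    pairs sink to the bottom of the ranking instead of raising.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_pairs(table):
    """Rank column pairs of a dict of columns by |correlation|.

    Returns a list of (column_a, column_b, score) tuples,
    strongest relationship first.
    """
    scored = [
        (a, b, abs(pearson(table[a], table[b])))
        for a, b in combinations(sorted(table), 2)
    ]
    scored.sort(key=lambda item: item[2], reverse=True)
    return scored


if __name__ == "__main__":
    # Hypothetical sample table: column name -> values.
    data = {
        "x": [1, 2, 3, 4],
        "y": [2, 4, 6, 8],
        "z": [4, 1, 3, 2],
    }
    for a, b, score in rank_pairs(data):
        print(f"{a} vs {b}: {score:.2f}")
```

A tool built on this idea would show the scatter plot for the top-ranked pair first; swapping `pearson` for another measure changes the ordering without changing the rest of the code.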
Visualization recommendation frameworks
- Foresight: Recommending Visual Insights - Foresight is a system that helps the user rapidly discover visual insights from large high-dimensional datasets.
- DIVE: A Mixed-Initiative System Supporting Integrated Data Exploration Workflows. The web app is available on the MIT website.
- Voyager: Exploratory Analysis via Faceted Browsing of Visualization Recommendations.
- Voyager 2: Augmenting Visual Analysis with Partial View Specifications
- VizML: A Machine Learning Approach to Visualization Recommendation
- VizDeck: Streamlining Exploratory Visual Analytics of Scientific Data
- Taggle: Scalable Visualization of Tabular Data through Aggregation
Augmented analytics
- Augmenting Visualizations with Interactive Data Facts to Facilitate Interpretation and Communication.