  • Stars: 420
  • Rank: 103,194 (Top 3 %)
  • Language: HTML
  • License: Creative Commons ...
  • Created over 5 years ago
  • Updated 9 months ago


Repository Details

A list of software and papers related to automatic and fast Exploratory Data Analysis

autoEDA-resources

A list of software and papers related to automated Exploratory Data Analysis, including

  • fast data exploration and visualization,
  • augmented analytics,
  • visualization recommendation and other tools that speed up data exploration (visual exploration in particular).

Pull requests with software, papers and conference presentations are welcome.

Software

R packages

My summary of R packages is published in The R Journal.

General Packages

  • dataMaid (CRAN package) - automated checks of data validity.

  • DataExplorer (CRAN package) - automated data exploration (including univariate and bivariate plots, PCA) and treatment.

  • funModeling (CRAN package) - automated EDA, simple feature engineering and outlier detection.

  • SmartEDA (CRAN package) - automated generation of descriptive statistics and uni- and bivariate plots, parallel coordinate plots. Details can be found in a dedicated paper.

  • autoEDA (GitHub package) - automated EDA with uni- and bivariate plots. An article with an introduction can be found on LinkedIn.

  • visdat (CRAN package) - 6 exploratory/diagnostic plots for initial data analysis.

  • dlookr (CRAN package) - tools for data quality diagnosis, basic exploration and feature transformations.

  • xray (CRAN package) - first look at the data - distributions and anomalies. More in the blog post.

  • arsenal (CRAN package) - statistical summaries (models and exploration) and quick reporting.

  • RtutoR (CRAN package) - learning material with an automatic reports module. More at R-Bloggers.

  • exploreR (CRAN package) - exploration based on univariate linear regression.

  • summarytools (CRAN package) - tools to summarise datasets and perform simple uni- and bivariate analyses.

  • inspectdf (CRAN package) - tools for column-wise exploration and comparison of data frames. Examples are provided in the README of the GitHub repo.

  • explore (CRAN package) - interactive Shiny app for comprehensive dataset exploration (including uni- and bivariate relationships, correlation analysis and simple modeling with decision trees) and a stand-alone function for quick exploration. Examples are given in a vignette.

  • skimr (CRAN package) - well-formatted summaries of data frames, vectors and matrices. Examples are provided in a vignette.

  • janitor (CRAN package) - tools for fast data cleaning. All functionalities are introduced in the vignette.

  • autoplotly (CRAN package) - a library for fast visualization of statistical results supported by ggfortify. Details can be found in the vignette or the JOSS paper.

  • brinton (CRAN package) - package for quick exploration and visualization. Details can be found in the documentation.

  • AEDA (GitHub package) - summary statistics, correlation analysis, cluster analysis, PCA & other projections.

  • automatic-data-explorer (GitHub package) - basic EDA and creating Markdown reports from multiple R scripts.

  • xda (GitHub package) - basic data summaries.

  • modeler (GitHub package) - tools for exploration and pre-processing.

  • IEDA (GitHub package) - EDA simplified through interactive visualization.

  • dfvis (GitHub package) - ggplot2 based implementation of tabplot.

Domain-specific packages

Related packages

  • featuretoolsR (CRAN package) - R port of a Python library for automated feature engineering.

  • vtreat (CRAN package) - data treatment (pre-processing) that includes dealing with missing data and large categorical variables. Details can be found in the paper about vtreat.

  • report - automated modeling report generation.

  • FactoInvestigate (CRAN package) - has an automatic reporting module which selects the best plots to summarise different projection techniques.

  • gtsummary (GitHub package) - presentation-ready tables summarizing data sets, regression models, and more.

  • clean (CRAN package) - fast data cleaning.

  • finalfit (CRAN package) - tables and plots to quickly visualize regression results.

  • modelsummary (GitHub package) - summary tables for regression models.

Python libraries

General Packages

  • DataPrep (pip library) - data preparation library with an EDA package. Paper about the package.

  • Dora (pip library) - data cleaning, feature engineering and simple modeling tools.

  • statsmodels (pip library) - collection of statistical tools, including EDA.

  • TPOT (pip library) - autoML tool with a feature engineering module.

  • HoloViews (pip library) - automated visualization based on short data annotations.

  • lens (pip library) - fast calculation of summary statistics and correlations. Presentation about the library.

  • pandas-profiling - popular library for quick data summaries and correlation analysis.

  • speedML (pip library) - large library for ML with a module dedicated to fast EDA.

  • edaviz - Python library for fast data exploration that provides functions for dataset overviews, bivariate plots and finding good predictors. (The free version only works for small datasets.)

  • AutoViz - Python library for automated visualization.

  • ExploriPy - Python library for various EDA tasks.

  • pandas-summary - simple extension to pandas.describe.

  • sweetviz - visualizations for automated EDA.
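Several of the libraries above automate the same first step: a per-column overview of a dataset (counts, missing values, basic statistics). The following is a minimal sketch of that idea using only the Python standard library. It is not the API of any package listed here; the function name `summarize_columns` and the sample records are hypothetical and chosen for illustration.

```python
import statistics
from collections import Counter


def summarize_columns(rows):
    """Return a per-column summary for a list of dict records.

    Hypothetical helper: numeric columns get mean/min/max,
    all other columns get the most frequent value and a unique count.
    """
    # Collect column names in first-seen order across all records.
    names = []
    for row in rows:
        for name in row:
            if name not in names:
                names.append(name)

    summary = {}
    for name in names:
        # A key that is absent from a record is treated as missing (None).
        values = [row.get(name) for row in rows]
        present = [v for v in values if v is not None]
        info = {"count": len(values), "missing": len(values) - len(present)}

        is_numeric = bool(present) and all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in present
        )
        if is_numeric:
            info["type"] = "numeric"
            info["mean"] = statistics.mean(present)
            info["min"] = min(present)
            info["max"] = max(present)
        else:
            info["type"] = "categorical"
            info["unique"] = len(set(present))
            info["top"] = Counter(present).most_common(1)[0][0] if present else None
        summary[name] = info
    return summary


# Made-up sample data, for illustration only.
records = [
    {"age": 31, "city": "Warsaw"},
    {"age": 45, "city": "Krakow"},
    {"age": None, "city": "Warsaw"},
]
report = summarize_columns(records)
for column, info in report.items():
    print(column, info)
```

The listed libraries go well beyond this (plots, correlation analysis, HTML reports), so the sketch is only meant to show the kind of summary they generate without manual effort.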

Related packages

  • featuretools - library for automated feature engineering.

  • pyvtreat - Python version of R's vtreat package.

  • autoimpute - easier handling of missing values.

  • Auto_TS - automated time series modeling.

Stata packages

  • eda - a package that produces a PDF report with all permutations of univariate and bivariate visualizations and tables. Notably, three-dimensional displays are also possible.

Web services

  • DIVE - MIT's tool for data exploration that tries to choose the best (most informative) visualizations.

  • Automatic Statistician - tool for automated EDA and modeling.

  • Several Shiny apps by R Squared Computing, including visulizer and descriptr.

Standalone software

  • auto-eda - automatic EDA with SQL.

  • elycite - tools for exploration and modelling available (locally) as a web application. Designed for NLP problems.

Papers and short articles

Methods and tools for autoEDA

Visualization recommendation frameworks

Augmented analytics

Conference presentations