  • Stars: 420
  • Rank: 102,563 (Top 3 %)
  • Language: Jupyter Notebook
  • License: MIT License
  • Created about 3 years ago
  • Updated 3 months ago

Repository Details

Data Quality assessment with one line of code

YData Quality

ydata_quality is an open-source Python library for assessing Data Quality throughout the multiple stages of data pipeline development.

A holistic view of the data can only be captured by looking at it from multiple dimensions. ydata_quality evaluates each dimension in a modular way, wrapped into a single Data Quality engine. This repository contains the core Python source scripts and walkthrough tutorials.

Quickstart

The source code is currently hosted on GitHub at: https://github.com/ydataai/ydata-quality

Binary installers for the latest released version are available at the Python Package Index (PyPI).

pip install ydata-quality

Comprehensive quality check in a few lines of code

from ydata_quality import DataQuality
import pandas as pd

# Load the data
df = pd.read_csv('./datasets/transformed/census_10k.csv')

# Create a DataQuality object from the main class that holds all quality modules
dq = DataQuality(df=df)

# Run the tests and print a summary of the results
results = dq.evaluate()

Warnings:
	TOTAL: 5 warning(s)
	Priority 1: 1 warning(s)
	Priority 2: 4 warning(s)

Priority 1 - heavy impact expected:
	* [DUPLICATES - DUPLICATE COLUMNS] Found 1 columns with exactly the same feature values as other columns.
Priority 2 - usage allowed, limited human intelligibility:
	* [DATA RELATIONS - HIGH COLLINEARITY - NUMERICAL] Found 3 numerical variables with high Variance Inflation Factor (VIF>5.0). The variables listed in results are highly collinear with other variables in the dataset. These will make model explainability harder and potentially give way to issues like overfitting. Depending on your end goal you might want to remove the highest VIF variables.
	* [ERRONEOUS DATA - PREDEFINED ERRONEOUS DATA] Found 1960 ED values in the dataset.
	* [DATA RELATIONS - HIGH COLLINEARITY - CATEGORICAL] Found 10 categorical variables with significant collinearity (p-value < 0.05). The variables listed in results are highly collinear with other variables in the dataset and sorted descending according to propensity. These will make model explainability harder and potentially give way to issues like overfitting. Depending on your end goal you might want to remove variables following the provided order.
	* [DUPLICATES - EXACT DUPLICATES] Found 3 instances with exact duplicate feature values.
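
The HIGH COLLINEARITY - NUMERICAL warning above refers to the Variance Inflation Factor. As a rough illustration of what that number means, the sketch below computes the textbook VIF, 1 / (1 - R²), by regressing each column on the others with NumPy. It is not the code ydata_quality runs internally; the function name, the synthetic columns `a`, `b`, `c`, and the random seed are all assumptions made for this example. Only the 5.0 threshold comes from the summary.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return one textbook VIF per column of a 2-D numeric array.

    Illustrative helper only; not part of ydata_quality.
    Assumes no column is constant (otherwise the total variance is zero).
    """
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Include an intercept so R^2 is measured around the target's mean.
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        ss_res = float(residuals @ residuals)
        ss_tot = float(((target - target.mean()) ** 2).sum())
        r_squared = 1.0 - ss_res / ss_tot
        vifs.append(float("inf") if r_squared >= 1.0 else 1.0 / (1.0 - r_squared))
    return vifs

# Synthetic data (assumption): `c` is almost a copy of `a`, `b` is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)

vifs = variance_inflation_factors(np.column_stack([a, b, c]))
# Apply the same VIF > 5.0 cut-off quoted in the warning.
flagged = [name for name, v in zip("abc", vifs) if v > 5.0]
print(flagged)
```

Here `a` and `c` explain each other almost perfectly, so both receive a large VIF, while the independent column `b` stays close to 1. Dropping one of the two collinear columns would bring the remaining values back down.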

On top of the summary, you can retrieve a list of detected warnings for detailed inspection.

# retrieve a list of data quality warnings 
warnings = dq.get_warnings()
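
One way to work with such a list is to group it by priority before inspecting it. The sketch below uses a hypothetical stand-in class instead of the library's own warning type, and it assumes each warning carries `category`, `test`, `priority`, and `description` fields. The printed summary suggests these exist, but the attribute names are an assumption and should be checked against the installed version. The sample entries reuse three warnings from the summary above.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class StandInWarning:
    """Hypothetical stand-in for a quality warning; not the library's class."""
    category: str
    test: str
    priority: int
    description: str

def group_by_priority(warnings):
    """Group warnings by priority, lowest number (highest impact) first."""
    grouped = defaultdict(list)
    for warning in warnings:
        grouped[warning.priority].append(warning)
    return dict(sorted(grouped.items()))

# Entries copied from the summary shown earlier in this README.
sample = [
    StandInWarning("Erroneous Data", "Predefined Erroneous Data", 2,
                   "Found 1960 ED values in the dataset."),
    StandInWarning("Duplicates", "Duplicate Columns", 1,
                   "Found 1 columns with exactly the same feature values as other columns."),
    StandInWarning("Duplicates", "Exact Duplicates", 2,
                   "Found 3 instances with exact duplicate feature values."),
]

grouped = group_by_priority(sample)
for priority, items in grouped.items():
    print(f"Priority {priority}: {[w.test for w in items]}")
```

If the real warning objects expose the same fields, `group_by_priority(dq.get_warnings())` would work unchanged. Otherwise only the attribute access inside the helper needs adjusting.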

Examples

Here you can find walkthrough tutorials and examples to familiarize yourself with the different modules of ydata_quality.

To dive into a specific module and understand how it works, see the tutorial notebooks:

  1. Bias and Fairness
  2. Data Expectations
  3. Data Relations
  4. Drift Analysis
  5. Duplicates
  6. Labelling: Categoricals and Numericals
  7. Missings
  8. Erroneous Data

Contributing

We are open to collaboration! If you want to start contributing you only need to:

  1. Search for an issue you would like to work on. Issues for newcomers are labeled with good first issue.
  2. Create a PR solving the issue.
  3. We will review every PR and either accept it or ask for revisions.

You can also join the discussions on our Discord Community and request features/bug fixes by opening issues on our repository.

Support

For support in using this library, please join our Discord server. The Discord community is very friendly and great about quickly answering questions about the use and development of the library. Click here to join our Discord community!

License

GNU General Public License v3.0

About

With ♥️ from YData Development team
