Automatically find issues in image datasets and practice data-centric computer vision.

CleanVision automatically detects potential issues in image datasets, such as images that are blurry, under- or over-exposed, or (near) duplicates. This data-centric AI package is a quick first step for any computer vision project: it finds problems in the dataset that you will want to address before applying machine learning. CleanVision is simple to use: the same few lines of Python code audit any image dataset.

Installation

pip install cleanvision

Quickstart

Download an example dataset (optional). Or just use any collection of image files you have.

wget -nc 'https://cleanlab-public.s3.amazonaws.com/CleanVision/image_files.zip'
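
On systems without wget (a default Windows install, for example), the same download can be sketched in Python using only the standard library. The helper name fetch_if_missing is hypothetical and not part of CleanVision; it mimics the -nc (no-clobber) flag by skipping the download when the target file already exists.

```python
from pathlib import Path
from urllib.request import urlretrieve

# URL taken from the wget command above.
DATASET_URL = "https://cleanlab-public.s3.amazonaws.com/CleanVision/image_files.zip"


def fetch_if_missing(url: str, dest: str) -> bool:
    """Download `url` to `dest` unless `dest` already exists.

    Returns True if a download happened and False if the file was
    already present, which is the behavior `wget -nc` provides.
    """
    target = Path(dest)
    if target.exists():
        return False
    urlretrieve(url, str(target))
    return True


# Example call (left commented out so importing this sketch has no side effects):
# fetch_if_missing(DATASET_URL, "image_files.zip")
```

The downloaded file is a zip archive, so it presumably needs to be extracted before its folder can be passed as data_path.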

  1. Run CleanVision to audit the images.

from cleanvision import Imagelab

# Specify path to folder containing the image files in your dataset
imagelab = Imagelab(data_path="FOLDER_WITH_IMAGES/")

# Automatically check for a predefined list of issues within your dataset
imagelab.find_issues()

# Produce a neat report of the issues found in your dataset
imagelab.report()

  2. CleanVision diagnoses many types of issues, but you can also check for only specific issues.

issue_types = {"dark": {}, "blurry": {}}

imagelab.find_issues(issue_types=issue_types)

# Produce a report with only the specified issue_types
imagelab.report(issue_types=issue_types)
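
Because issue_types is a plain dict keyed by issue name, a typo in a key is easy to make. A minimal sketch of a guard is shown here; build_issue_types and KNOWN_ISSUE_KEYS are hypothetical names, not part of the CleanVision API, and the key list is copied from the issue keys documented in this README.

```python
# Issue keys as listed in this README's table of supported issue types.
KNOWN_ISSUE_KEYS = (
    "exact_duplicates",
    "near_duplicates",
    "blurry",
    "low_information",
    "dark",
    "light",
    "grayscale",
    "odd_aspect_ratio",
    "odd_size",
)


def build_issue_types(keys):
    """Return {key: {}} for each requested key, rejecting unknown keys."""
    unknown = sorted(set(keys) - set(KNOWN_ISSUE_KEYS))
    if unknown:
        raise ValueError(f"Unknown issue keys: {unknown}")
    # An empty dict per key means "no extra options", as in the example above.
    return {key: {} for key in keys}


issue_types = build_issue_types(["dark", "blurry"])
# The result can then be passed on exactly as before:
# imagelab.find_issues(issue_types=issue_types)
```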

Clean your data for better Computer Vision

The quality of machine learning models hinges on the quality of the data used to train them, but it is hard to manually identify all of the low-quality data in a big dataset. CleanVision helps you automatically identify common types of data issues lurking in image datasets.

This package currently detects issues in the raw images themselves, making it a useful tool for any computer vision task, such as classification, segmentation, object detection, pose estimation, keypoint detection, or generative modeling. To detect issues in the labels of your image data, use the cleanlab package instead.

In any collection of image files (most formats supported), CleanVision can detect the following types of issues:

| # | Issue Type       | Description                                               | Issue Key        |
|---|------------------|-----------------------------------------------------------|------------------|
| 1 | Exact Duplicates | Images that are identical to each other                   | exact_duplicates |
| 2 | Near Duplicates  | Images that are visually almost identical                 | near_duplicates  |
| 3 | Blurry           | Images where details are fuzzy (out of focus)             | blurry           |
| 4 | Low Information  | Images lacking content (little entropy in pixel values)   | low_information  |
| 5 | Dark             | Irregularly dark images (underexposed)                    | dark             |
| 6 | Light            | Irregularly bright images (overexposed)                   | light            |
| 7 | Grayscale        | Images lacking color                                      | grayscale        |
| 8 | Odd Aspect Ratio | Images with an unusual aspect ratio (overly skinny/wide)  | odd_aspect_ratio |
| 9 | Odd Size         | Images that are abnormally large or small                 | odd_size         |
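
To make a few of these descriptions concrete, here is a rough pure-Python sketch of the kind of signal such checks can rely on. It is not CleanVision's implementation: the function names, the use of SHA-256 over raw file bytes, and the specific brightness, entropy, and aspect-ratio formulas are all assumptions chosen for illustration. Images are represented as flat lists of 8-bit grayscale values to keep the example free of dependencies.

```python
import hashlib
import math
from collections import Counter, defaultdict


def exact_duplicate_groups(files):
    """Group names whose raw bytes are identical. `files` maps name -> bytes."""
    by_digest = defaultdict(list)
    for name, data in files.items():
        by_digest[hashlib.sha256(data).hexdigest()].append(name)
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


def mean_brightness(pixels):
    """Average 8-bit grayscale value scaled to 0..1 (low suggests a dark image)."""
    return sum(pixels) / (255 * len(pixels))


def pixel_entropy(pixels):
    """Shannon entropy in bits of the pixel-value histogram (low = little content)."""
    total = len(pixels)
    counts = Counter(pixels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def aspect_ratio(width, height):
    """Short side divided by long side: 1.0 is square, near 0 is very skinny."""
    return min(width, height) / max(width, height)


# A uniformly black image: as dark as possible and carrying no information.
black = [0] * 16
# A two-tone image: exactly one bit of entropy per pixel.
two_tone = [0, 255] * 8
```

A byte-level hash only catches files that are identical on disk; the same picture saved in two formats would need a comparison of decoded pixels or a perceptual measure, which is the territory of the near-duplicate check.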

This package is still a work in progress, so expect some rough edges. Feel free to report any bugs you find, or request functionality you would like, by opening an issue!

CleanVision supports Linux, macOS, and Windows and runs on Python 3.7+.

License

Copyright (c) 2022 Cleanlab Inc.

cleanvision is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

cleanvision is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

See GNU Affero General Public LICENSE for details.

For enterprise teams that want to use CleanVision in production workflows but are unable to open-source their code, you can contact us to discuss alternative commercial licensing: [email protected]
