
😎 Finding duplicate images made easy!

Image Deduplicator (imagededup)

imagededup is a Python package that simplifies the task of finding exact and near duplicates in an image collection.

The package provides hashing algorithms, which are particularly good at finding exact duplicates, and convolutional neural networks, which are also adept at finding near duplicates. An evaluation framework is also provided to judge the quality of deduplication for a given dataset.

The package provides the following functionality:

  • Finding duplicates in a directory using one of the following algorithms:
      • Convolutional Neural Network (CNN)
      • Perceptual hashing (PHash)
      • Difference hashing (DHash)
      • Wavelet hashing (WHash)
      • Average hashing (AHash)
  • Generating encodings for images using one of the algorithms listed above.
  • A framework to evaluate the effectiveness of deduplication given a ground-truth mapping.
  • Plotting the duplicates found for a given image file.
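
To make the evaluation idea concrete, here is a minimal sketch of scoring a retrieved duplicate map against a ground-truth map. It is not imagededup's own evaluation code: the function name `precision_recall` and the sample filenames are assumptions made for illustration, and both maps are assumed to have the form {filename: [duplicate filenames]}.

```python
from typing import Dict, List, Tuple


def precision_recall(
    ground_truth: Dict[str, List[str]],
    retrieved: Dict[str, List[str]],
) -> Tuple[float, float]:
    """Micro-averaged precision and recall over (query, duplicate) pairs.

    Hypothetical helper for illustration; not part of imagededup.
    """
    # Flatten each mapping into a set of (query, duplicate) pairs.
    true_pairs = {(q, d) for q, dups in ground_truth.items() for d in dups}
    found_pairs = {(q, d) for q, dups in retrieved.items() for d in dups}

    correct = len(true_pairs & found_pairs)
    # Treat an empty set as a perfect score to avoid dividing by zero.
    precision = correct / len(found_pairs) if found_pairs else 1.0
    recall = correct / len(true_pairs) if true_pairs else 1.0
    return precision, recall


# Made-up sample data: a.jpg and b.jpg are true duplicates, c.jpg is unique.
ground_truth = {"a.jpg": ["b.jpg"], "b.jpg": ["a.jpg"], "c.jpg": []}
retrieved = {"a.jpg": ["b.jpg", "c.jpg"], "b.jpg": ["a.jpg"], "c.jpg": ["a.jpg"]}

precision, recall = precision_recall(ground_truth, retrieved)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this sample every true pair is found (recall 1.0), but half of the retrieved pairs are wrong (precision 0.5), which is the kind of trade-off an evaluation framework is meant to expose.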

Detailed documentation for the package can be found at: https://idealo.github.io/imagededup/

imagededup is compatible with Python 3.8+ and runs on Linux, macOS and Windows. It is distributed under the Apache 2.0 license.

⚙️ Installation

There are two ways to install imagededup:

  • Install imagededup from PyPI (recommended):
pip install imagededup
  • Install imagededup from the GitHub source:
git clone https://github.com/idealo/imagededup.git
cd imagededup
pip install "cython>=0.29"
python setup.py install

🚀 Quick Start

To find duplicates in an image directory using perceptual hashing, use the following workflow:

  • Import perceptual hashing method
from imagededup.methods import PHash
phasher = PHash()
  • Generate encodings for all images in an image directory
encodings = phasher.encode_images(image_dir='path/to/image/directory')
  • Find duplicates using the generated encodings
duplicates = phasher.find_duplicates(encoding_map=encodings)
  • Plot the duplicates obtained for a given file (e.g. 'ukbench00120.jpg') using the duplicates dictionary
from imagededup.utils import plot_duplicates
plot_duplicates(image_dir='path/to/image/directory',
                duplicate_map=duplicates,
                filename='ukbench00120.jpg')

The complete code for the workflow is:

from imagededup.methods import PHash
phasher = PHash()

# Generate encodings for all images in an image directory
encodings = phasher.encode_images(image_dir='path/to/image/directory')

# Find duplicates using the generated encodings
duplicates = phasher.find_duplicates(encoding_map=encodings)

# plot duplicates obtained for a given file using the duplicates dictionary
from imagededup.utils import plot_duplicates
plot_duplicates(image_dir='path/to/image/directory',
                duplicate_map=duplicates,
                filename='ukbench00120.jpg')
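
Once a duplicates dictionary is available, a common next step is deciding which files to drop. The sketch below is one way to do that, assuming the dictionary maps each filename to a list of its duplicate filenames. The helper `files_to_remove` and the sample data are hypothetical and not part of imagededup; the sample dictionary stands in for the output of the workflow above.

```python
from typing import Dict, List, Set


def files_to_remove(duplicate_map: Dict[str, List[str]]) -> Set[str]:
    """Keep the first file (in sorted order) of each duplicate group; mark the rest.

    Hypothetical helper for illustration; not part of imagededup.
    """
    keep: Set[str] = set()
    remove: Set[str] = set()
    for filename in sorted(duplicate_map):
        if filename in remove:
            continue  # already marked as a duplicate of a kept file
        keep.add(filename)
        for dup in duplicate_map[filename]:
            if dup not in keep:
                remove.add(dup)
    return remove


# Made-up stand-in for the dictionary returned by the workflow above.
sample_duplicates = {
    "img1.jpg": ["img2.jpg", "img3.jpg"],
    "img2.jpg": ["img1.jpg"],
    "img3.jpg": ["img1.jpg"],
    "img4.jpg": [],
}
print(sorted(files_to_remove(sample_duplicates)))
```

Iterating in sorted order makes the choice of which file to keep deterministic; any other tie-breaking rule (file size, resolution) could be substituted.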

You can also use your own custom models to find duplicates with the CNN method.

For examples, refer to this part of the repository.

For more detailed usage of the package functionality, refer to the documentation at https://idealo.github.io/imagededup/

⏳ Benchmarks

Update: The provided benchmarks are only valid up to imagededup v0.2.2. Later releases have significant changes to all methods, so the current benchmarks may not hold.

Detailed benchmarks on speed and classification metrics for the different methods are provided in the documentation. Generally speaking, the following conclusions can be drawn:

  • CNN works best for near duplicates and datasets containing transformations.
  • All deduplication methods fare well on datasets containing exact duplicates, but Difference hashing is the fastest.
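
As a rough illustration of why difference hashing is cheap, the sketch below computes a difference hash over a tiny grayscale grid and compares hashes by Hamming distance. It is a simplified stand-in and not imagededup's implementation: real implementations first resize the image to a small fixed grid, which is skipped here, and the function names and pixel values are assumptions.

```python
from typing import List


def dhash_bits(pixels: List[List[int]]) -> int:
    """Difference hash of a grayscale grid: one bit per horizontal neighbor pair.

    Simplified illustration; not imagededup's implementation.
    """
    bits = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):
            # Set the bit when brightness increases from left to right.
            bits = (bits << 1) | (1 if right > left else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


# Made-up 3x3 grayscale grids.
original = [[10, 20, 30], [30, 20, 10], [5, 5, 9]]
brighter = [[v + 40 for v in row] for row in original]  # uniform brightness shift
different = [[30, 20, 10], [10, 20, 30], [9, 5, 5]]

print(hamming_distance(dhash_bits(original), dhash_bits(brighter)))
print(hamming_distance(dhash_bits(original), dhash_bits(different)))
```

Because the hash only records whether each pixel is brighter than its left neighbor, a uniform brightness shift leaves it unchanged, while a structurally different image yields a large Hamming distance. The whole computation is a single pass of comparisons, which is consistent with hashing being fast on exact duplicates.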

🤝 Contribute

We welcome all kinds of contributions. See the Contribution guide for more details.

📝 Citation

Please cite Imagededup in your publications if it is useful for your research. Here is an example BibTeX entry:

@misc{idealods2019imagededup,
  title={Imagededup},
  author={Tanuj Jain and Christopher Lennan and Zubin John and Dat Tran},
  year={2019},
  howpublished={\url{https://github.com/idealo/imagededup}},
}

🏗 Maintainers

© Copyright

See LICENSE for details.
