Identifying and removing near-duplicate images using perceptual hashing.

imgdupes

imgdupes is a command-line tool that finds and deletes near-duplicate images in a target directory, based on perceptual hashes.

The demonstration images are from the Caltech 101 dataset, semi-deduplicated for the demo.

It is best to remove exactly identical images first with fdupes or jdupes.
You can then check and delete near-duplicate images with imgdupes, which works much like the fdupes command.

For large datasets

You can speed up deduplication with an approximate nearest neighbor search over Hamming distance using NGT or hnsw. See the "Against large dataset" section for details.

Install

To install, simply use pip:

$ pip install imgdupes

Usage

The following sample command finds sets of near-duplicate images in the target directory whose phash Hamming distance is less than 4.
To search for images recursively from the target directory, add the -r or --recursive option.

$ imgdupes --recursive target_dir phash 4
target_dir/airplane_0583.jpg
target_dir/airplane_0800.jpg

target_dir/watch_0122.jpg
target_dir/watch_0121.jpg
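
To make the threshold concrete, here is a minimal sketch of a Hamming distance check between two hashes. The hash values and the helper name `hamming_distance` are made up for illustration, and the strict "less than" comparison follows the wording above rather than the imgdupes source.

```python
def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Count the bit positions where two equal-length hashes differ."""
    return bin(hash_a ^ hash_b).count("1")


# Two made-up 64-bit hash values that differ in exactly three bits.
h1 = 0xF0F0F0F0F0F0F0F0
h2 = h1 ^ 0b10000101

threshold = 4  # the last argument of the command above
is_near_duplicate = hamming_distance(h1, h2) < threshold
```

A threshold of 0, as used in some later examples, would under this reading only match images whose hashes are identical.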

By default, imgdupes displays a list of duplicate images and exits.
To be prompted for which images to preserve or delete, use the -d or --delete option.

If you are using iTerm2, you can display each set of images in the terminal with the -c or --imgcat option.

$ imgdupes --recursive --delete --imgcat 101_ObjectCategories phash 4

Each set of images is sorted by file size and displayed together with the pixel size of each image, so you can choose which image to preserve.

With the -N or --noprompt option, imgdupes preserves the first file in each set of duplicates and deletes the rest without prompting.

$ imgdupes -rdN 101_ObjectCategories phash 0
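
The "keep the first, delete the rest" rule can be sketched as follows. This is only an illustration of the rule as described here, not the imgdupes implementation; `plan_noprompt` is a hypothetical helper and it only builds lists, it does not touch any files.

```python
def plan_noprompt(duplicate_sets):
    """Split each duplicate set into one file to keep and the files to delete."""
    keep, delete = [], []
    for group in duplicate_sets:
        if not group:
            continue  # nothing to decide for an empty set
        keep.append(group[0])     # the first file in the set is preserved
        delete.extend(group[1:])  # every other file in the set is removed
    return keep, delete


# File names taken from the sample output earlier in this README.
sets = [
    ["target_dir/airplane_0583.jpg", "target_dir/airplane_0800.jpg"],
    ["target_dir/watch_0122.jpg", "target_dir/watch_0121.jpg"],
]
keep, delete = plan_noprompt(sets)
```

Because the first file is the one preserved, the sort order of each set decides which file survives.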

To take input from a list of files

Use the --files-from or -T option to take input from a list of files.

$ imgdupes -T image_list.txt phash 0

For example, create image_list.txt with one image path per line, as below.

101_ObjectCategories/Faces/image_0345.jpg
101_ObjectCategories/Motorbikes/image_0269.jpg
101_ObjectCategories/Motorbikes/image_0735.jpg
101_ObjectCategories/brain/image_0047.jpg
101_ObjectCategories/headphone/image_0034.jpg
101_ObjectCategories/dollar_bill/image_0038.jpg
101_ObjectCategories/ferry/image_0020.jpg
101_ObjectCategories/tick/image_0049.jpg
101_ObjectCategories/Faces_easy/image_0283.jpg
101_ObjectCategories/watch/image_0171.jpg
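
If you prefer to generate such a list programmatically, a small script along these lines may help. It is a sketch, assuming one path per line as in the example above; the function name `write_image_list` and the set of file suffixes are illustrative choices, not part of imgdupes.

```python
from pathlib import Path

# Assumed set of image suffixes; adjust to match your data.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def write_image_list(root: str, list_path: str) -> int:
    """Write one image path per line and return how many paths were written."""
    paths = sorted(
        str(p)
        for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
    Path(list_path).write_text("".join(f"{p}\n" for p in paths), encoding="utf-8")
    return len(paths)


# Example: write_image_list("101_ObjectCategories", "image_list.txt")
```

The resulting file can then be passed to imgdupes with -T.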

Find near-duplicates of a specified image

Use the --query option to specify a query image file.

$ imgdupes --recursive target_dir --query sample_airplane.png phash 4
Query: sample_airplane.png

target_dir/airplane_0583.jpg
target_dir/airplane_0800.jpg

Against large dataset

imgdupes supports approximate nearest neighbor search over Hamming distance using NGT or hnsw.

To dedupe images using NGT, install NGT and its Python binding, then run imgdupes with the --ngt option.

$ imgdupes -rdc --ngt 101_ObjectCategories phash 4

Note: the --ngt option is enabled by default since version 0.1.0.

For instructions on installing NGT and its Python binding, see NGT and python NGT.

To dedupe images using hnsw, install the hnsw Python binding, then run imgdupes with the --hnsw option.

$ imgdupes -rdc --hnsw 101_ObjectCategories phash 4

Fast exact searching

imgdupes supports exact nearest neighbor search over Hamming distance using faiss (IndexFlatL2).

To dedupe images using faiss, install the faiss Python binding, then run imgdupes with the --faiss-flat option.

$ imgdupes -rdc --faiss-flat 101_ObjectCategories phash 4
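
One reason an L2 index can stand in for Hamming distance: for vectors containing only 0s and 1s, the squared Euclidean distance equals the Hamming distance. The sketch below checks this on two made-up bit vectors in plain Python. It does not call faiss, and how imgdupes actually encodes hashes for faiss is an assumption here.

```python
def hamming(a, b):
    """Number of positions where two bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Made-up 8-bit hashes written out as 0/1 vectors.
bits_a = [1, 0, 1, 1, 0, 0, 1, 0]
bits_b = [1, 1, 1, 0, 0, 0, 0, 0]

# Each differing position contributes (1 - 0)^2 = 1, so the two measures agree.
same = hamming(bits_a, bits_b) == squared_l2(bits_a, bits_b)
```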

Using imgdupes with Docker, without installing it

You can use imgdupes without installing it by running a pre-built Docker container image.
NGT, hnsw and faiss are already installed in this image.

Place the target directory in the current directory and execute the following command.

$ docker run -it -v $PWD:/app knjcode/imgdupes -rdc target_dir phash 0

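With this docker run command, the current directory is mounted inside the container, where imgdupes can access it.

By aliasing the command, you can use imgdupes as if it were installed. Use single quotes so that $PWD is expanded each time the alias runs, not once when the alias is defined.

$ alias imgdupes='docker run -it -v "$PWD":/app knjcode/imgdupes'
$ imgdupes -rdc target_dir phash 0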

To upgrade the imgdupes Docker image, pull the latest image as below.

$ docker pull knjcode/imgdupes

Available hash algorithms

imgdupes uses the ImageHash library to calculate perceptual hashes (except for the phash_org algorithm).

  • ahash: average hashing

  • phash: perceptual hashing (using only the 8x8 DCT low-frequency values, including the first term)

  • dhash: difference hashing

  • whash: wavelet hashing

  • phash_org: perceptual hashing (a modification of the ImageHash implementation)

    It uses only the 8x8 DCT low-frequency values and excludes the first term, because the DC coefficient can differ significantly from the other values and would throw off the average.
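
The effect of the first (DC) term on the average can be seen in a toy example. The coefficients below are made up, and thresholding against the mean follows the wording of this README; it is a sketch and not the code of ImageHash or imgdupes, whose thresholding details may differ.

```python
from statistics import mean


def threshold_bits(coeffs, exclude_dc):
    """Turn a flat list of DCT coefficients into bits by comparing each to a mean."""
    # Optionally leave the first (DC) coefficient out of the reference mean.
    reference = coeffs[1:] if exclude_dc else coeffs
    avg = mean(reference)
    return [1 if c > avg else 0 for c in coeffs]


# Made-up coefficients: one large DC term followed by small AC terms.
coeffs = [1000.0, 4.0, -3.0, 2.0, -1.0, 0.5, -0.5, 3.0]

with_dc = threshold_bits(coeffs, exclude_dc=False)    # mean is pulled up to 125.625
without_dc = threshold_bits(coeffs, exclude_dc=True)  # mean of AC terms is about 0.71
```

With the DC term included, every AC coefficient falls below the mean and the bits carry almost no information; excluding it lets the bits reflect the AC terms.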

Options

-r --recursive

search images recursively from the target directory (default=False)

-d --delete

prompt user for files to preserve and delete (default=False)

-c --imgcat

display duplicate images for iTerm2 (default=False)

-m --summarize

summarize dupe information

-N --noprompt

together with --delete, preserve the first file in each set of duplicates and delete the rest without prompting the user

--query <image filename>

find image files in the target directory that are duplicates of, or similar to, the specified image file

--hash-bits 64

bits of perceptual hash (default=64)

The number of bits must be a perfect square (n^2).
For example, you can specify 64 (8^2), 144 (12^2), 256 (16^2), etc.
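
A quick way to check a candidate value before passing it to --hash-bits is sketched below. The helper name `hash_side_length` is hypothetical, and how imgdupes itself validates the value is not shown here.

```python
import math


def hash_side_length(hash_bits: int) -> int:
    """Return n such that n * n == hash_bits, or raise if no such n exists."""
    if hash_bits <= 0:
        raise ValueError("hash_bits must be positive")
    side = math.isqrt(hash_bits)
    if side * side != hash_bits:
        raise ValueError(f"{hash_bits} is not a perfect square")
    return side


# The values from the examples above map to 8, 12, and 16.
sides = [hash_side_length(bits) for bits in (64, 144, 256)]
```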

--sort <sort_type>

how to sort duplicate image files (default=filesize)

You can specify the following types:

  • filesize: sort by filesize in descending order
  • filepath: sort by filepath in ascending order
  • imagesize: sort by pixel width and height in descending order
  • width: sort by pixel width in descending order
  • height: sort by pixel height in descending order
  • none: do not sort

--reverse

reverse sort order

--num-proc 4

number of processes used for hash calculation and NGT (default=cpu_count-1)

--log

output logs of duplicate and deleted files (default=False)

--no-cache

do not create or use the image hash cache (default=False)

--no-subdir-warning

suppress the warnings that appear when similar images are in different subdirectories

--sameline

list each set of matches on a single line

--dry-run

dry run (do not delete any files)

--faiss-flat

use faiss exact search (IndexFlatL2) for calculating Hamming distance between image hashes (default=False)

--faiss-flat-k 20

number of searched objects when using faiss-flat (default=20)

Options for use with imgcat (-c, --imgcat)

--size 256x256

resize image (default=256x256)

--space 0

space between images (default=0)

--space-color black

space color between images (default=black)

--tile-num 4

horizontal tile number (default=4)

--interpolation INTER_LINEAR

interpolation method (default=INTER_LINEAR)

You can specify OpenCV interpolation methods: INTER_NEAREST, INTER_LINEAR, INTER_AREA, INTER_CUBIC, INTER_LANCZOS4, etc.

--no-keep-aspect

do not keep the aspect ratio when displaying images

ngt options

--ngt

use NGT for calculating Hamming distance between image hashes (default=True)

--ngt-k 20

number of searched objects when using NGT. Increasing this value improves accuracy but increases computation time. (default=20)

--ngt-epsilon 0.1

search range when using NGT. Increasing this value improves accuracy but increases computation time. (default=0.1)

--ngt-edges 10

number of initial edges of each node at graph generation time. (default=10)

--ngt-edges-for-search 40

number of edges at search time. (default=40)

hnsw options

--hnsw

use hnsw for calculating Hamming distance between image hashes (default=False)

--hnsw-k 20

number of searched objects when using hnsw. Increasing this value improves accuracy but increases computation time. (default=20)

--hnsw-ef-construction 100

controls the tradeoff between index search speed and build speed (default=100)

--hnsw-m 16

m is tightly connected with the internal dimensionality of the data and strongly affects memory consumption (default=16)

--hnsw-ef 50

controls recall. A higher ef leads to better accuracy but slower search (default=50)

faiss options

--faiss-cuda

use a CUDA-enabled device for faster searching (requires faiss-gpu, an NVIDIA GPU, and the CUDA toolkit)
Install: https://github.com/facebookresearch/faiss/blob/master/INSTALL.md
General: https://github.com/facebookresearch/faiss/wiki/Faiss-on-the-GPU

CUDA options

--cuda-device

use the specified CUDA device for CUDA-accelerated searches (default=device with lowest load)
NOTE: if the specified device is not found on the system, the CUDA device with the lowest load is used

License

MIT
