imgdupes
imgdupes
is a command line tool for checking and deleting near-duplicate images based on perceptual hash from the target directory.
Images by Caltech 101 dataset that semi-deduped for demonstration.
It is better to pre-deduplicate identical images with fdupes
or jdupes
in advance.
Then, you can check and delete near-duplicate images using imgdupes
with an operation similar to the fdupes
command.
For large dataset
It is possible to speed up dedupe process by approximate nearest neighbor search of hamming distance using NGT or hnsw. See Against large dataset section for details.
Install
To install, simply use pip:
$ pip install imgdupes
Usage
The following example is sample command to find sets of near-duplicate images with Hamming distance of phash less than 4 from the target directory.
To search images recursively from the target directory, add -r
or --recursive
option.
$ imgdupes --recursive target_dir phash 4
target_dir/airplane_0583.jpg
target_dir/airplane_0800.jpg
target_dir/watch_0122.jpg
target_dir/watch_0121.jpg
By default, imgdupes displays a list of duplicate images list and exits.
To display preserve or delete images prompt, use the -d
or --delete
option.
If you are using iTerm 2, you can display a set of images on the terminal with the -c
or --imgcat
option.
$ imgdupes --recursive --delete --imgcat 101_ObjectCategories phash 4
The set of images are sorted in ascending order of file size and displayed together with the pixel size of the image, you can choose which image to preserve.
With -N
or --noprompt
option, you can preserve the first file in each set of duplicates and delete the rest without prompting.
$ imgdupes -rdN 101_ObjectCategories phash 0
To take input from a list of files
Use --files-from
or -T
option to take input from a list of files.
$ imgdupes -T image_list.txt phash 0
For example, create image_list.txt
as below.
101_ObjectCategories/Faces/image_0345.jpg
101_ObjectCategories/Motorbikes/image_0269.jpg
101_ObjectCategories/Motorbikes/image_0735.jpg
101_ObjectCategories/brain/image_0047.jpg
101_ObjectCategories/headphone/image_0034.jpg
101_ObjectCategories/dollar_bill/image_0038.jpg
101_ObjectCategories/ferry/image_0020.jpg
101_ObjectCategories/tick/image_0049.jpg
101_ObjectCategories/Faces_easy/image_0283.jpg
101_ObjectCategories/watch/image_0171.jpg
Find near-duplicated images from an image you specified
Use --query
option to specify a query image file.
$ imgdupes --recursive target_dir --query target_dir/airplane_0583.jpg phash 4
Query: sample_airplane.png
target_dir/airplane_0583.jpg
target_dir/airplane_0800.jpg
Against large dataset
imgdupes
supports approximate nearest neighbor search of hamming distance using NGT or hnsw.
To dedupe images using NGT, run with --ngt
option after installing NGT and python binding.
$ imgdupes -rdc --ngt 101_ObjectCategories phash 4
Notice: --ngt
option is enabled by default from version 0.1.0.
For instructions on installing NGT and python binding, see NGT and python NGT.
To dedupe images using hnsw, run with --hnsw
option after installing hnsw python binding.
$ imgdupes -rdc --hnsw 101_ObjectCategories phash 4
Fast exact searching
imgdupes
supports exact nearest neighbor search of hamming distance using faiss (IndexFlatL2).
To dedupe images using faiss, run with --faiss-flat
option after installing faiss python binding.
$ imgdupes -rdc --faiss-flat 101_ObjectCategories phash 4
Using imgdupes without installing it with docker
You can use imgdupes
without installing it using a pre-build docker container image.
NGT, hnsw and faiss are already installed in this image.
Place the target directory in the current directory and execute the following command.
$ docker run -it -v $PWD:/app knjcode/imgdupes -rdc target_dir phash 0
When docker run, current directory is mounted inside the container and referenced from imgdupes.
By aliasing the command, you can use imgdupes
as installed.
$ alias imgdupes="docker run -it -v $PWD:/app knjcode/imgdupes"
$ imgdupes -rdc target_dir phash 0
To upgrade imgdupes docker image, you can pull the docker image as below.
$ docker pull knjcode/imgdupes
Available hash algorithm
imgdupes
uses the ImageHash to calculate perceptual hash (except for phash_org
algorithm).
-
ahash: average hashing
-
phash: perception hashing (using only the 8x8 DCT low-frequency values including the first term)
-
dhash: difference hashing
-
whash: wavelet hashing
-
phash_org: perception hashing (fix algorithm from ImageHash implementation)
using only the 8x8 DCT low-frequency values and excluding the first term since the DC coefficient can be significantly different from the other values and will throw off the average.
Options
-r
--recursive
search images recursively from the target directory (default=False)
-d
--delete
prompt user for files to preserve and delete (default=False)
-c
--imgcat
display duplicate images for iTerm2 (default=False)
-m
--summarize
summarize dupe information
-N
--noprompt
together with --delete
, preserve the first file in each set of duplicates and delete the rest without prompting the user
--query <image filename>
find image files that are duplicated or similar to the specified image file from the target directory
--hash-bits 64
bits of perceptual hash (default=64)
The number of bits specifies the value that is the square of n.
For example, you can specify 64(8^2), 144(12^2), 256(16^2), etc.
--sort <sort_type>
how to sort duplicate image files (default=filesize)
You can specify following types:
filesize
: sort by filesize in descending orderfilepath
: sort by filepath in ascending orderimagesize
: sort by pixel width and height in descenging orderwidth
: sort by pixel width in descending orderheight
: sort by pixel height in descending ordernone
: do not sort
--reverse
reverse sort order
--num-proc 4
number of hash calculation and ngt processes (default=cpu_count-1)
--log
output logs of duplicate and delete files (default=False)
--no-cache
not create or use image hash cache (default=False)
--no-subdir-warning
stop warnings that appear when similar images are in different subdirectories
--sameline
list each set of matches on a single line
--dry-run
dry run (do not delete any files)
--faiss-flat
use faiss exact search (IndexFlatL2) for calculating Hamming distance between hash of images (default=False)
--faiss-flat-k 20
number of searched objects when using faiss-flat (default=20)
use with imgcat (-c
, --imgcat
) options
--size 256x256
resize image (default=256x256)
--space 0
space between images (default=0)
--space-color black
space color between images (default=black)
--tile-num 4
horizontal tile number (default=4)
--interpolation INTER_LINEAR
interpolation methods (default=INTER_LINEAR)
You can specify OpenCV interpolation methods: INTER_NEAREST, INTER_LINEAR, INTER_AREA, INTER_CUBIC, INTER_LANCZOS4, etc.
--no-keep-aspect
do not keep aspect when displaying images
ngt options
--ngt
use NGT for calculating Hamming distance between hash of images (default=True)
--ngt-k 20
number of searched objects when using NGT. Increasing this value, improves accuracy and increases computation time. (default=20)
--ngt-epsilon 0.1
search range when using NGT. Increasing this value, improves accuracy and increases computation time. (default=0.1)
--ngt-edges 10
number of initial edges of each node at graph generation time. (default=10)
--ngt-edges-for-search 40
number of edges at search time. (default=40)
hnsw options
--hnsw
use hnsw for calculating Hamming distance between hash of images (default=False)
--hnsw-k 20
number of searched objects when using hnsw. Increasing this value, improves accuracy and increases computation time. (default=20)
--hnsw-ef-construction 100
controls index search speed/build speed tradeoff (default=100)
--hnsw-m 16
m is tightly connected with internal dimensionality of the data stronlgy affects the memory consumption (default=16)
--hnsw-ef 50
controls recall. higher ef leads to better accuracy, but slower search (default=50)
faiss options
--faiss-cuda
uses CUDA enabled device for faster searching (requires faiss-gpu, Nvidia GPU, and CUDA toolkit)
Install: https://github.com/facebookresearch/faiss/blob/master/INSTALL.md
General: https://github.com/facebookresearch/faiss/wiki/Faiss-on-the-GPU
CUDA options
--cuda-device
uses the specific CUDA device passed for CUDA accelerated searches (default=device with lowest load)
NOTE: if the device passed is not found on the system the CUDA device with the lowest load will be used
License
MIT