CLI utility to find near duplicate images and remove all but the best copy.

py-image-dedup

py-image-dedup is a tool to sort out or remove duplicates within a photo library. Unlike most other solutions, py-image-dedup intentionally uses an approximate image comparison to also detect duplicates of images that slightly differ in resolution, color or other minor details.

It is built upon Image-Match, a very popular library that computes a pHash for an image and stores the result in an elasticsearch backend for very high scalability.

How it works

Phase 1 - Database cleanup

In the first phase the elasticsearch backend is checked against the current filesystem state, cleaning up database entries of files that no longer exist. This will speed up queries made later on.
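
The cleanup idea can be sketched as follows. This is only an illustration under stated assumptions: the database is modelled as a plain dict mapping file paths to stored metadata, and `cleanup_missing_entries` is a hypothetical name, not part of py-image-dedup or the elasticsearch client.

```python
import os


def cleanup_missing_entries(database: dict) -> list:
    """Remove entries whose file no longer exists and return the removed paths.

    `database` is a hypothetical stand-in for the elasticsearch index:
    a plain dict mapping file path -> stored metadata.
    """
    # Collect first, then delete, so the dict is not modified while iterating.
    missing = [path for path in database if not os.path.exists(path)]
    for path in missing:
        del database[path]
    return missing
```

With fewer stale entries left in the index, the similarity queries in the later phases have less data to search.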

Phase 2 - Counting files

Although not necessary for the deduplication process, it is very convenient to have some kind of progress indication while deduplication is running. To provide that, the available files must be counted beforehand.
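
A minimal sketch of such a counting pass, assuming files are selected by extension. The function name `count_files` and its parameters are hypothetical. The default extensions are the ones used in the Docker example further down.

```python
import os


def count_files(root: str,
                extensions: tuple = (".png", ".jpg", ".jpeg"),
                recursive: bool = True) -> int:
    """Count files below `root` whose name ends with one of `extensions`."""
    count = 0
    for _dirpath, dirnames, filenames in os.walk(root):
        count += sum(1 for name in filenames if name.lower().endswith(extensions))
        if not recursive:
            dirnames.clear()  # stop os.walk from descending into subfolders
    return count
```

The resulting number can then serve as the total for a progress bar during the analysis phase.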

Phase 3 - Analysing files

In this phase every image file is analysed. This means generating a signature (pHash) to quickly compare it to other images and adding other metadata of the image to the elasticsearch backend that is used in the next phase.

This phase is quite CPU intensive and the first run can take quite some time. Using as many threads as feasible (using the -t parameter) is advised to get the best performance.

Since the database might already contain an entry for a file from a previous run, the file's modification time is compared to the stored one before the file is analysed. If the database entry still seems to be correct, the signature for this file is not recalculated. Because of this, subsequent runs will be much faster. Some file access still has to happen though, so subsequent runs are probably limited by that.
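
The modification time check could look roughly like this. It is a sketch, not the actual implementation: the dict-based `database`, its `"mtime"` field and the function name `needs_analysis` are assumptions made for illustration.

```python
import os


def needs_analysis(path: str, database: dict) -> bool:
    """Return True if `path` has no entry yet or was modified since it was stored.

    `database` is a hypothetical dict mapping file path -> {"mtime": float}.
    """
    entry = database.get(path)
    if entry is None:
        return True  # never analysed before
    # Reading the modification time is the file access that remains on later runs.
    return os.path.getmtime(path) != entry["mtime"]
```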

Phase 4 - Finding duplicates

Every file is now processed again - but only by means of querying the database backend for similar images (within the given max_dist). If there are images found that match the similarity criteria they are considered duplicate candidates. All candidates are then ordered according to the prioritization_rules, which you can specify yourself in the configuration, see Configuration.

If you do not specify prioritization_rules yourself, the following order will be used:

  1. pixel count (more is better)
  2. EXIF data (more exif data is better)
  3. file size (bigger is better)
  4. file modification time (newer is better)
  5. distance (lower is better)
  6. filename contains "copy" (False is better)
  7. filename length (longer is better) - (for "edited" versions)
  8. parent folder path length (shorter is better)
  9. score (higher is better)

The first candidate in the resulting list is considered to be the best available version of all candidates.
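
The default order above can be expressed as a single sort key. The following is a sketch under assumptions: the `Candidate` class, its field names and the helper functions are hypothetical and only mirror the nine criteria listed above. The check for "copy" is case-sensitive here, which is also an assumption.

```python
import os
from dataclasses import dataclass


@dataclass
class Candidate:
    path: str
    pixel_count: int
    exif_tag_count: int
    file_size: int
    mtime: float
    distance: float
    score: float


def priority_key(c: Candidate) -> tuple:
    """Sort key mirroring the default order; smaller tuples sort first."""
    folder, filename = os.path.split(c.path)
    return (
        -c.pixel_count,      # 1. more pixels is better
        -c.exif_tag_count,   # 2. more EXIF data is better
        -c.file_size,        # 3. bigger is better
        -c.mtime,            # 4. newer is better
        c.distance,          # 5. lower distance is better
        "copy" in filename,  # 6. False sorts before True
        -len(filename),      # 7. longer filename is better
        len(folder),         # 8. shorter parent folder path is better
        -c.score,            # 9. higher score is better
    )


def order_candidates(candidates: list) -> list:
    """Return the candidates ordered from best to worst."""
    return sorted(candidates, key=priority_key)
```

Because tuples are compared element by element, a later criterion only matters when all earlier ones are tied.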

Phase 5 - Moving/Deleting duplicates

All but the best version of duplicate candidates identified in the previous phase are now deleted from the file system (if you didn't specify --dry-run of course).

If duplicates_target_directory is set, the specified folder will be used as a root directory to move duplicates to, instead of deleting them, replicating their original folder structure.
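
One possible way to replicate the folder structure is to keep the path relative to the source directory. This is an assumption for illustration: the tool may derive the target path differently, and `duplicate_target_path` is a hypothetical helper.

```python
import os


def duplicate_target_path(file_path: str, source_root: str, target_root: str) -> str:
    """Map `file_path` below `source_root` to the same relative location below `target_root`."""
    relative = os.path.relpath(file_path, start=source_root)
    return os.path.join(target_root, relative)
```

With the directories from the Docker example, a file at /data/in/2019/a.jpg would map to /data/out/2019/a.jpg under this scheme.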

Phase 6 - Removing empty folders (Optional)

In the last phase, folders that are empty due to the deduplication process are deleted, cleaning up the directory structure (if turned on in configuration).
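
A simplified sketch of this step: it removes every empty folder below a root directory, whereas the tool only targets folders emptied by the deduplication. The function name `remove_empty_folders` is hypothetical.

```python
import os


def remove_empty_folders(root: str) -> list:
    """Delete empty directories below `root` (never `root` itself); return removed paths."""
    removed = []
    # Walk bottom-up so a parent that only contained empty folders is visited after them.
    for dirpath, _dirnames, _filenames in os.walk(root, topdown=False):
        if dirpath != root and not os.listdir(dirpath):
            os.rmdir(dirpath)
            removed.append(dirpath)
    return removed
```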

How to use

Install

Install py-image-dedup using pip:

pip3 install py-image-dedup

Configuration

py-image-dedup uses container-app-conf to provide configuration via a YAML file as well as ENV variables. A reference config is generated on startup. Have a look at the container-app-conf documentation for details.

See py_image_dedup_reference.yaml for an example in this repo.

Setup elasticsearch backend

Since this library is based on Image-Match you need a running elasticsearch instance for efficient storing and querying of image signatures.

Elasticsearch version

This library requires elasticsearch version 5 or later. Sadly the Image-Match library still specifies version 2, so a fork of the original project is used instead. This fork is maintained by me, and any contributions are very much appreciated.

Set up the index

py-image-dedup uses a single index (called images by default). When configured, this index will be created automatically for you.

Command line usage

py-image-dedup can be used from the command line like this:

py-image-dedup deduplicate --help

Have a look at the help output to see how you can customize it.

Daemon

CAUTION! This feature is still very much a work in progress. Always have a backup of your data!

py-image-dedup has a built-in daemon that allows you to continuously monitor your source directories and deduplicate them on the fly.

When running the daemon, a prometheus reporter is used (if enabled in the configuration) to let you gather some statistical insights.

py-image-dedup daemon

Dry run

To analyze images and get an overview of which images would be deleted, be sure to make a dry run first.

py-image-dedup deduplicate --dry-run

FreeBSD

If you want to run this on a FreeBSD host make sure you have an up to date release that is able to install ports.

Since Image-Match does a lot of math it relies on numpy and scipy. To get those working on FreeBSD you have to install them as a port:

pkg install pkgconf
pkg install py38-numpy
pkg install py27-scipy

For .png support you also need to install

pkg install png

I still ran into issues after installing all these and just threw those two in the mix and it finally worked:

pkg install freetype
pkg install py27-matplotlib  # this has a LOT of dependencies

Encoding issues

When using the python library click on FreeBSD you might run into encoding issues. To mitigate this, change your locale from ASCII to UTF-8 if possible.

This can be achieved, for example, by creating a file ~/.login_conf with the following content:

me:\
	:charset=UTF-8:\
	:lang=de_DE.UTF-8:

Docker

To run py-image-dedup using docker you can use the markusressel/py-image-dedup image from DockerHub:

sudo docker run -t \
    -p 8000:8000 \
    -v /where/the/original/photolibrary/is/located:/data/in \
    -v /where/duplicates/should/be/moved/to:/data/out \
    -e PY_IMAGE_DEDUP_DRY_RUN=False \
    -e PY_IMAGE_DEDUP_ANALYSIS_SOURCE_DIRECTORIES=/data/in/ \
    -e PY_IMAGE_DEDUP_ANALYSIS_RECURSIVE=True \
    -e PY_IMAGE_DEDUP_ANALYSIS_ACROSS_DIRS=True \
    -e PY_IMAGE_DEDUP_ANALYSIS_FILE_EXTENSIONS=.png,.jpg,.jpeg \
    -e PY_IMAGE_DEDUP_ANALYSIS_THREADS=8 \
    -e PY_IMAGE_DEDUP_ANALYSIS_USE_EXIF_DATA=True \
    -e PY_IMAGE_DEDUP_DEDUPLICATION_DUPLICATES_TARGET_DIRECTORY=/data/out/ \
    -e PY_IMAGE_DEDUP_ELASTICSEARCH_AUTO_CREATE_INDEX=True \
    -e PY_IMAGE_DEDUP_ELASTICSEARCH_HOST=elasticsearch \
    -e PY_IMAGE_DEDUP_ELASTICSEARCH_PORT=9200 \
    -e PY_IMAGE_DEDUP_ELASTICSEARCH_INDEX=images \
    -e PY_IMAGE_DEDUP_ELASTICSEARCH_MAX_DISTANCE=0.1 \
    -e PY_IMAGE_DEDUP_REMOVE_EMPTY_FOLDERS=False \
    -e PY_IMAGE_DEDUP_STATS_ENABLED=True \
    -e PY_IMAGE_DEDUP_STATS_PORT=8000 \
    markusressel/py-image-dedup:latest

Since an elasticsearch instance is required as well, you can also use the docker-compose.yml file included in this repo, which additionally sets up a single-node elasticsearch cluster:

sudo docker-compose up

UID and GID

To run py-image-dedup inside the container using a specific user id and group id you can use the env variables PUID=1000 and PGID=1000.

Contributing

GitHub is for social coding: if you want to write code, I encourage contributions through pull requests from forks of this repository. Create GitHub tickets for bugs and new features and comment on the ones that you are interested in.

License

py-image-dedup by Markus Ressel
Copyright (C) 2018  Markus Ressel

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program.  If not, see <http://www.gnu.org/licenses/>.
