  • Stars: 168
  • Rank 218,469 (Top 5 %)
  • Language: Rust
  • Created about 7 years ago
  • Updated about 1 year ago

Repository Details

A fast file deduplicator

Dupe krill: a fast file deduplicator

Replaces files that have identical content with hardlinks, so that file data of all copies is stored only once, saving disk space. Useful for reducing sizes of multiple backups, messy collections of photos and music, countless copies of node_modules, macOS app bundles, and anything else that's usually immutable (since all hardlinked copies of a file will change when any one of them is changed).

Features

  • It's very fast and reasonably memory-efficient.
  • Deduplicates incrementally as soon as duplicates are found.
  • Replaces files atomically and it's safe to interrupt at any time.
  • Proven to be reliable. Used for years without an issue.
  • It's aware of existing hardlinks and supports merging of multiple groups of hardlinks.
  • Gracefully handles symlinks and special files.
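The atomic replacement mentioned above can be pictured with a minimal sketch. It is not taken from the dupe-krill source: the function name replace_with_hardlink and the ".dedupe-tmp" suffix are hypothetical, and it assumes a POSIX system where rename() swaps the destination in a single step.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Replace `duplicate` with a hardlink to `original` (illustrative sketch).
///
/// The link is first created under a temporary name in the same directory
/// and then renamed over `duplicate`. On POSIX systems rename() replaces the
/// destination in one step, so `duplicate` never disappears, even if the
/// process is interrupted halfway through.
pub fn replace_with_hardlink(original: &Path, duplicate: &Path) -> io::Result<()> {
    // Hypothetical temporary name; a real tool would pick a collision-proof one.
    let mut tmp_name = duplicate
        .file_name()
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "path has no file name"))?
        .to_os_string();
    tmp_name.push(".dedupe-tmp");
    let tmp = duplicate.with_file_name(tmp_name);

    // Step 1: a second directory entry for the original's data.
    fs::hard_link(original, &tmp)?;

    // Step 2: move the new entry over the duplicate.
    let renamed = fs::rename(&tmp, duplicate);

    // If the rename failed, or if both names already pointed at the same file
    // (in which case POSIX rename() does nothing), the temporary entry is
    // still there. Removing it is harmless when it is already gone.
    let _ = fs::remove_file(&tmp);
    renamed
}
```

Calling replace_with_hardlink(Path::new("a/photo.jpg"), Path::new("b/photo.jpg")) leaves both paths readable at every moment, and afterwards both point at the same data.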

Usage

Download binaries from the releases page.

Works on macOS and Linux. Windows is not supported.

If you have the latest stable Rust (1.42+), build the program with either cargo install dupe-krill or clone this repo and cargo build --release.

dupe-krill -d <files or directories> # find dupes without doing anything
dupe-krill <files or directories> # find and replace with hardlinks

See dupe-krill -h for details.

Output

It prints one duplicate per line. It prints both paths on the same line with the difference between them highlighted as {first => second}.
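As an illustration of the {first => second} notation, here is one plausible way to build such a line. This is an assumption about the format, not the program's actual code: the helper combined_paths is hypothetical and simply factors out the longest common prefix and suffix of the two paths.

```rust
/// Merge two paths into one string, wrapping only the differing middle part
/// in `{first => second}` (hypothetical helper, for illustration only).
pub fn combined_paths(first: &str, second: &str) -> String {
    // Work on chars so multi-byte UTF-8 sequences are never split.
    let a: Vec<char> = first.chars().collect();
    let b: Vec<char> = second.chars().collect();

    // Length of the shared beginning.
    let prefix = a.iter().zip(&b).take_while(|(x, y)| x == y).count();

    // Length of the shared ending, not allowed to overlap the prefix.
    let max_suffix = a.len().min(b.len()) - prefix;
    let suffix = a
        .iter()
        .rev()
        .zip(b.iter().rev())
        .take(max_suffix)
        .take_while(|(x, y)| x == y)
        .count();

    let text = |chars: &[char]| chars.iter().collect::<String>();
    format!(
        "{}{{{} => {}}}{}",
        text(&a[..prefix]),
        text(&a[prefix..a.len() - suffix]),
        text(&b[prefix..b.len() - suffix]),
        text(&a[a.len() - suffix..]),
    )
}
```

With this sketch, combined_paths("backup1/photo.jpg", "backup2/photo.jpg") yields backup{1 => 2}/photo.jpg.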

Progress shows:

<number unique file bodies>+<number of hardlinks> dupes. <files checked>+<files skipped> files scanned.

Symlinks, special device files, and 0-sized files are always skipped.

Don't try to parse the program's usual output. Add the --json option if you want machine-readable output. You can also use this program as a Rust library for seamless integration.

How does hardlinking work?

Files are deduplicated by making a hardlink. They're not deleted. Instead, literally the same file will exist in two or more directories at once. Unlike symlinks, hardlinks behave like real files. Deleting one of the hardlinks leaves the other hardlinks unchanged. Editing a hardlinked file edits it in all places at once (except in some applications that delete and create a new file instead of overwriting the existing one). Hardlinking will make all duplicates of a file have the same file permissions.
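The behaviour described above can be observed with a small experiment using only Rust's standard library on a Unix system. The function names is_same_file and hardlink_demo and the file names are made up for this illustration.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::MetadataExt;
use std::path::Path;

/// True if both paths are directory entries for the same file
/// (same device and same inode number).
pub fn is_same_file(a: &Path, b: &Path) -> io::Result<bool> {
    let (ma, mb) = (fs::metadata(a)?, fs::metadata(b)?);
    Ok(ma.dev() == mb.dev() && ma.ino() == mb.ino())
}

/// Walk through the three properties of hardlinks inside `dir`.
/// Returns (linked, content seen through the first name after editing the
/// second, content of the second name after deleting the first).
pub fn hardlink_demo(dir: &Path) -> io::Result<(bool, String, String)> {
    let first = dir.join("first.txt");
    let second = dir.join("second.txt");

    fs::write(&first, "original")?;
    fs::hard_link(&first, &second)?;
    let linked = is_same_file(&first, &second)?;

    // fs::write truncates and overwrites the existing file in place,
    // so the edit is visible through every name of the file.
    fs::write(&second, "edited")?;
    let seen_via_first = fs::read_to_string(&first)?;

    // Deleting one name leaves the data reachable through the other.
    fs::remove_file(&first)?;
    let remaining = fs::read_to_string(&second)?;

    Ok((linked, seen_via_first, remaining))
}
```

An editor that saves by writing a new file and renaming it over the old name would break the link instead, which is the exception mentioned in the parenthesis above.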

This program will only deduplicate files larger than a single disk block (usually 4KB), because in many filesystems hardlinking tiny files may not actually save space. You can add the -s flag to dedupe small files, too.
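The size rule can be expressed as a small filter. This is a sketch under assumptions: the names worth_deduplicating and check_path are hypothetical, and st_blksize (exposed in Rust as MetadataExt::blksize) is used as an approximation of the disk block size, which may not be what dupe-krill itself consults.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::MetadataExt;
use std::path::Path;

/// Decide from the numbers alone whether a regular file is a candidate.
/// `include_small` plays the role of the -s flag in this sketch.
pub fn worth_deduplicating(len: u64, block_size: u64, include_small: bool) -> bool {
    // Empty files are always skipped; small files only unless asked for.
    len > 0 && (include_small || len > block_size)
}

/// Apply the rule to a path, skipping symlinks and special files.
pub fn check_path(path: &Path, include_small: bool) -> io::Result<bool> {
    // symlink_metadata does not follow symlinks, so a symlink is reported
    // as a symlink rather than as the file it points to.
    let meta = fs::symlink_metadata(path)?;
    if !meta.file_type().is_file() {
        return Ok(false);
    }
    // Assumption: the preferred I/O block size stands in for the disk block.
    Ok(worth_deduplicating(meta.len(), meta.blksize(), include_small))
}
```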

Nerding out about the fast deduplication algorithm

In short: it uses Rust's standard library BTreeMap for deduplication, but with a twist that allows it to compare files lazily, reading only as little file content as necessary.


Theoretically, you could find all duplicate files by putting them in a giant hash table aggregating file paths and using file content as the key:

HashMap<Vec<u8>, Vec<Path>>

but of course that would use ludicrous amounts of memory. You can fix it by using hashes of the content instead of the content itself.

BTW, I can't stress enough how mind-bogglingly improbable accidental cryptographic hash collisions are. It's not just "you're probably safe if you're lucky". It's "creating this many files would take more energy than our civilisation has ever produced in all of its history".

HashMap<[u8; 16], Vec<Path>>

but that's still pretty slow, since you still read entire content of all the files. You can save some work by comparing file sizes first:

HashMap<u64, HashMap<[u8; 20], Vec<Path>>>

but it helps only a little, since files with identical sizes are surprisingly common. You can eliminate a bit more of near-duplicates by comparing only beginnings of the files first:

HashMap<u64, HashMap<[u8; 20], HashMap<[u8; 20], Vec<Path>>>>

and then maybe compare only the ends, and maybe a few more fragments in the middle, etc.:

HashMap<u64, HashMap<[u8; 20], HashMap<[u8; 20], HashMap<[u8; 20], Vec<Path>>>>>
HashMap<u64, HashMap<[u8; 20], HashMap<[u8; 20], HashMap<[u8; 20], HashMap<[u8; 20], HashMap<[u8; 20], …>>>>>>
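To make the first two levels of that nesting concrete, here is a minimal sketch that groups by file size and hashes content only for files that share a size. It is an illustration, not the program's code. To stay dependency-free it uses FNV-1a, a 64-bit NON-cryptographic hash, so unlike the cryptographic hashes discussed above it offers no real collision guarantees. The names content_hash and find_duplicates are hypothetical.

```rust
use std::collections::HashMap;
use std::fs::{self, File};
use std::io::{self, Read};
use std::path::{Path, PathBuf};

/// Hash a whole file with FNV-1a (stand-in for a cryptographic hash).
fn content_hash(path: &Path) -> io::Result<u64> {
    let mut file = File::open(path)?;
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV-1a 64-bit offset basis
    let mut buf = [0u8; 8192];
    loop {
        let n = file.read(&mut buf)?;
        if n == 0 {
            break; // end of file
        }
        for &byte in &buf[..n] {
            hash ^= u64::from(byte);
            hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV 64-bit prime
        }
    }
    Ok(hash)
}

/// Return groups of paths whose size and content hash both match.
pub fn find_duplicates(paths: &[PathBuf]) -> io::Result<Vec<Vec<PathBuf>>> {
    // Level 1: file size, which costs no reads of file content.
    let mut by_size: HashMap<u64, Vec<&PathBuf>> = HashMap::new();
    for path in paths {
        by_size.entry(fs::metadata(path)?.len()).or_default().push(path);
    }

    let mut dupes = Vec::new();
    for (_, same_size) in by_size {
        if same_size.len() < 2 {
            continue; // a unique size cannot have a duplicate
        }
        // Level 2: content hash, computed only inside a same-size group.
        let mut by_hash: HashMap<u64, Vec<PathBuf>> = HashMap::new();
        for path in same_size {
            by_hash.entry(content_hash(path)?).or_default().push(path.clone());
        }
        dupes.extend(by_hash.into_iter().map(|(_, group)| group).filter(|g| g.len() > 1));
    }
    Ok(dupes)
}
```

Even in this form, every file that shares its size with another file is read from start to finish, which is the cost the deeper levels of nesting try to reduce.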

These endlessly nested hashmaps can be generalized. BTreeMap doesn't need to see the whole key at once. It only compares keys with each other, and the comparison can be done incrementally, by only reading enough of the file to show that its key is unique, without even knowing the full key.

BTreeMap<LazilyHashing<File>, Vec<Path>>

And that's what this program does (and a bit of wrangling with inodes).

The whole heavy lifting of deduplication is done by Rust's standard library BTreeMap and overloaded </> operators that incrementally hash the files (yes, operator overloading that does file I/O is a brilliant idea. I couldn't use <<, unfortunately).
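The idea of a key that reads its file only while being compared can be sketched as follows. This is a simplified illustration and not the program's implementation: the type LazyFile, the 4096-byte chunk size and the function group_duplicates are assumptions, the sketch caches raw chunks where the text above describes incremental hashing, it panics on I/O errors inside the comparison, and it keeps every file open, which would not scale to large trees.

```rust
use std::cell::RefCell;
use std::cmp::Ordering;
use std::collections::BTreeMap;
use std::fs::File;
use std::io::{self, Read};
use std::path::{Path, PathBuf};

const CHUNK: usize = 4096; // assumed chunk size for this sketch

/// A BTreeMap key whose content is read from disk only when a comparison needs it.
pub struct LazyFile {
    len: u64,
    file: RefCell<File>,
    chunks: RefCell<Vec<Vec<u8>>>, // chunks read so far, in file order
}

impl LazyFile {
    pub fn open(path: &Path) -> io::Result<Self> {
        let file = File::open(path)?;
        let len = file.metadata()?.len();
        Ok(LazyFile { len, file: RefCell::new(file), chunks: RefCell::new(Vec::new()) })
    }

    /// Return chunk `index`, reading sequentially up to it on first use.
    fn chunk(&self, index: usize) -> Vec<u8> {
        let mut chunks = self.chunks.borrow_mut();
        while chunks.len() <= index {
            let mut buf = Vec::with_capacity(CHUNK);
            let mut file = self.file.borrow_mut();
            (&mut *file).take(CHUNK as u64).read_to_end(&mut buf).expect("read failed");
            chunks.push(buf);
        }
        chunks[index].clone()
    }

    /// How many chunks of this file have been read so far.
    pub fn chunks_read(&self) -> usize {
        self.chunks.borrow().len()
    }
}

impl Ord for LazyFile {
    fn cmp(&self, other: &Self) -> Ordering {
        // Sizes first: files of different length never touch the disk.
        self.len.cmp(&other.len).then_with(|| {
            let total = (self.len as usize + CHUNK - 1) / CHUNK;
            for i in 0..total {
                let ord = self.chunk(i).cmp(&other.chunk(i));
                if ord != Ordering::Equal {
                    return ord; // stop at the first differing chunk
                }
            }
            Ordering::Equal
        })
    }
}

impl PartialOrd for LazyFile {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl PartialEq for LazyFile {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}

impl Eq for LazyFile {}

/// Group paths by content; every map entry with more than one path is a set of duplicates.
pub fn group_duplicates(paths: &[PathBuf]) -> io::Result<BTreeMap<LazyFile, Vec<PathBuf>>> {
    let mut map = BTreeMap::new();
    for path in paths {
        map.entry(LazyFile::open(path)?).or_insert_with(Vec::new).push(path.clone());
    }
    Ok(map)
}
```

In this sketch a file with a unique size is never read at all, a file that differs from its same-size neighbours in the first chunk has only that chunk read, and only true duplicates are read in full.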
