  • Language: Python
  • License: MIT License
  • Created almost 12 years ago
  • Updated about 1 year ago

bitrot

Detects bit rotten files on the hard drive to save your precious photo and music collection from slow decay.

Usage

Go to the desired directory and simply invoke:

$ bitrot

This will start digging through your directory structure recursively indexing all files found. The index is stored in a .bitrot.db file which is a SQLite 3 database.

Next time you run bitrot, it will add new files and update the index for files with a changed modification date. Most importantly, however, it will report all errors, e.g. files that changed on the hard drive but still have the same modification date.

All paths stored in .bitrot.db are relative so it's safe to rescan a folder after moving it to another drive. Just remember to move it in a way that doesn't touch modification dates. Otherwise the checksum database is useless.
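The check described in this section can be illustrated with a minimal sketch. It is not bitrot's actual code: the `classify` function, the in-memory `index` dictionary and the chunk size are assumptions made for illustration, standing in for the SQLite-backed index.

```python
import hashlib
import os


def sha1_of(path, chunk_size=16384):
    """Return the hex SHA1 digest of a file, read in binary chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def classify(path, index):
    """Compare a file against a hypothetical in-memory index.

    `index` maps a path to a (modification time, SHA1 hex digest) pair.
    """
    mtime = int(os.stat(path).st_mtime)
    checksum = sha1_of(path)
    if path not in index:
        return "new"
    old_mtime, old_checksum = index[path]
    if checksum == old_checksum:
        return "ok"
    if mtime != old_mtime:
        # Content and modification date both changed: a regular update.
        return "updated"
    # Content changed but the modification date did not: likely bit rot.
    return "bitrot"
```

This also shows why modification dates must survive a move: if every date changes, every corrupted file would be classified as a regular update.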

Performance

Performance obviously depends on how fast the underlying drive is. Historically the script was single-threaded because back in 2013 checksum calculations on a single core still outran typical drives, including the mobile SSDs of the day. In 2020 this is no longer the case, so the script now uses a process pool to calculate SHA1 hashes and perform stat() calls.
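As a rough illustration of the process pool approach, here is a minimal sketch using the standard library. The names `sha1_file` and `hash_all`, and the `executor_cls` parameter, are hypothetical and not taken from bitrot itself.

```python
import hashlib
from concurrent.futures import ProcessPoolExecutor


def sha1_file(path, chunk_size=16384):
    """Hash one file; runs inside a worker process."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return path, digest.hexdigest()


def hash_all(paths, workers=None, executor_cls=ProcessPoolExecutor):
    """Hash many files in parallel and return a {path: sha1} dictionary.

    `workers=None` lets the executor pick a default based on CPU count;
    `workers=1` effectively serializes the work.
    """
    with executor_cls(max_workers=workers) as pool:
        return dict(pool.map(sha1_file, paths))


if __name__ == "__main__":
    import sys

    for path, checksum in hash_all(sys.argv[1:]).items():
        print(checksum, path)
```

Processes rather than threads are the natural fit when hashing is CPU-bound; on a slow magnetic drive the parallel reads can cause extra seeking, which is why limiting the worker count may help there.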

No rigorous performance tests have been done. Scanning a ~1000 file directory totalling ~5 GB takes 2.2s on a 2018 MacBook Pro 15" with an AP0512M SSD. Back in 2013, that same feat on a 2015 MacBook Air with an SM0256G SSD took over 20 seconds.

On that same 2018 MacBook Pro 15", scanning a 60+ GB music library takes 24 seconds. Back in 2013, with a typical 5400 RPM laptop hard drive it took around 15 minutes. How times have changed!

Tests

There's a simple but comprehensive test scenario using pytest and pytest-order.

Install:

$ python3 -m venv .venv
$ . .venv/bin/activate
(.venv)$ pip install -e .[test]

Run:

(.venv)$ pytest -x
==================== test session starts ====================
platform darwin -- Python 3.10.12, pytest-7.4.0, pluggy-1.2.0
rootdir: /Users/ambv/Documents/Python/bitrot
plugins: order-1.1.0
collected 12 items

tests/test_bitrot.py ............                      [100%]

==================== 12 passed in 15.05s ====================

Change Log

1.0.1

  • officially remove Python 2 support, which had been broken since 1.0.0 anyway; the package now requires Python 3.8+ because it relies on a few features introduced in that version

1.0.0

  • significantly sped up execution on solid state drives by using a process pool executor to calculate SHA1 hashes and perform stat() calls; use -w1 if your runs on slow magnetic drives were negatively affected by this change
  • sped up execution by pre-loading all SQLite-stored hashes to memory and doing comparisons using Python sets
  • all UTF-8 filenames are now normalized to NFKD in the database to enable cross-operating system checks
  • the SQLite database is now vacuumed to minimize its size
  • bugfix: additional Python 3 fixes when Unicode names were encountered
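The NFKD normalization and set-based comparison mentioned for 1.0.0 can be sketched with the standard library. This is an illustration only; `normalize_path` and the example file names are made up for the sketch.

```python
import unicodedata


def normalize_path(path):
    """Normalize a file name to NFKD so equivalent spellings compare equal."""
    return unicodedata.normalize("NFKD", path)


# The same visible name, as two different code point sequences:
composed = "caf\u00e9.jpg"     # "é" as a single code point
decomposed = "cafe\u0301.jpg"  # "e" followed by a combining acute accent

# Without normalization the names differ; after NFKD they are identical.
names_differ_raw = composed != decomposed
names_match_normalized = normalize_path(composed) == normalize_path(decomposed)

# With normalized names held in sets, finding missing and new paths
# is a matter of set differences.
stored = {normalize_path(composed)}
found = {normalize_path(decomposed)}
missing = stored - found
new = found - stored
```

Some operating systems hand back decomposed file names while others preserve the composed form, so without normalization the same file could show up as both missing and new when a folder is checked on a different system.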

0.9.2

  • bugfix: one place in the code incorrectly hardcoded UTF-8 as the filesystem encoding

0.9.1

  • bugfix: print the path that failed to decode with FSENCODING
  • bugfix: when using -q, don't hide warnings about files that can't be statted or read
  • bugfix: -s is no longer broken on Python 3

0.9.0

  • bugfix: .bitrot.db checksum checking messages now obey --quiet
  • Python 3 compatibility

0.8.0

  • bitrot now keeps track of its own database's bitrot by storing a checksum of .bitrot.db in .bitrot.sha512
  • bugfix: now properly uses the filesystem encoding to decode file names for use with the .bitrot.db database. Report and original patch by pallinger.
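The idea of guarding the database with its own checksum can be sketched as follows. The on-disk format assumed here (a hex SHA512 digest written as ASCII) and the function names are assumptions for illustration, not a description of what bitrot actually writes.

```python
import hashlib


def sha512_of(path, chunk_size=16384):
    """Return the hex SHA512 digest of a file, read in binary chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksum(db_path, checksum_path):
    """Store the database's digest next to it (assumed format: hex, ASCII)."""
    with open(checksum_path, "w", encoding="ascii") as f:
        f.write(sha512_of(db_path))


def database_is_intact(db_path, checksum_path):
    """Return True if the database still matches its stored digest."""
    with open(checksum_path, "r", encoding="ascii") as f:
        expected = f.read().strip()
    return sha512_of(db_path) == expected
```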

0.7.1

  • bugfix: SHA1 computation now works correctly on Windows; previously files were opened in text mode. This fix will change hashes of files containing some specific bytes like 0x1A.

0.7.0

  • when a file changes or is renamed, the timestamp of the last check is updated, too
  • bugfix: files that disappeared during the run are now properly ignored
  • bugfix: files that are locked or with otherwise denied access are skipped. If they were read before, they will be considered "missing" in the report.
  • bugfix: if there are multiple files with the same content in the scanned directory tree, renames are now handled properly for them
  • refactored some horrible code to be a little less horrible
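One way to handle renames when several files share the same content is to pair missing and newly found paths by checksum, one at a time. The sketch below shows that idea under stated assumptions; `detect_renames` is a hypothetical helper and this is not bitrot's actual algorithm.

```python
from collections import defaultdict


def detect_renames(old, new):
    """Pair disappeared and newly found paths that share a checksum.

    `old` and `new` map path -> checksum. Returns (renames, added, missing),
    where `renames` is a list of (old_path, new_path) pairs.
    """
    # Group the paths that disappeared by their checksum.
    missing_by_hash = defaultdict(list)
    for path in sorted(old.keys() - new.keys()):
        missing_by_hash[old[path]].append(path)

    renames = []
    added = []
    for path in sorted(new.keys() - old.keys()):
        candidates = missing_by_hash.get(new[path])
        if candidates:
            # Consume one candidate per new path so that duplicates
            # are each matched at most once.
            renames.append((candidates.pop(0), path))
        else:
            added.append(path)

    # Whatever was not consumed is truly missing.
    missing = sorted(p for paths in missing_by_hash.values() for p in paths)
    return renames, added, missing
```

With identical content there is no way to tell which duplicate became which new name, so the pairing within a group of duplicates is arbitrary; what matters is that the counts of renamed, added and missing files come out right.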

0.6.0

  • more control over performance with --commit-interval and --chunk-size command-line arguments
  • bugfix: symbolic links are now properly skipped (or can be followed if --follow-links is passed)
  • bugfix: files that cannot be opened are now gracefully skipped
  • bugfix: fixed a rare division by zero when run in an empty directory

0.5.1

  • bugfix: warn about test mode only in test mode

0.5.0

  • --test command-line argument for testing the state without updating the database on disk (works for testing databases you don't have write access to)
  • size of the data read is reported upon finish
  • minor performance updates

0.4.0

  • renames are now reported as such
  • all non-regular files (e.g. symbolic links, pipes, sockets) are now skipped
  • progress presented in percentage

0.3.0

  • --sum command-line argument for easy comparison of multiple databases

0.2.1

  • fixed regression from 0.2.0 where new files caused a KeyError exception

0.2.0

  • --verbose and --quiet command-line arguments
  • if a file is no longer there, its entry is removed from the database

0.1.0

  • First published version.

Authors

Glued together by Łukasz Langa. Multiple improvements by Ben Shepherd, Jean-Louis Fuchs, Marcus Linderoth, p1r473, Peter Hofmann, Phil Lundrigan, Reid Williams, Stan Senotrusov, Yang Zhang, and Zhuoyun Wei.
