List of data-hoarding related tools

Awesome-DataHoarding

Note: This is only a first draft/brainstorm. I will try to organize the list into more useful sections in the future.
Feel free to contribute!

Download utilities

Web Archiving

  • ArchiveBox: The open source self-hosted web archive. Takes browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
  • Browsertrix Crawler: Browsertrix Crawler is a simplified (Chrome) browser-based high-fidelity crawling system, designed to run a complex, customizable browser-based crawl in a single Docker container
  • Collect: A server to collect & archive websites that also supports video downloads
  • grab-site: The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns
  • Heritrix: Extensible, web-scale, archival-quality web crawler
  • HTTrack: Download a website from the Internet to a local directory
  • wail: Web Archiving Integration Layer: One-Click User Instigated Preservation
  • webrecorder: An integrated platform for creating high-fidelity, ISO-compliant web archives in a user-friendly interface, providing access to archived content, and sharing collections
  • wikiteam: Set of tools for archiving wikis

General

  • annie: YouTube-DL alternative written in Golang

  • aria2: A lightweight multi-protocol & multi-source command-line download utility

  • CrowLeer: Powerful C++ web crawler based on libcurl

  • curl: Tool and library for transferring data with URL syntax, supporting many protocols

  • Horahora: Video hosting website and video archival manager for Niconico, Bilibili, and YouTube

  • httpie: A tool similar to curl and wget but designed to be user friendly. Useful for web scraping with shell scripts, but be aware that doing so adds a dependency.

  • news-crawl: Crawler for news feeds based on StormCrawler that produces WARC files

  • Plowshare: Command-line tool to manage file-sharing sites

  • Rclone: A command line program to sync files and directories to and from various cloud storage providers

  • rsync: An open source utility that provides fast incremental file transfer

  • Suck-It: Recursively visit and download a website's content to your disk (multi-threaded)

  • wget: Utility for non-interactive download of files from the Web

  • wget2: Successor of GNU Wget, works multi-threaded

  • wpull: Wget-compatible web downloader and crawler

  • you-get: Dumb downloader that scrapes the web

  • ytdl-sub: Automate downloading and metadata generation with YouTubeDL

  • yt-dlp: A fork of youtube-dl that behaves better

Application-specific

  • BBCSoundDownloader: Bulk downloader for BBC's Sound Effects library http://bbcsfx.acropolis.org.uk/
  • ChanThreadWatch: Saves threads from *chan-style boards and checks for updates until the thread dies
  • comics-downloader: Command-line tool to download comics and manga in pdf/epub/cbz/cbr from supported sites
  • floatplane_ripper: Script to rip all videos from https://floatplane.rip/
  • gallery-dl: Download image galleries and collections from pixiv, exhentai, danbooru and more
  • Discord-Channel-Scraper: Discord server archival (JSON output, downloads attachments and emojis)
  • dzi-dl: Deep Zoom Image Downloader
  • FanFicFare: Tool for making eBooks from stories on fanfiction and other web sites
  • FicSave: Online fanfiction downloader. The source code is available, but the website is now offline.
  • flickr_download: Simple script to download a Flickr set
  • Google Images Download: Python script for downloading images
  • iiif-dl: Command-line tile downloader/assembler for IIIF endpoints/manifests
  • imgbrd-grabber: Very customizable imageboard/booru downloader with powerful filenaming features
  • instaloader: Download pictures (or videos) along with their captions and other metadata from Instagram
  • InstaLooter: API-less Instagram pictures and videos downloader.
  • Instagram Scraper: Instagram-scraper is a command-line application written in Python that scrapes and downloads an instagram user's photos and videos. Use responsibly.
  • PyInstaLive: Instagram live stream downloader
  • RedditDownloader: Scrapes Reddit to download media of your choice
  • Scribd-Downloader: Allows downloading of Scribd documents
  • snscrape: A social networking service scraper in Python
  • RipMe: RipMe is an album ripper for various websites. Runs on your computer. Requires Java 8.
  • Tube Archivist: Self-Hosted Docker container for automated/scheduled YouTube downloads of channels, playlists, etc.
  • tumblr-utils: Utilities for dealing with Tumblr blogs, Tumblr backup
  • yt-mango: YouTube metadata archiver
  • Youtube-MA: YouTube metadata archiver

Download automation

  • bazarr: Companion application to Sonarr and Radarr for downloading subtitles
  • FlexGet: Multipurpose automation tool for content like torrents, nzbs, podcasts, comics, series, movies, etc.
  • Jackett: API support for torrent trackers (works with Sonarr, Radarr and others)
  • Lidarr: Music collection manager for Usenet and BitTorrent users
  • Mylar: An automated Comic Book downloader (cbr/cbz) for use with SABnzbd, NZBGet and torrents
  • Sick-Beard: PVR for newsgroup users (with limited torrent support)
  • Radarr: A fork of Sonarr to work with movies à la CouchPotato
  • Sonarr: PVR for Usenet and BitTorrent users

Backup

  • BorgBackup: Deduplicating archiver with compression and encryption

Compression

  • 7-Zip: A file archiver with a high compression ratio
  • KGB Archiver: Compression tool with an unbelievably high compression ratio
  • peazip: File archiver utility
  • PIGZ: Multi-threaded gzip
  • WinRAR: Can decompress RAR and zip files

Network

  • NetLimiter: Internet traffic control and monitoring tool for Windows

File systems

File conversion

  • AAXtoMP3: Convert AAX files to common MP3, M4A, M4B, FLAC and Ogg formats through a basic bash script frontend to FFmpeg
  • html2warc: Convert web resources to a single WARC file
  • warcat: Tool and library for handling Web ARChive (WARC) files

Utility Scripts

Content sharing

  • h5ai: HTTP web server index for Apache httpd, lighttpd, nginx and Cherokee
  • ipfs: Protocol and network designed to create a content-addressable, peer-to-peer method of storing and sharing hypermedia in a distributed file system
  • opds: Easy to use, Open & Decentralized Content Distribution
  • Syncthing: An application that lets you synchronize your files across multiple devices

Data curation

  • baobab: Graphical disk usage analyzer
  • beets: Music library manager and MusicBrainz tagger
  • browsemonkey: Takes snapshots of file systems for offline browsing and searching.
  • Calibre: Ebook manager
  • DataCurator-Filetree: A unified filetree for all kinds of data, which should help in storing, categorising and retrieving
  • DeepSort: AI powered image tagger backed by DeepDetect
  • diskover: File system crawler, disk space usage, file search engine and file system analytics powered by Elasticsearch
  • Everything: Locate files and folders by name instantly (Windows)
  • FileBot: FileBot is the ultimate tool for organizing and renaming your Movies, TV Shows and Anime
  • fucking-weeb: A library manager for animu (and TV shows, and whatever).
  • grepWin: A powerful and fast search tool using regular expressions (Windows)
  • Hydrus: A desktop application for large media collections
  • Kiwix: An offline reader for online content like Wikipedia, Project Gutenberg, or TED Talks
  • jdupes: Powerful duplicate file finder
  • MediaElch: Media manager for Kodi
  • MediaInfo: Convenient unified display of the most relevant technical and tag data for video and audio files
  • Mp3tag: Powerful and easy-to-use tool to edit metadata of audio files (Windows/Mac)
  • phockup: Media sorting tool to organize photos and videos from your camera
  • picard: MusicBrainz tagger
  • TeraCopy: Copy your files faster and more securely
  • tree: 'tree' command for Linux
  • WinDirStat: Disk usage statistics viewer and cleanup tool for Windows
  • WizTree: Finds the files and folders using the most disk space on your hard drive
  • sist2: Lightning-fast file system indexer and search tool
  • SyncToy: Microsoft Windows tool for keeping files in sync across locations
  • VisiPics: Automatically finds duplicated images
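
Several entries above (jdupes, VisiPics) deal with finding duplicates. As a rough illustration of the general idea only, and not of how any listed tool is actually implemented, here is a minimal Python sketch that groups files by size and then by SHA-256 content hash. The function names `file_digest` and `find_duplicates` are hypothetical, chosen for this example.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Return sorted groups of paths under root that share identical content."""
    # Stage 1: bucket regular files by size (cheap, no file reads needed).
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    # Stage 2: hash only the files whose size collides with another file.
    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_digest(path)].append(path)
        groups.extend(sorted(p) for p in by_hash.values() if len(p) > 1)
    return sorted(groups)
```

Calling `find_duplicates("/path/to/collection")` returns a list of lists, one per set of identical files. The sketch only reports duplicates and never deletes anything. Real tools add refinements such as partial hashing, byte-for-byte confirmation and hard-link handling.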

APIs & Online tools

  • iqdb: Multi-service reverse image search
  • thetvdb: TV show metadata (used by Plex)

Hardware / Monitoring

  • CrystalDiskInfo: An HDD/SSD utility that supports some USB, Intel RAID and NVMe drives
  • GSmartControl: Easy to use Multi-OS S.M.A.R.T. utility with an easy to understand graphical interface
  • Hard Drive Sentinel: Multi-OS SSD and HDD monitoring and analysis software
  • smartmontools: Control and monitor storage systems using the Self-Monitoring, Analysis and Reporting Technology (SMART) built into most modern ATA/SATA, SCSI/SAS and NVMe disks

Data recovery

  • PhotoRec: Powerful FOSS data recovery tool with a GUI
  • TestDisk: Another FOSS tool by the author of PhotoRec, but this one is for the CLI

Local Media

  • whipper: Python CD-DA ripper preferring accuracy over speed. Generates .flac, .cue, and .log by default and automatically fetches metadata from MusicBrainz. EAC log plugin is available.
  • Exact Audio Copy: A freeware, Windows-only application similar to the above that doesn't automatically fetch metadata by default, but EAC rips are preferred by most trackers
  • MakeMKV: A cross-platform DVD ripper that supports recent Blu-ray discs. It's mostly open source, but the Blu-ray secret sauce is still hidden
  • Handbrake: Open source DVD ripper and media transcoder. Has more options and features than the above, but it cannot rip Blu-ray discs

Long-term data archiving

  • CommonCrawl: Data collected over seven years (ongoing) which contains web page data, extracted metadata and text extractions.
  • Blockyarchive: Archive with forward error correction and sector level recoverability
  • par2cmdline: A PAR 2.0 compatible file verification and repair tool
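
The verification half of what tools like par2cmdline offer can be approximated with a plain checksum manifest. The sketch below is an assumption-laden illustration in Python, not the PAR 2.0 format: it can detect missing or changed files but cannot repair them, which is what the recovery data in PAR2 and Blockyarchive is for. The names `sha256_of`, `build_manifest` and `verify_manifest` are hypothetical.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    manifest = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = sha256_of(full)
    return manifest


def verify_manifest(root, manifest):
    """Return the sorted relative paths that are missing or whose hash changed."""
    bad = []
    for rel, expected in manifest.items():
        full = os.path.join(root, rel)
        if not os.path.isfile(full) or sha256_of(full) != expected:
            bad.append(rel)
    return sorted(bad)
```

A typical use would be to call `build_manifest` when the archive is written, store the result alongside it (for example as JSON), and run `verify_manifest` periodically to catch silent corruption.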
