• Stars
    139
  • Rank 262,954 (Top 6 %)
  • Language
    Python
  • License
    MIT License
  • Created over 6 years ago
  • Updated almost 5 years ago

Repository Details

Distributed crawler, database and web frontend for public directories indexing

OD-Database

OD-Database is a web-crawling project that aims to index a very large number of file links and their basic metadata from open directories (misconfigured Apache/Nginx/FTP servers, or more often, mirrors of various public services).

Each crawler instance fetches tasks from the central server and pushes the results once a task is completed. A single instance can crawl hundreds of websites at the same time (both FTP and HTTP(S)), and the central server is capable of ingesting thousands of new documents per second.
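
The fetch/crawl/push cycle described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `Task`, `FakeServer`, `crawl` and `run_worker` are hypothetical names, and the central server is replaced by an in-memory stand-in so the example runs on its own.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Task:
    """Hypothetical crawl task: one website to index."""
    website_id: int
    url: str


@dataclass
class FakeServer:
    """In-memory stand-in for the central server (not the real API)."""
    pending: deque = field(default_factory=deque)
    results: dict = field(default_factory=dict)

    def fetch_task(self):
        # Return the next task, or None when the queue is empty.
        return self.pending.popleft() if self.pending else None

    def push_result(self, task, files):
        # Store the file list reported by a crawler for this website.
        self.results[task.website_id] = files


def crawl(task):
    """Placeholder crawl: a real crawler would list the remote directory."""
    return [{"path": task.url + "example.txt", "size": 0}]


def run_worker(server):
    """Fetch tasks until none remain, pushing each result once completed."""
    completed = 0
    while (task := server.fetch_task()) is not None:
        server.push_result(task, crawl(task))
        completed += 1
    return completed
```

A real crawler instance would run many such tasks concurrently and talk to the server over the network; the loop above only shows the order of operations.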

The data is indexed into Elasticsearch and made available via the web frontend (currently hosted at https://od-db.the-eye.eu/). There are currently ~1.93 billion files indexed (about 300 GB of raw data in total). The raw data is made available as a CSV file here.
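
To illustrate what indexing file records into Elasticsearch can look like, here is a small sketch that builds a request body in the newline-delimited format accepted by the Elasticsearch bulk API (an action line followed by a document line). The index name `od` and the `name`/`size` fields are assumptions made for this example; the project's real schema is not shown here.

```python
import json


def to_bulk_body(files, index_name="od"):
    """Build an NDJSON bulk body: one action line plus one source line per file.

    `index_name` and the document fields are hypothetical.
    """
    lines = []
    for doc in files:
        # Action line: tells Elasticsearch to index the next line as a document.
        lines.append(json.dumps({"index": {"_index": index_name}}))
        # Source line: the file's metadata.
        lines.append(json.dumps(doc))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n" if lines else ""
```

The resulting string would be sent to the `_bulk` endpoint with the `application/x-ndjson` content type; batching many documents per request is what makes high ingest rates practical.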

[Screenshot]

Contributing

Suggestions, concerns and PRs are welcome.

Installation (Docker)

git clone --recursive https://github.com/simon987/od-database
cd od-database
mkdir oddb_pg_data/ tt_pg_data/ es_data/ wsb_data/
docker-compose up

Architecture

[Architecture diagram]

Running the crawl server

The Python crawler that was part of this project is discontinued; the Go implementation is currently in use.

More Repositories

1

awesome-datahoarding

List of data-hoarding related tools
1,080
star
2

Much-Assembly-Required

Assembly programming game
Java
930
star
3

sist2

Lightning-fast file system indexer and search tool
C
869
star
4

ngx_http_js_challenge_module

Simple javascript proof-of-work based access for Nginx with virtually no overhead. (Similar to Cloudflare's anti-DDoS feature)
C
61
star
5

Architeuthis

MITM HTTP(S) proxy with integrated load-balancing, rate-limiting and error handling. Built for automated web scraping.
Go
41
star
6

opendirectories-bot

Python
33
star
7

Misc-Download-Scripts

Python
27
star
8

fastimagehash

C/C++ replacement for the 'imagehash' python package
C++
19
star
9

Simple-Incremental-Search-Tool

Simple web frontend to an elasticsearch database made for local files indexing
Python
18
star
10

yt-metadata

Script to import youtube-dl metadata to PostgreSQL
Python
14
star
11

Much-Assembly-Required-Frontend

Files for https://muchassemblyrequired.com/ frontend.
JavaScript
10
star
12

beemer

beemer executes a custom command on files written in the watched directory and deletes it.
Go
8
star
13

imagehash-web

Javascript replacement for the ImageHash python package
JavaScript
6
star
14

task_tracker

Fast task tracker (job queue) with authentication, statistics and web frontend
Go
5
star
15

reddit_feed

Fault-tolerant daemon that fetches comments & submissions from reddit
Python
4
star
16

dataarchivist.net

wip
HTML
3
star
17

vanwanet_scrape

Python requests wrapper with VanwaNet DDoS mitigation bypass (similar to cloudflare-scrape)
Python
3
star
18

chan_feed

Daemon that fetches posts from compatible *chan image boards
Python
3
star
19

sist2-script-clip

sist2 user script to generate CLIP embeddings
Python
3
star
20

bingo

wip toy project, please ignore
Python
3
star
21

status

Minimalist status page
Scala
2
star
22

castget-scripts

castget scripts to automate podcast download & transcoding
Python
2
star
23

opendirectories-bot-2

Reddit bot that interfaces with od-database
Python
2
star
24

sist2-scripts

Python
2
star
25

sist2-build-arm64

Docker image to build sist2 (arm64, tested with raspi 4B)
Dockerfile
2
star
26

sist2-script-whisper

Python
2
star
27

sist2-ner-models

NER models for sist2
Python
2
star
28

sist2-python

Set of python tools to interface with sist2 index files. Used in user scripts
Python
1
star
29

sist2-demo

Scripts for sist2 demo website
Shell
1
star
30

sist2-models

Official sist2 machine learning models
Python
1
star
31

scripts

Shell
1
star
32

cbr2cbz

Yet another cbr to cbz converter
C
1
star
33

ws_feed_adapter

Go
1
star
34

manga-dl

Python
1
star
35

hexlib

Misc utility methods in Python
Python
1
star
36

the-rom.eu

don't ask
HTML
1
star
37

nvidia-tf-lab-docker

Jupyterlab image with nvidia/tensorflow + some other packages
Dockerfile
1
star
38

simon987.net

Personal website
JavaScript
1
star
39

sist2-build

Docker image to build sist2
Dockerfile
1
star
40

task_tracker_drone

General purpose 'set and forget' task runner for task_tracker
Python
1
star
41

ffmpeg-thumbnail-size-viz

Python
1
star
42

feed_viz

JavaScript
1
star
43

music-graph-scripts

Utility scripts for music-graph
Python
1
star
44

pg_asciifold

asciifold C-Language function based on Lucene's ASCIIFoldingFilter
C
1
star
45

wacom-config

Config for wacom tablet mapping on linux
Shell
1
star
46

fastimagehash-go

go bindings for libfastimagehash
Go
1
star