Discover TeamHG-Memex/site-checker Open Source project

Use multiple proxies with Scrapy

2,758

scrapy-rotating-proxies

Log TensorBoard events without touching TensorFlow

656

tensorboard_logger

scikit-learn inspired API for CRFsuite

625

sklearn-crfsuite

Splash + HAProxy + Docker Compose

421

aquarium

Adaptive crawler that uses reinforcement learning methods

192

deep-deep

Web Crawling UI and HTTP API, based on Scrapy and Tornado

165

arachnado

A project that attempts to log in to a website automatically, given a single seed

156

autologin

115

html-text

Extract text from HTML

Formasaurus tells you the type of an HTML form and its fields using machine learning

115

Formasaurus

Simple heuristic for measuring web page similarity (& data set)

110

page-compare

Detect and classify pagination links

autopager

undercrawler

A generic crawler

Scrapy middleware that allows crawling only new content

scrapy-crawl-once

A classifier for detecting soft 404 pages

soft404

Agnostic Database Migrations

agnostic

Scrapy middleware for autologin

autologin-middleware

Read JSON Lines (jl) files, including gzipped and broken ones

json-lines

Extract the difference between two HTML pages

extract-html-diff

Scrapy extension which writes crawled items to Kafka

scrapy-kafka-export

A component that tries to avoid downloading duplicate content

MaybeDont

Site Hound (previously THH) is a Domain Discovery Tool

sitehound-frontend

Broad crawler for domain discovery

domain-discovery-crawler

Show summary of a large number of URLs in a Jupyter Notebook

url-summary

This is the facade for installation and access to the individual components

sitehound

Shell

tor-proxy

A Tor SOCKS proxy Docker image

scrapy-dockerhub

[UNMAINTAINED] Deploy, run and monitor your Scrapy spiders.

Annotate parts of web pages in the browser

web-page-annotator

A collection of example Lua scripts and JS utilities

scrash-lua-examples

JavaScript

scrapy-cdr

Item definition and utils for storing items in CDR format for scrapy

Headless Horseman Page Classifier service

hh-page-classifier

Privoxy HTTP Proxy based on jess/privoxy

privoxy

sitehound-backend

Sitehound's backend

[UNMAINTAINED] Firefox addon for Scrapely

fortia

JavaScript

proxy-middleware

Scrapy middleware that reads proxy config from settings

[UNMAINTAINED] A script (Scrapy spider) to check a list of URLs.

linkrot

[UNMAINTAINED] Scrapy spider to check link depth over time

hgprofiler

JavaScript

linkdepth

A naive scoring of Common Crawl content using MapReduce

common-crawl-mapreduce

Java

captcha-broad-crawl

Broad crawl of onion sites in search for captchas

Crawler-specific logic for Frontera

frontera-crawler

THH ↔ deep-deep integration

hh-deep-deep

[UNMAINTAINED] A middleware that provides continuous site login facility

scrapy-login

A BK-tree-based approach to storing and querying strings by Levenshtein distance.

bk-string

domainSpider

Simple web crawler that sticks to a set list of domains. Work in progress.

New iteration of QuickPin with Flask & AngularDart

quickpin

A Python wrapper for the bk-string C project.

py-bkstring

Middleware that limits the number of internal/external links during a broad crawl

broadcrawl

A simple tool to add a new user with OpenSSH keys.

sshadduser

autoregister

Python wrapper for the QuickPin API

quickpin-api

muricanize

rs-bkstring

scrash-pageuploader

[UNMAINTAINED] S3 Uploader pipelines for HTML and screenshots rendered by Splash

A set of scripts to spin up EC2 Frontera cluster with spiders

frontera-scripts