eli5
A library for debugging/inspecting machine learning classifiers and explaining their predictions

scrapy-rotating-proxies
use multiple proxies with Scrapy

tensorboard_logger
Log TensorBoard events without touching TensorFlow

sklearn-crfsuite
scikit-learn inspired API for CRFsuite

aquarium
Splash + HAProxy + Docker Compose

deep-deep
Adaptive crawler which uses Reinforcement Learning methods

arachnado
Web Crawling UI and HTTP API, based on Scrapy and Tornado

autologin
A project to attempt to automatically login to a website given a single seed

html-text
Extract text from HTML

Formasaurus
Formasaurus tells you the type of an HTML form and its fields using machine learning

page-compare
Simple heuristic for measuring web page similarity (& data set)

autopager
Detect and classify pagination links

undercrawler
A generic crawler

scrapy-crawl-once
Scrapy middleware which allows to crawl only new content

soft404
A classifier for detecting soft 404 pages

agnostic
Agnostic Database Migrations

autologin-middleware
Scrapy middleware for the autologin

json-lines
Read JSON lines (jl) files, including gzipped and broken

extract-html-diff
extract difference between two html pages

scrapy-kafka-export
Scrapy extension which writes crawled items to Kafka

MaybeDont
A component that tries to avoid downloading duplicate content

sitehound-frontend
Site Hound (previously THH) is a Domain Discovery Tool

imageSimilarity
Given a new image, determine if it is likely derived from a known image.

domain-discovery-crawler
Broad crawler for domain discovery

url-summary
Show summary of a large number of URLs in a Jupyter Notebook

tor-proxy
a tor socks proxy docker image

scrapy-dockerhub
[UNMAINTAINED] Deploy, run and monitor your Scrapy spiders.

web-page-annotator
Annotate parts of web pages in the browser

scrash-lua-examples
A collection of example LUA scripts and JS utilities

scrapy-cdr
Item definition and utils for storing items in CDR format for scrapy

hh-page-classifier
Headless Horseman Page Classifier service

privoxy
Privoxy HTTP Proxy based on jess/privoxy

sitehound-backend
Sitehound's backend

fortia
[UNMAINTAINED] Firefox addon for Scrapely

proxy-middleware
Scrapy middleware that reads proxy config from settings

linkrot
[UNMAINTAINED] A script (Scrapy spider) to check a list of URLs.

hgprofiler

linkdepth
[UNMAINTAINED] scrapy spider to check link depth over time

common-crawl-mapreduce
A naive scoring of commoncrawl's content using MR

captcha-broad-crawl
Broad crawl of onion sites in search for captchas

frontera-crawler
Crawler-specific logic for Frontera

hh-deep-deep
THH deep-deep integration

scrapy-login
[UNMAINTAINED] A middleware that provides continuous site login facility

bk-string
A BK Tree based approach to storing and querying strings by Levenshtein Distance.

domainSpider
Simple web crawler that sticks to a set list of domains. Work in progress.

quickpin
New iteration of QuickPin with Flask & AngularDart

py-bkstring
A python wrapper for the bk-string C project.

broadcrawl
Middleware that limits number of internal/external links during broad crawl

sshadduser
A simple tool to add a new user with OpenSSH keys.

autoregister

quickpin-api
Python wrapper for the QuickPin API

muricanize
A translation API

rs-bkstring

scrash-pageuploader
[UNMAINTAINED] S3 Uploader pipelines for HTML and screenshots rendered by Splash

site-checker

frontera-scripts
A set of scripts to spin up EC2 Frontera cluster with spiders