• Stars: 1
• Language: JavaScript
• Created over 9 years ago
• Updated about 9 years ago

More Repositories

1

eli5

A library for debugging/inspecting machine learning classifiers and explaining their predictions
Jupyter Notebook
2,758
star
2

scrapy-rotating-proxies

use multiple proxies with Scrapy
Python
656
star
3

tensorboard_logger

Log TensorBoard events without touching TensorFlow
Python
625
star
4

sklearn-crfsuite

scikit-learn inspired API for CRFsuite
Python
421
star
5

aquarium

Splash + HAProxy + Docker Compose
Python
192
star
6

deep-deep

Adaptive crawler which uses Reinforcement Learning methods
Jupyter Notebook
165
star
7

arachnado

Web Crawling UI and HTTP API, based on Scrapy and Tornado
Python
156
star
8

autologin

A project to attempt to automatically log in to a website given a single seed
Python
115
star
9

html-text

Extract text from HTML
HTML
115
star
10

Formasaurus

Formasaurus tells you the type of an HTML form and its fields using machine learning
HTML
110
star
11

page-compare

Simple heuristic for measuring web page similarity (& data set)
HTML
88
star
12

autopager

Detect and classify pagination links
HTML
86
star
13

undercrawler

A generic crawler
Python
75
star
14

scrapy-crawl-once

Scrapy middleware which allows crawling only new content
Python
74
star
15

soft404

A classifier for detecting soft 404 pages
Jupyter Notebook
53
star
16

agnostic

Agnostic Database Migrations
Python
51
star
17

autologin-middleware

Scrapy middleware for the autologin
Python
37
star
18

json-lines

Read JSON lines (jl) files, including gzipped and broken
Python
34
star
19

extract-html-diff

extract difference between two html pages
HTML
29
star
20

scrapy-kafka-export

Scrapy extension which writes crawled items to Kafka
Python
28
star
21

MaybeDont

A component that tries to avoid downloading duplicate content
Python
27
star
22

sitehound-frontend

Site Hound (previously THH) is a Domain Discovery Tool
HTML
23
star
23

imageSimilarity

Given a new image, determine if it is likely derived from a known image.
Python
20
star
24

domain-discovery-crawler

Broad crawler for domain discovery
Python
17
star
25

url-summary

Show summary of a large number of URLs in a Jupyter Notebook
Python
17
star
26

sitehound

This is the facade for installation and access to the individual components
Shell
16
star
27

tor-proxy

a tor socks proxy docker image
11
star
28

scrapy-dockerhub

[UNMAINTAINED] Deploy, run and monitor your Scrapy spiders.
Python
10
star
29

web-page-annotator

Annotate parts of web pages in the browser
Python
9
star
30

scrash-lua-examples

A collection of example Lua scripts and JS utilities
JavaScript
7
star
31

scrapy-cdr

Item definition and utils for storing items in CDR format for scrapy
Python
7
star
32

hh-page-classifier

Headless Horseman Page Classifier service
Python
6
star
33

privoxy

Privoxy HTTP Proxy based on jess/privoxy
6
star
34

sitehound-backend

Sitehound's backend
HTML
6
star
35

fortia

[UNMAINTAINED] Firefox addon for Scrapely
JavaScript
5
star
36

proxy-middleware

Scrapy middleware that reads proxy config from settings
Python
5
star
37

linkrot

[UNMAINTAINED] A script (Scrapy spider) to check a list of URLs.
Jupyter Notebook
4
star
38

hgprofiler

JavaScript
4
star
39

linkdepth

[UNMAINTAINED] scrapy spider to check link depth over time
Python
4
star
40

common-crawl-mapreduce

A naive scoring of commoncrawl's content using MR
Java
3
star
41

captcha-broad-crawl

Broad crawl of onion sites in search for captchas
Python
3
star
42

frontera-crawler

Crawler-specific logic for Frontera
Python
3
star
43

hh-deep-deep

THH ↔ deep-deep integration
Python
3
star
44

scrapy-login

[UNMAINTAINED] A middleware that provides continuous site login facility
Python
3
star
45

bk-string

A BK Tree based approach to storing and querying strings by Levenshtein Distance.
C
3
star
46

domainSpider

Simple web crawler that sticks to a set list of domains. Work in progress.
Python
3
star
47

quickpin

New iteration of QuickPin with Flask & AngularDart
Python
2
star
48

py-bkstring

A python wrapper for the bk-string C project.
Python
2
star
49

broadcrawl

Middleware that limits number of internal/external links during broad crawl
Python
2
star
50

sshadduser

A simple tool to add a new user with OpenSSH keys.
Python
2
star
51

autoregister

Python
2
star
52

quickpin-api

Python wrapper for the QuickPin API
Python
1
star
53

muricanize

A translation API
Python
1
star
54

rs-bkstring

Rust
1
star
55

scrash-pageuploader

[UNMAINTAINED] S3 Uploader pipelines for HTML and screenshots rendered by Splash
Python
1
star
56

frontera-scripts

A set of scripts to spin up EC2 Frontera cluster with spiders
Python
1
star