Discover scrapinghub/scrapinghub-stack-portia Open Source project

Lightweight, scriptable browser as a service with an HTTP API

8,991

splash

Python parser for human-readable dates

3,898

dateparser

A scalable frontier for web crawlers

2,525

frontera

A chat bot for Slack (https://slack.com).

1,288

slackbot

Extract embedded metadata from HTML markup

1,263

extruct

HTTP API for Scrapy spiders

832

scrapyrt

A Python binding for crfsuite

829

python-crfsuite

Scrapy extension for monitoring spider execution.

770

spidermon

Extract price amount and currency symbol from a raw text string

530

price-parser

Article extraction benchmark: dataset and evaluation scripts

307

article-extraction-benchmark

NER toolkit for HTML data

268

webstruct

A client interface for Scrapinghub's API

252

python-scrapinghub

Python parser for Adblock Plus filters

195

adblockparser

Convert JavaScript code to an XML document

187

js2xml

Useful test spiders for Scrapy

186

testspiders

Scrapy Training companion code

183

scrapy-training

Skinfer is a tool for inferring and merging JSON schemas

171

skinfer

Sample projects showcasing Scrapinghub tech

140

sample-projects

Scrapinghub Command Line Client

137

shub

An efficient simhash implementation for Python

125

python-simhash

122

scrapy-poet

Page Object pattern for Scrapy

Parse numbers written in natural language

119

number-parser

A Python library to detect and extract listing data from an HTML page.

108

mdr

106

web-poet

Web scraping Page Objects core library

Automatic Item List Extraction

aile

UNMAINTAINED Python wrapper for Wappalyzer (utility that uncovers the technologies used on websites)

wappalyzer-python

A Python implementation of DEPTA

pydepta

scrapinghub-stack-scrapy

Software stack with latest Scrapy and updated deps

Dockerfile

aduana

Frontera backend to guide a crawl using PageRank, HITS or other ranking algorithms based on the link structure of the web graph, even when making big crawls (one billion pages).

scrapy-autoextract

Zyte Automatic Extraction integration for Scrapy

Automatic unit test generation for Scrapy.

scrapy-autounit

Scrapinghub Learning Center. Report issues in Jira: https://scrapinghub.atlassian.net/projects/WEB

learn.scrapinghub.com

portia2code

arche

scmongo

MongoDB extensions for Scrapy

Exporters is an extensible export pipeline library that supports filtering, transformation, and several sources and destinations

exporters

More flexible and featured Frontera scheduler for Scrapy

webpager

Paginating the web

scrapy-frontera

A simple algorithm for clustering web pages, suitable for crawlers

page_clustering

Tool to flatten stream of JSON-like objects, configured via schema

flatson

Extensions for using Scrapy on Amazon AWS

scaws

Collection of Scrapy utilities (extensions, middlewares, pipelines, etc)

docker-images

Dockerfile

scrapylib

Speakers Spider (PyCon 2014 sprint)

pycon-speakers

PyPI caching service using devpi and Docker

docker-devpi

Shell

crawlera-tools

Crawlera tools

Scrapy entrypoint for Scrapinghub job runner

scrapinghub-entrypoint-scrapy

Restrict crawl and scraping scope using matchers.

scrapy-mosquitera

Library for annotation-based dependency injection

andi

kafka-scanner

High Level Kafka Scanner

Pre-built Scrapy spiders for AutoExtract

autoextract-spiders

Python bindings for CLD2.

python-cld2

Deprecated HubStorage client library - please use python-scrapinghub>=1.9.0 instead

product-extraction-benchmark

Jupyter Notebook

python-hubstorage

Pluggable DSL that uses pipes to perform a series of linear transformations to extract data

shublang

shub-workflow

Go bindings for Scrapinghub HTTP API and a sweet command line tool for Scrapy Cloud

shubc

tutorials

pastebin

navscraper

Vanguard ETF NAV scraper

A command line spider monitoring tool

varanus

Crawl Frontier HCF backend

hcf-backend

pydatanyc

web-poet definitions for AutoExtract

autoextract-poet

HubStorage collection scanner library

collection-scanner

locode

Golang parser for Adblock Plus filters

adblockgoparser

autoextract-examples

Jupyter Notebook

webstruct-demo

HTTP demo for https://github.com/scrapinghub/webstruct

Deprecated client side tool to prepare docker images to run crawlers in Scrapinghub - please use shub>=2.5.0 instead

shub-image

Run Cloudera Manager in docker

docker-cloudera-manager

Dockerfile

custom-images-examples

Examples of custom images running on Scrapinghub platform

hubstorage-frontera

Hubstorage crawl frontier backend for Frontera

httpation

[Work in progress] XPath & CSS for web scraping tutorial

xpathcsstutorial

Jupyter Notebook

epmdless_dist

egraylog

Conda packages for scrapinghub channel

scrapinghub-conda-recipes

Shell

pydaybot

Demo bot for Python Day Uruguay 2011

erl-iputils

Docker images for jhub cluster

jupyterhub-stacks

Compact Language Detector 2

cld2

C++

scrapinghub-stack-hworker

[DEPRECATED] Software stack fully compatible with Scrapy Cloud 1.0

crawlera.com

crawlera.com website

Use Google as Single-Sign-On provider for Discourse

discourse-sso-google