scrapinghub/scrapy-poet

Stars
119
Rank 297,930 (Top 6 %)
Language
Python
License
BSD 3-Clause "New...
Created about 5 years ago
Updated about 1 month ago

Page Object pattern for Scrapy

scrapy-poet

scrapy-poet is the web-poet Page Object pattern implementation for Scrapy. scrapy-poet lets you write spiders in which the extraction logic is separated from the crawling logic. With scrapy-poet, it is possible to build a single spider that supports many sites with different layouts.
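To make the idea of separating extraction from crawling concrete, here is a minimal, library-free sketch of the Page Object pattern. It does not use scrapy-poet's API: the class names, the `PAGE_OBJECTS` registry, the `crawl` function, and the `shop-a.example` / `shop-b.example` domains are all hypothetical, and already-parsed dictionaries stand in for real HTTP responses.

```python
from urllib.parse import urlparse


class ShopAPage:
    """Extraction logic for a hypothetical layout that stores the title under "name"."""

    def __init__(self, url, data):
        self.url = url
        self.data = data

    def to_item(self):
        return {"url": self.url, "title": self.data["name"]}


class ShopBPage:
    """Extraction logic for a second hypothetical layout with a nested title."""

    def __init__(self, url, data):
        self.url = url
        self.data = data

    def to_item(self):
        return {"url": self.url, "title": self.data["product"]["title"]}


# Hypothetical registry mapping each site to the page object that understands it.
PAGE_OBJECTS = {
    "shop-a.example": ShopAPage,
    "shop-b.example": ShopBPage,
}


def crawl(responses):
    """Crawling logic: choose a page object by domain, never inspect the layout."""
    for url, data in responses:
        page_cls = PAGE_OBJECTS[urlparse(url).netloc]
        yield page_cls(url, data).to_item()
```

The crawling function stays the same when a new site is added; only a new page object class and a registry entry are needed.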

Read the documentation for more information.

License is BSD 3-clause.

Documentation: https://scrapy-poet.readthedocs.io
Source code: https://github.com/scrapinghub/scrapy-poet
Issue tracker: https://github.com/scrapinghub/scrapy-poet/issues

Quick Start

Installation

pip install scrapy-poet

Requires Python 3.8+ and Scrapy >= 2.6.0.

Usage in a Scrapy Project

Add the following inside Scrapy's settings.py file:

DOWNLOADER_MIDDLEWARES = {
    "scrapy_poet.InjectionMiddleware": 543,
}
SPIDER_MIDDLEWARES = {
    "scrapy_poet.RetryMiddleware": 275,
}

Developing

Set up your local Python environment via:

pip install -r requirements-dev.txt
pre-commit install

Now, every time you perform a git commit, these tools will run against the staged files:

black
isort
flake8

You can also directly invoke pre-commit run --all-files or tox -e linters to run them without performing a commit.

portia

Visual scraping for Scrapy

splash

Lightweight, scriptable browser as a service with an HTTP API

dateparser

python parser for human readable dates

frontera

A scalable frontier for web crawlers

slackbot

A chat bot for Slack (https://slack.com).

extruct

Extract embedded metadata from HTML markup

scrapyrt

HTTP API for Scrapy spiders

python-crfsuite

A python binding for crfsuite

spidermon

Scrapy Extension for monitoring spiders execution.

price-parser

Extract price amount and currency symbol from a raw text string

article-extraction-benchmark

Article extraction benchmark: dataset and evaluation scripts

webstruct

NER toolkit for HTML data

python-scrapinghub

A client interface for Scrapinghub's API

adblockparser

Python parser for Adblock Plus filters

js2xml

Convert Javascript code to an XML document

testspiders

Useful test spiders for Scrapy

scrapy-training

Scrapy Training companion code

skinfer

Skinfer is a tool for inferring and merging JSON schemas

sample-projects

Sample projects showcasing Scrapinghub tech

shub

Scrapinghub Command Line Client

python-simhash

An efficient simhash implementation for python

number-parser

Parse numbers written in natural language

mdr

A Python library to detect and extract listing data from HTML pages.

web-poet

Web scraping Page Objects core library

aile

Automatic Item List Extraction

wappalyzer-python

UNMAINTAINED Python wrapper for Wappalyzer (utility that uncovers the technologies used on websites)

pydepta

A python implementation of DEPTA

scrapinghub-stack-scrapy

Software stack with latest Scrapy and updated deps

aduana

Frontera backend to guide a crawl using PageRank, HITS or other ranking algorithms based on the link structure of the web graph, even when making big crawls (one billion pages).

scrapy-autoextract

Zyte Automatic Extraction integration for Scrapy

scrapy-autounit

Automatic unit test generation for Scrapy.

learn.scrapinghub.com

Scrapinghub Learning Center. Report issues in Jira: https://scrapinghub.atlassian.net/projects/WEB

portia2code

arche

Analyze scraped data

scmongo

MongoDB extensions for Scrapy

exporters

Exporters is an extensible export pipeline library that supports filter, transform and several sources and destinations

webpager

Paginating the web

scrapy-frontera

More flexible and featured Frontera scheduler for Scrapy

page_clustering

A simple algorithm for clustering web pages, suitable for crawlers

flatson

Tool to flatten stream of JSON-like objects, configured via schema

scaws

Extensions for using Scrapy on Amazon AWS

docker-images

scrapylib

Collection of Scrapy utilities (extensions, middlewares, pipelines, etc)

pycon-speakers

Speakers Spider (PyCon 2014 sprint)

docker-devpi

pypi caching service using devpi and docker

crawlera-tools

scrapinghub-entrypoint-scrapy

Scrapy entrypoint for Scrapinghub job runner

scrapy-mosquitera

Restrict crawl and scraping scope using matchers.

andi

Library for annotation-based dependency injection

kafka-scanner

High Level Kafka Scanner

autoextract-spiders

Pre-built Scrapy spiders for AutoExtract

python-cld2

Python bindings for CLD2.

product-extraction-benchmark

Jupyter Notebook

python-hubstorage

Deprecated HubStorage client library - please use python-scrapinghub>=1.9.0 instead

shublang

Pluggable DSL that uses pipes to perform a series of linear transformations to extract data

shub-workflow

shubc

Go bindings for Scrapinghub HTTP API and a sweet command line tool for Scrapy Cloud

scrapinghub-stack-portia

Software stack used to run Portia spiders in Scrapinghub cloud

tutorials

pastebin

navscraper

Vanguard ETF NAV scraper

varanus

A command line spider monitoring tool

hcf-backend

Crawl Frontier HCF backend

pydatanyc

autoextract-poet

web-poet definitions for AutoExtract

collection-scanner

HubStorage collection scanner library

locode

adblockgoparser

Golang parser for Adblock Plus filters

autoextract-examples

Jupyter Notebook

webstruct-demo

HTTP demo for https://github.com/scrapinghub/webstruct

shub-image

Deprecated client side tool to prepare docker images to run crawlers in Scrapinghub - please use shub>=2.5.0 instead

docker-cloudera-manager

Run Cloudera Manager in docker

custom-images-examples

Examples of custom images running on Scrapinghub platform

hubstorage-frontera

Hubstorage crawl frontier backend for Frontera

httpation

xpathcsstutorial

[Work in progress] XPath & CSS for web scraping tutorial

Jupyter Notebook

epmdless_dist

egraylog

scrapinghub-conda-recipes

Conda packages for scrapinghub channel

pydaybot

Demo bot for Python Day Uruguay 2011

erl-iputils

jupyterhub-stacks

Docker images for a jhub cluster

cld2

Compact Language Detector 2

scrapinghub-stack-hworker

[DEPRECATED] Software stack fully compatible with Scrapy Cloud 1.0

crawlera.com

crawlera.com website

discourse-sso-google

Use Google as Single-Sign-On provider for Discourse

pkg-opengrok

Ubuntu packaging for OpenGrok