• Stars: 112
• Rank: 312,240 (Top 7%)
• Language: Python
• License: MIT License
• Created: almost 13 years ago
• Updated: almost 2 years ago

Repository Details

A decorator to write coroutine-like spider callbacks.

Scrapy Inline Requests

A decorator for writing coroutine-like spider callbacks.

Quickstart

The spider below shows a simple use case of scraping a page and following a few links:

from inline_requests import inline_requests
from scrapy import Spider, Request

class MySpider(Spider):
    name = 'myspider'
    start_urls = ['http://httpbin.org/html']

    @inline_requests
    def parse(self, response):
        urls = [response.url]
        for i in range(10):
            next_url = response.urljoin('?page=%d' % i)
            try:
                # the yield suspends the callback until the response arrives
                next_resp = yield Request(next_url)
                urls.append(next_resp.url)
            except Exception:
                self.logger.info("Failed request %s", i, exc_info=True)

        yield {'urls': urls}

See the examples/ directory for a more complex spider.
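
Assuming the Quickstart snippet above is saved as myspider.py (file and output names here are only examples), it can be run standalone with Scrapy's runspider command:

scrapy runspider myspider.py -o urls.json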

Warning

The generator resumes its execution when a request's response is processed; this means the generator won't be resumed after yielding an item or a request with its own callback.
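
For illustration, here is a minimal sketch (the spider name and URLs are made up) of where execution stops according to the warning above:

from inline_requests import inline_requests
from scrapy import Spider, Request

class DetailSpider(Spider):
    name = 'detailspider'
    start_urls = ['http://httpbin.org/html']

    @inline_requests
    def parse(self, response):
        # A plain Request yield suspends the callback and resumes it
        # with the downloaded response.
        detail = yield Request(response.urljoin('/links/2'))
        # Per the warning: yielding an item (or a Request with its own
        # callback) does not resume the generator, so keep such yields last.
        yield {'url': detail.url}
        self.logger.debug("this line is never reached")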

Known Issues

  • Middlewares can drop or ignore non-200 status responses, preventing the callback from continuing its execution. This can be overcome by using the handle_httpstatus_all flag; see the httperror middleware documentation and the sketch after this list.
  • High concurrency and large responses can cause higher memory usage.
  • This decorator assumes your method has the signature (self, response).
  • Wrapped requests may not be serializable by persistent backends.
  • Unless you know what you are doing, the decorated method must be a spider method and return a generator instance.
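
As a sketch of the first point (spider name and URLs are illustrative), the handle_httpstatus_all flag can be passed through the request meta so the httperror middleware does not filter out non-200 responses:

from inline_requests import inline_requests
from scrapy import Spider, Request

class StatusSpider(Spider):
    name = 'statusspider'
    start_urls = ['http://httpbin.org/html']

    @inline_requests
    def parse(self, response):
        # Ask the httperror middleware to pass non-200 responses through,
        # so the inline callback keeps running instead of being dropped.
        resp = yield Request(
            response.urljoin('/status/503'),
            meta={'handle_httpstatus_all': True},
        )
        yield {'url': resp.url, 'status': resp.status}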

More Repositories

1. scrapy-redis: Redis-based components for Scrapy. (Python, 5,503 stars)
2. dirbot-mysql: Scrapy project based on dirbot to show how to use Twisted's adbapi to store the scraped data in MySQL. (Python, 117 stars)
3. django-dummyimage: Dynamic Dummy Image Generator For Django! (Python, 55 stars)
4. scrapy-boilerplate: Small set of utilities to simplify writing Scrapy spiders. (Python, 49 stars)
5. scrapydo: Crochet-based blocking API for Scrapy. (Jupyter Notebook, 46 stars)
6. databrewer: The missing datasets manager. Like homebrew, but for datasets. A CLI tool to search and discover datasets! (Python, 41 stars)
7. databrewer-recipes: DataBrewer Recipes Repository. (Python, 21 stars)
8. django-on-tornado: Run Django on the Tornado web server. (Python, 15 stars)
9. webfaction-stuff: Random stuff to manage your own WebFaction hosting. (Python, 9 stars)
10. parsel-cli: Parsel Command Line Interface. (Python, 9 stars)
11. leveldict: LevelDB dict-like wrappers. (Python, 7 stars)
12. cookiecutter-scrapycloud: A bare-minimum Scrapy project template ready for Scrapinghub's Scrapy Cloud service. (Python, 7 stars)
13. Facebook-Hacker-Cup-Results (C++, 6 stars)
14. txrho: Misc stuff on top of Twisted/Cyclone. (Python, 6 stars)
15. Django-Dash-2010: Repository for Django Dash 2010. (JavaScript, 6 stars)
16. awesome-codename: Generate awesome codenames. (Makefile, 5 stars)
17. Random-Code: Random code. (Python, 4 stars)
18. dask-avro: Avro reader for Dask. (Python, 4 stars)
19. mit-ocw-crawler: MIT's OCW Crawler. (Python, 4 stars)
20. anaconda-manylinux-builder: Scripts to build manylinux wheels in Travis CI and upload them to Anaconda.org. (Shell, 3 stars)
21. persistent-homology-examples: Examples of computing the persistent homology of miscellaneous data sets. (3 stars)
22. yatiri (Python, 3 stars)
23. programming-challenges: My attempt to improve my algorithm skills, starting from the basics. (C++, 3 stars)
24. dask-kafka: Dask-Kafka reader. (Python, 2 stars)
25. dotfiles: My dot files. DEPRECATED. Go -> https://github.com/rmax/dotfiles-ng (Vim Script, 2 stars)
26. dockerfiles: Collection of dockerfiles. (Shell, 2 stars)
27. scrapy-slidebot: A collection of spiders to download slides as PDFs from popular sites like SlideShare and Speaker Deck. (Python, 2 stars)
28. gyst: A Pythonic tool to post gists. (Python, 2 stars)
29. haanga-benchs: Haanga's benchmarks ported to the Tornado framework. (PHP, 2 stars)
30. scrapyorg-infinit-crawler (Python, 1 star)
31. rmax.github.io (CSS, 1 star)
32. code-katas: My code katas. (Python, 1 star)
33. fastavro-codecs (1 star)
34. login_signup: Friendly login + signup form. (JavaScript, 1 star)
35. lmbot (1 star)
36. cookiecutter-datapackage (Makefile, 1 star)
37. rmax (1 star)
38. ipynb: Assorted collection of IPython notebooks. (1 star)
39. django-ipcountry (Python, 1 star)
40. dask-elasticsearch: An Elasticsearch reader for Dask. (Python, 1 star)
41. python-benchmarks: Assorted Python-based benchmarks. (Python, 1 star)
42. binary-repr: Converts integers to binary representation. (Python, 1 star)
43. django_inline_example: Django dynamic inline example. (Python, 1 star)
44. yammh3: Yet another MurmurHash3 bindings. (Python, 1 star)
45. my-django-project-template (CSS, 1 star)
46. pmwiki-authelgg (PHP, 1 star)
47. omp-thread-count: A small Python module to get the actual number of threads used by OMP, via Cython bindings. (Python, 1 star)
48. zend-ajax-form-test (PHP, 1 star)
49. rho-blogs-crawler: A Scrapy project to export my legacy blogs. (Python, 1 star)
50. dotfiles-ng: YADM-managed dot files. (Vim Script, 1 star)