  • Stars: 5,503
  • Rank: 7,461 (Top 0.2%)
  • Language: Python
  • License: MIT License
  • Created: over 13 years ago
  • Updated: 6 months ago



Scrapy-Redis


Redis-based components for Scrapy.

Features

  • Distributed crawling/scraping

    You can start multiple spider instances that share a single Redis queue. Best suited for broad multi-domain crawls.

  • Distributed post-processing

    Scraped items get pushed into a Redis queue, meaning you can start as many post-processing processes as needed, all sharing the same items queue.

  • Scrapy plug-and-play components

    Scheduler + Duplication Filter, Item Pipeline, Base Spiders.

  • In this forked version: support for JSON-formatted data in Redis

    Each queue entry contains `url`, `meta`, and other optional parameters. `meta` is a nested JSON object carrying sub-data. This feature extracts the data and sends another FormRequest with the `url`, `meta`, and additional form data.

    For example:

    { "url": "https://example.com", "meta": {"job-id": "123xsd", "start-date": "dd/mm/yy"}, "url_cookie_key": "fertxsas" }

    This data can be accessed in a Scrapy spider through the request object, e.g. `request.url`, `request.meta`, `request.cookies`.
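
To make the shape of such a queue entry concrete, here is a pure-Python sketch of turning one JSON entry into request keyword arguments. The function name `parse_queue_entry` and the returned structure are illustrative assumptions, not the fork's actual internals:

```python
import json

def parse_queue_entry(data: bytes) -> dict:
    """Turn one JSON entry popped from the Redis queue into request kwargs.

    Illustrative sketch only: the name and return shape are assumptions,
    not the fork's actual API.
    """
    entry = json.loads(data)
    return {
        "url": entry["url"],
        # `meta` is the nested JSON object carried onto the request
        "meta": entry.get("meta", {}),
        # everything else is treated as optional extra parameters
        "extra": {k: v for k, v in entry.items() if k not in ("url", "meta")},
    }

kwargs = parse_queue_entry(
    b'{"url": "https://example.com", "meta": {"job-id": "123xsd"}, "url_cookie_key": "fertxsas"}'
)
print(kwargs["url"])             # https://example.com
print(kwargs["meta"]["job-id"])  # 123xsd
```

In a real spider these kwargs would feed a FormRequest, whose `meta` and cookies are then available on the request as described above.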

Note

These features cover the basic case of distributing the workload across multiple workers. If you need more features, such as URL expiration or advanced URL prioritization, we suggest you take a look at the Frontera project.
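
For reference, the plug-and-play components listed above are wired up through Scrapy settings. A minimal sketch of a project's settings.py, using setting names as documented by scrapy-redis (the Redis URL is a placeholder):

```python
# Enables scheduling of requests through Redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Ensures all spider instances share the same duplicates filter via Redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the Redis queues when the spider closes, allowing pause/resume.
SCHEDULER_PERSIST = True

# Store scraped items in Redis for distributed post-processing.
ITEM_PIPELINES = {
    "scrapy_redis.pipelines.RedisPipeline": 300,
}

# Connection URL for your Redis server (placeholder value).
REDIS_URL = "redis://localhost:6379"
```

With these settings, any number of spider processes started against the same Redis server will share one request queue and one duplicates filter.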

Requirements

  • Python 3.7+
  • Redis >= 5.0
  • Scrapy >= 2.0
  • redis-py >= 4.0

Installation

From pip

pip install scrapy-redis

From GitHub

git clone https://github.com/darkrho/scrapy-redis.git
cd scrapy-redis
python setup.py install

Note

To use the JSON-supported data feature, make sure scrapy-redis is not installed through pip. If it is, uninstall it first:

pip uninstall scrapy-redis

Alternative Choice

Frontera is a web crawling framework consisting of a crawl frontier and distribution/scaling primitives, allowing you to build large-scale online web crawlers.

More Repositories

1. dirbot-mysql (Python, 117 stars): Scrapy project based on dirbot to show how to use Twisted's adbapi to store the scraped data in MySQL.
2. scrapy-inline-requests (Python, 112 stars): A decorator to write coroutine-like spider callbacks.
3. django-dummyimage (Python, 55 stars): Dynamic dummy image generator for Django.
4. scrapy-boilerplate (Python, 49 stars): Small set of utilities to simplify writing Scrapy spiders.
5. scrapydo (Jupyter Notebook, 46 stars): Crochet-based blocking API for Scrapy.
6. databrewer (Python, 41 stars): The missing datasets manager. Like Homebrew, but for datasets; a CLI tool to search and discover datasets.
7. databrewer-recipes (Python, 21 stars): DataBrewer recipes repository.
8. django-on-tornado (Python, 15 stars): Run Django on the Tornado web server.
9. webfaction-stuff (Python, 9 stars): Random stuff to manage your own WebFaction hosting.
10. parsel-cli (Python, 9 stars): Parsel command-line interface.
11. leveldict (Python, 7 stars): LevelDB dict-like wrappers.
12. cookiecutter-scrapycloud (Python, 7 stars): A bare-minimum Scrapy project template ready for Scrapinghub's Scrapy Cloud service.
13. Facebook-Hacker-Cup-Results (C++, 6 stars)
14. txrho (Python, 6 stars): Misc stuff on top of Twisted/Cyclone.
15. Django-Dash-2010 (JavaScript, 6 stars): Repository for Django Dash 2010.
16. awesome-codename (Makefile, 5 stars): Generate awesome codenames.
17. Random-Code (Python, 4 stars): Random code.
18. dask-avro (Python, 4 stars): Avro reader for Dask.
19. mit-ocw-crawler (Python, 4 stars): MIT's OCW crawler.
20. anaconda-manylinux-builder (Shell, 3 stars): Scripts to build manylinux wheels in Travis CI and upload them to Anaconda.org.
21. persistent-homology-examples (3 stars): Examples of computing the persistent homology of miscellaneous data sets.
22. yatiri (Python, 3 stars)
23. programming-challenges (C++, 3 stars): My attempt to improve my algorithm skills, starting from the basics.
24. dask-kafka (Python, 2 stars): Dask-Kafka reader.
25. dotfiles (Vim Script, 2 stars): My dot files. DEPRECATED; go to https://github.com/rmax/dotfiles-ng.
26. dockerfiles (Shell, 2 stars): Collection of Dockerfiles.
27. scrapy-slidebot (Python, 2 stars): A collection of spiders to download slides as PDFs from popular sites like SlideShare and Speaker Deck.
28. gyst (Python, 2 stars): A Pythonic tool to post gists.
29. haanga-benchs (PHP, 2 stars): Haanga's benchmarks ported over the Tornado framework.
30. scrapyorg-infinit-crawler (Python, 1 star)
31. rmax.github.io (CSS, 1 star)
32. code-katas (Python, 1 star): My code katas.
33. fastavro-codecs (1 star)
34. login_signup (JavaScript, 1 star): Friendly login + signup form.
35. lmbot (1 star)
36. cookiecutter-datapackage (Makefile, 1 star)
37. rmax (1 star)
38. ipynb (1 star): Assorted collection of IPython notebooks.
39. django-ipcountry (Python, 1 star)
40. dask-elasticsearch (Python, 1 star): An Elasticsearch reader for Dask.
41. python-benchmarks (Python, 1 star): Assorted Python-based benchmarks.
42. binary-repr (Python, 1 star): Converts integers to binary representation.
43. django_inline_example (Python, 1 star): Django dynamic inline example.
44. yammh3 (Python, 1 star): Yet another MurmurHash3 bindings.
45. my-django-project-template (CSS, 1 star)
46. pmwiki-authelgg (PHP, 1 star)
47. omp-thread-count (Python, 1 star): A small Python module to get the actual number of threads used by OMP, via Cython bindings.
48. zend-ajax-form-test (PHP, 1 star)
49. rho-blogs-crawler (Python, 1 star): A Scrapy project to export my legacy blogs.
50. dotfiles-ng (Vim Script, 1 star): YADM-managed dot files.