  • Stars: 5,503
  • Rank: 7,461 (Top 0.2%)
  • Language: Python
  • License: MIT License
  • Created: over 13 years ago
  • Updated: 6 months ago



Scrapy-Redis


Redis-based components for Scrapy.

Features

  • Distributed crawling/scraping

    You can start multiple spider instances that share a single Redis queue. Best suited for broad multi-domain crawls.

  • Distributed post-processing

    Scraped items get pushed into a Redis queue, meaning you can start as many post-processing processes as needed, all sharing the same items queue.

  • Scrapy plug-and-play components

    Scheduler + Duplication Filter, Item Pipeline, Base Spiders.

  • In this forked version: support for JSON-formatted data in Redis

    Each queue entry contains `url`, `meta`, and other optional parameters. `meta` is a nested JSON object carrying sub-data. This feature extracts the data and sends another FormRequest with the `url`, `meta`, and additional form data.

    For example:

    { "url": "https://example.com", "meta": {"job-id": "123xsd", "start-date": "dd/mm/yy"}, "url_cookie_key": "fertxsas" }

    This data can be accessed in a Scrapy spider through the request object, e.g. `request.url`, `request.meta`, `request.cookies`.
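
To make the shape of such a queue entry concrete, here is a pure-Python sketch of turning one JSON entry into request keyword arguments. The function name `parse_queue_entry` and the returned structure are illustrative assumptions, not the fork's actual internals:

```python
import json

def parse_queue_entry(data: bytes) -> dict:
    """Turn one JSON entry popped from the Redis queue into request kwargs.

    Illustrative sketch only: the name and return shape are assumptions,
    not the fork's actual API.
    """
    entry = json.loads(data)
    return {
        "url": entry["url"],
        # `meta` is the nested JSON object carried onto the request
        "meta": entry.get("meta", {}),
        # everything else is treated as optional extra parameters
        "extra": {k: v for k, v in entry.items() if k not in ("url", "meta")},
    }

kwargs = parse_queue_entry(
    b'{"url": "https://example.com", "meta": {"job-id": "123xsd"}, "url_cookie_key": "fertxsas"}'
)
print(kwargs["url"])             # https://example.com
print(kwargs["meta"]["job-id"])  # 123xsd
```

In a real spider these kwargs would feed a FormRequest, whose `meta` and cookies are then available on the request as described above.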

Note

These features cover the basic case of distributing the workload across multiple workers. If you need more features, such as URL expiration or advanced URL prioritization, we suggest you take a look at the Frontera project.
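
For reference, the plug-and-play components listed above are wired up through Scrapy settings. A minimal sketch of a project's settings.py, using setting names as documented by scrapy-redis (the Redis URL is a placeholder):

```python
# Enables scheduling of requests through Redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Ensures all spider instances share the same duplicates filter via Redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the Redis queues when the spider closes, allowing pause/resume.
SCHEDULER_PERSIST = True

# Store scraped items in Redis for distributed post-processing.
ITEM_PIPELINES = {
    "scrapy_redis.pipelines.RedisPipeline": 300,
}

# Connection URL for your Redis server (placeholder value).
REDIS_URL = "redis://localhost:6379"
```

With these settings, any number of spider processes started against the same Redis server will share one request queue and one duplicates filter.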

Requirements

  • Python 3.7+
  • Redis >= 5.0
  • Scrapy >= 2.0
  • redis-py >= 4.0

Installation

From pip

pip install scrapy-redis

From GitHub

git clone https://github.com/darkrho/scrapy-redis.git
cd scrapy-redis
python setup.py install

Note

To use the JSON-supported data feature, make sure scrapy-redis is not installed through pip. If it is, uninstall it first:

pip uninstall scrapy-redis

Alternative Choice

Frontera is a web crawling framework consisting of a crawl frontier and distribution/scaling primitives, allowing you to build large-scale online web crawlers.

More Repositories

1. dirbot-mysql (Python, 117 stars): Scrapy project based on dirbot to show how to use Twisted's adbapi to store the scraped data in MySQL.
2. scrapy-inline-requests (Python, 112 stars): A decorator to write coroutine-like spider callbacks.
3. django-dummyimage (Python, 55 stars): Dynamic dummy image generator for Django.
4. scrapy-boilerplate (Python, 49 stars): Small set of utilities to simplify writing Scrapy spiders.
5. scrapydo (Jupyter Notebook, 46 stars): Crochet-based blocking API for Scrapy.
6. databrewer (Python, 41 stars): The missing datasets manager. Like Homebrew, but for datasets; a CLI tool to search and discover datasets.
7. databrewer-recipes (Python, 21 stars): DataBrewer recipes repository.
8. django-on-tornado (Python, 15 stars): Run Django on the Tornado web server.
9. webfaction-stuff (Python, 9 stars): Random stuff to manage your own WebFaction hosting.
10. parsel-cli (Python, 9 stars): Parsel command-line interface.
11. leveldict (Python, 7 stars): LevelDB dict-like wrappers.
12. cookiecutter-scrapycloud (Python, 7 stars): A bare-minimum Scrapy project template ready for Scrapinghub's Scrapy Cloud service.
13. Facebook-Hacker-Cup-Results (C++, 6 stars)
14. txrho (Python, 6 stars): Misc stuff on top of Twisted/Cyclone.
15. Django-Dash-2010 (JavaScript, 6 stars): Repository for Django Dash 2010.
16. awesome-codename (Makefile, 5 stars): Generate awesome codenames.
17. Random-Code (Python, 4 stars): Random code.
18. dask-avro (Python, 4 stars): Avro reader for Dask.
19. mit-ocw-crawler (Python, 4 stars): MIT's OCW crawler.
20. anaconda-manylinux-builder (Shell, 3 stars): Scripts to build manylinux wheels in Travis CI and upload them to Anaconda.org.
21. persistent-homology-examples (3 stars): Examples of computing the persistent homology of miscellaneous data sets.
22. yatiri (Python, 3 stars)
23. programming-challenges (C++, 3 stars): My attempt to improve my algorithm skills, starting from the basics.
24. dask-kafka (Python, 2 stars): Dask-Kafka reader.
25. dotfiles (Vim Script, 2 stars): My dot files. DEPRECATED; go to https://github.com/rmax/dotfiles-ng.
26. dockerfiles (Shell, 2 stars): Collection of Dockerfiles.
27. scrapy-slidebot (Python, 2 stars): A collection of spiders to download slides as PDFs from popular sites like SlideShare and Speaker Deck.
28. gyst (Python, 2 stars): A Pythonic tool to post gists.
29. haanga-benchs (PHP, 2 stars): Haanga's benchmarks ported over the Tornado framework.
30. scrapyorg-infinit-crawler (Python, 1 star)
31. rmax.github.io (CSS, 1 star)
32. code-katas (Python, 1 star): My code katas.
33. fastavro-codecs (1 star)
34. login_signup (JavaScript, 1 star): Friendly login + signup form.
35. lmbot (1 star)
36. cookiecutter-datapackage (Makefile, 1 star)
37. rmax (1 star)
38. ipynb (1 star): Assorted collection of IPython notebooks.
39. django-ipcountry (Python, 1 star)
40. dask-elasticsearch (Python, 1 star): An Elasticsearch reader for Dask.
41. python-benchmarks (Python, 1 star): Assorted Python-based benchmarks.
42. binary-repr (Python, 1 star): Converts integers to binary representation.
43. django_inline_example (Python, 1 star): Django dynamic inline example.
44. yammh3 (Python, 1 star): Yet another MurmurHash3 bindings.
45. my-django-project-template (CSS, 1 star)
46. pmwiki-authelgg (PHP, 1 star)
47. omp-thread-count (Python, 1 star): A small Python module to get the actual number of threads used by OMP, via Cython bindings.
48. zend-ajax-form-test (PHP, 1 star)
49. rho-blogs-crawler (Python, 1 star): A Scrapy project to export my legacy blogs.
50. dotfiles-ng (Vim Script, 1 star): YADM-managed dot files.