scrapy-deltafetch

Scrapy spider middleware to ignore requests to pages containing items seen in previous crawls

This is a Scrapy spider middleware to ignore requests to pages seen in previous crawls of the same spider, thus producing a "delta crawl" containing only new requests.

This also speeds up the crawl by reducing the number of requests that need to be fetched and processed (typically, item requests are the most CPU-intensive).

The DeltaFetch middleware uses Python's dbm package to store request fingerprints.
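As an illustration of that storage model (a sketch, not DeltaFetch's actual code), a dbm file maps each request fingerprint to a small marker value that persists between crawls:

```python
import dbm
import os
import tempfile

# Illustrative sketch: persist "seen" request fingerprints in a dbm file,
# the way DeltaFetch keeps crawl state between runs. The path and the
# fingerprint value below are made up for the example.
path = os.path.join(tempfile.mkdtemp(), "example_spider")
db = dbm.open(path, "c")  # "c" creates the database if it does not exist

fingerprint = "0f81ecc1b2b1cabf9a9e7cbd60ff540a"  # made-up fingerprint
db[fingerprint] = b"1"    # mark this request as seen
stored = db[fingerprint]  # on a later crawl, a hit here means "skip it"
db.close()
```

A middleware following this scheme would drop any outgoing request whose fingerprint is already present in the database, and record the fingerprints of requests that produced items.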

Installation

Install scrapy-deltafetch using pip:

$ pip install scrapy-deltafetch

Configuration

  1. Add DeltaFetch middleware by including it in SPIDER_MIDDLEWARES in your settings.py file:

    SPIDER_MIDDLEWARES = {
        'scrapy_deltafetch.DeltaFetch': 100,
    }
    

    Here, priority 100 is just an example. Set its value depending on the other middlewares you already have enabled.

  2. Enable the middleware using DELTAFETCH_ENABLED in your settings.py:

    DELTAFETCH_ENABLED = True
    

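Putting both steps together, a minimal settings.py fragment might look like this (the priority 100 is illustrative, as noted above):

```python
# settings.py -- minimal DeltaFetch configuration.
# The priority value is an example, not a required value.
SPIDER_MIDDLEWARES = {
    'scrapy_deltafetch.DeltaFetch': 100,
}
DELTAFETCH_ENABLED = True
```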
Usage

The following options control the behavior of the DeltaFetch middleware.

Supported Scrapy settings

  • DELTAFETCH_ENABLED - enable (or disable) this extension
  • DELTAFETCH_DIR - directory where the state is stored
  • DELTAFETCH_RESET - reset the state, clearing out all seen requests

These usually go in your Scrapy project's settings.py.
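For example, a settings.py fragment using the optional state settings could look like this (the directory name is a hypothetical value, not a default):

```python
# settings.py -- optional DeltaFetch state settings (values are examples)
DELTAFETCH_DIR = 'deltafetch_state'  # hypothetical directory name
DELTAFETCH_RESET = False             # set to True to forget all seen requests
```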

Supported Scrapy spider arguments

  • deltafetch_reset - same effect as the DELTAFETCH_RESET setting

Example:

$ scrapy crawl example -a deltafetch_reset=1

Supported Scrapy request meta keys

  • deltafetch_key - defines the lookup key for that request. By default it is Scrapy's default Request fingerprint, but it can be changed to contain an item id, for example. This requires support from the spider, but makes the extension more efficient for sites that have many URLs for the same item.
  • deltafetch_enabled - if set to False, disables DeltaFetch for that specific request
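A sketch of how a spider might build these meta keys. The build_meta helper and the "item:{id}" key format are illustrative choices for this example, not part of scrapy-deltafetch itself:

```python
# Hypothetical helper a spider could use to populate Request.meta with
# DeltaFetch keys: a stable per-item lookup key, and an opt-out flag.
def build_meta(item_id, enabled=True):
    meta = {"deltafetch_key": f"item:{item_id}"}  # stable key per item
    if not enabled:
        meta["deltafetch_enabled"] = False  # skip DeltaFetch for this request
    return meta

# Inside a spider callback this would be used along the lines of:
#   yield scrapy.Request(url, meta=build_meta(item_id), callback=self.parse_item)
```

Keying on an item id rather than the URL means that two different URLs pointing at the same item share one fingerprint, so the second one is skipped.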

More Repositories

  1. scrapy-splash - Scrapy+Splash for JavaScript integration (Python, 3,148 stars)
  2. scrapy-playwright - 🎭 Playwright integration for Scrapy (Python, 1,002 stars)
  3. scrapy-djangoitem - Scrapy extension to write scraped items using Django models (Python, 498 stars)
  4. scrapy-zyte-smartproxy - Zyte Smart Proxy Manager (formerly Crawlera) middleware for Scrapy (Python, 356 stars)
  5. scrapy-jsonrpc - Scrapy extension to control spiders using JSON-RPC (Python, 297 stars)
  6. scrapy-magicfields - Scrapy middleware to add extra fields to items, like timestamp, response fields, spider attributes, etc. (Python, 56 stars)
  7. scrapy-jsonschema - Scrapy schema validation pipeline and Item builder using JSON Schema (Python, 44 stars)
  8. scrapy-monkeylearn - A Scrapy pipeline to categorize items using MonkeyLearn (Python, 37 stars)
  9. scrapy-zyte-api - Zyte API integration for Scrapy (Python, 35 stars)
  10. scrapy-headless (Python, 29 stars)
  11. scrapy-pagestorage - A Scrapy extension to store request and response information in a storage service (Python, 26 stars)
  12. scrapy-querycleaner - Scrapy spider middleware to clean up query parameters in request URLs (Python, 25 stars)
  13. scrapy-splitvariants - Scrapy spider middleware to split an item into multiple items using a multi-valued key (Python, 20 stars)
  14. scrapy-streaming (Python, 18 stars)
  15. scrapy-dotpersistence - A Scrapy extension to sync the `.scrapy` folder to an S3 bucket (Python, 16 stars)
  16. scrapy-streamitem - Scrapy support for working with streamcorpus Stream Items (Python, 11 stars)
  17. scrapy-crawlera-fetch - Scrapy downloader middleware for the Crawlera Fetch API (Python, 8 stars)
  18. scrapy-feedexporter-sftp (Python, 6 stars)
  19. scrapy-statsd (Python, 6 stars)
  20. scrapy-bigml - Scrapy pipeline for writing items to BigML datasets (Python, 4 stars)
  21. scrapy-spider-metadata (Python, 4 stars)
  22. scrapy-hcf - Scrapy spider middleware to use Scrapinghub's Hub Crawl Frontier as a backend for URLs (Python, 4 stars)
  23. scrapy-snowflake-stage-exporter - Snowflake database loading utility with Scrapy integration (Python, 4 stars)
  24. scrapy-feedexporter-google-drive (Python, 3 stars)
  25. scrapy-feedexporter-azure-storage (Python, 2 stars)
  26. scrapy-feedexporter-onedrive - Export to OneDrive (Python, 1 star)
  27. scrapy-incremental (Python, 1 star)
  28. scrapy-feedexporter-dropbox - Scrapy feed exporter for Dropbox (Python, 1 star)
  29. scrapy-feedexporter-google-sheets (Python, 1 star)