• Stars: 357
• Rank 119,149 (Top 3 %)
• Language: Python
• License: Other
• Created almost 12 years ago
• Updated over 3 years ago

scrapy-mongodb

MongoDB pipeline for Scrapy. This library supports MongoDB in both standalone setups and replica sets. It inserts items into MongoDB as soon as your spider finds data to extract. scrapy-mongodb can also buffer objects if you prefer to write chunks of data to MongoDB rather than one write per document (see the MONGODB_BUFFER_DATA option for details).

INSTALLATION

Dependencies

Read more here.

Instructions

Install via pip:

pip install -r requirements.txt
pip install scrapy-mongodb

CONFIGURATION

Basic configuration

Add these options to your settings.py file:

ITEM_PIPELINES = {
    ...
    'scrapy_mongodb.MongoDBPipeline': 300,
    ...
}

MONGODB_URI = 'mongodb://localhost:27017'
MONGODB_DATABASE = 'scrapy'
MONGODB_COLLECTION = 'my_items'

If you want a unique key in your database, name the key with this option:

MONGODB_UNIQUE_KEY = 'url'
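
As a rough mental model of what a unique key means for repeated items, the sketch below mimics an upsert against an in-memory list. It assumes the pipeline updates an existing document that shares the key instead of inserting a second one; the `upsert` helper and the list-based "collection" are hypothetical and are not scrapy-mongodb or pymongo code.

```python
def upsert(collection, item, unique_key):
    """Replace the stored document sharing item[unique_key], or add the item.

    `collection` is a plain list standing in for a MongoDB collection.
    This is an illustration only, not the library's implementation.
    """
    for index, existing in enumerate(collection):
        if existing.get(unique_key) == item[unique_key]:
            collection[index] = dict(item)  # same key: overwrite in place
            return "updated"
    collection.append(dict(item))  # new key: store a new document
    return "inserted"


store = []
first = upsert(store, {"url": "http://example.com", "title": "old"}, "url")
second = upsert(store, {"url": "http://example.com", "title": "new"}, "url")
print(first, second, len(store))  # inserted updated 1
```

With a unique key of 'url', scraping the same page twice leaves one document per URL in this model.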

Replica sets

You can configure scrapy-mongodb to support MongoDB replica sets by adding the MONGODB_REPLICA_SET option and listing the additional replica set hosts in MONGODB_URI:

MONGODB_REPLICA_SET = 'myReplicaSetName'
MONGODB_URI = 'mongodb://host1.example.com:27017,host2.example.com:27017,host3.example.com:27017'
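
If the list of replica set members comes from somewhere else (an inventory file, for instance), the URI can be assembled in settings.py instead of typed by hand. The `build_mongodb_uri` helper below is a hypothetical sketch, not part of scrapy-mongodb, and it does not handle IPv6 literals or credentials.

```python
def build_mongodb_uri(hosts, default_port=27017):
    """Join replica set members into a single mongodb:// URI.

    `hosts` is a list of "host" or "host:port" strings. Hypothetical
    helper for illustration; not provided by scrapy-mongodb.
    """
    members = []
    for host in hosts:
        # Append the default port only when the host string has none.
        members.append(host if ":" in host else f"{host}:{default_port}")
    return "mongodb://" + ",".join(members)


MONGODB_REPLICA_SET = "myReplicaSetName"
MONGODB_URI = build_mongodb_uri(
    ["host1.example.com", "host2.example.com:27017", "host3.example.com"]
)
print(MONGODB_URI)
# mongodb://host1.example.com:27017,host2.example.com:27017,host3.example.com:27017
```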

If you need to ensure that your data has been replicated, use the MONGODB_REPLICA_SET_W option. It is an implementation of the w parameter in pymongo. Details from the pymongo documentation:

Write operations will block until they have been replicated to the specified number or tagged set of servers. w=<int> always includes the replica set primary (e.g. w=3 means write to the primary and wait until replicated to two secondaries). Passing w=0 disables write acknowledgement and all other write concern options.

Data buffering

To ease the load on MongoDB, scrapy-mongodb has a buffering feature. You can enable it by setting MONGODB_BUFFER_DATA to the buffer size you want. For example, scrapy-mongodb will write 10 documents at a time to the database if you set:

MONGODB_BUFFER_DATA = 10

It is not possible to combine this feature with MONGODB_UNIQUE_KEY, because the update method in pymongo doesn't support multi-document updates.
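
The buffering idea can be sketched in a few lines of plain Python. This is a minimal illustration of the pattern under the assumption that items are collected in memory and written as one batch once the buffer is full; `BufferedWriter` and `write_batch` are hypothetical names, not scrapy-mongodb internals.

```python
class BufferedWriter:
    """Collect items and hand them to `write_batch` in fixed-size chunks."""

    def __init__(self, write_batch, buffer_size):
        self.write_batch = write_batch  # callable that receives a list of items
        self.buffer_size = buffer_size
        self.buffer = []

    def add(self, item):
        self.buffer.append(item)
        # Flush as soon as the buffer reaches the configured size.
        if len(self.buffer) >= self.buffer_size:
            self.flush()

    def flush(self):
        # Write whatever is buffered; a no-op when the buffer is empty.
        if self.buffer:
            self.write_batch(list(self.buffer))
            self.buffer.clear()


batches = []  # stands in for the database in this sketch
writer = BufferedWriter(batches.append, buffer_size=10)
for n in range(25):
    writer.add({"n": n})
writer.flush()  # write the leftover items, e.g. when the spider closes
print([len(batch) for batch in batches])  # [10, 10, 5]
```

In a real pipeline, `write_batch` could be a bulk insert such as pymongo's `Collection.insert_many`. The final `flush()` matters: without it, items left in a partially filled buffer would never be written.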

Timestamps

scrapy-mongodb can add a timestamp to your item when inserting it into the database. Enable this feature with:

MONGODB_ADD_TIMESTAMP = True

This will modify the document to something like this:

{
    ...
    'scrapy-mongodb': {
        'ts': ISODate("2013-01-10T07:43:56.797Z")
    }
    ...
}

The timestamp is in UTC.
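
To show the shape of the resulting document from Python, here is a small sketch that adds a UTC timestamp under the 'scrapy-mongodb' key. The `add_timestamp` function is hypothetical, and using a timezone-aware datetime is an assumption of this example; the library may build the value differently.

```python
import datetime


def add_timestamp(item):
    """Return a copy of `item` with a UTC timestamp under 'scrapy-mongodb'.

    Illustration only; not the pipeline's actual code.
    """
    stamped = dict(item)  # leave the caller's item untouched
    stamped["scrapy-mongodb"] = {
        "ts": datetime.datetime.now(datetime.timezone.utc)
    }
    return stamped


original = {"url": "http://example.com"}
doc = add_timestamp(original)
print(doc["scrapy-mongodb"]["ts"].isoformat())
```

pymongo stores Python datetime values as BSON dates, which the MongoDB shell displays as ISODate(...).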

One collection per spider

It's possible to write data to one collection per spider. To enable that feature, set this option:

MONGODB_SEPARATE_COLLECTIONS = True

Full list of available options

| Parameter | Default | Required? | Description |
| --- | --- | --- | --- |
| MONGODB_DATABASE | scrapy-mongodb | No | Database to use. Does not need to exist. |
| MONGODB_COLLECTION | items | No | Collection within the database to use. Does not need to exist. |
| MONGODB_URI | mongodb://localhost:27017 | No | URI to the MongoDB instance or replica sets you want to connect to. It must start with mongodb:// (see more in the MongoDB docs). E.g.: mongodb://user:pass@host:port, mongodb://user:pass@host:port,host2:port2 |
| MONGODB_UNIQUE_KEY | None | No | If you want to have a unique key in the database, enter the key name here. scrapy-mongodb will ensure the key is properly indexed. |
| MONGODB_BUFFER_DATA | None | No | To ease the load on MongoDB, set this option to the number of items you want to buffer in the client before sending them to database. Setting a MONGODB_UNIQUE_KEY together with MONGODB_BUFFER_DATA is not supported. |
| MONGODB_ADD_TIMESTAMP | False | No | If set to True, scrapy-mongodb will add a timestamp key to the documents. |
| MONGODB_FSYNC | False | No | If set to True, it forces MongoDB to wait for all files to be synced before returning. |
| MONGODB_REPLICA_SET | None | Yes, for replica sets | Set this if you want to enable replica set support. The option should be given the name of the replica sets you want to connect to. MONGODB_URI should point at your config servers. |
| MONGODB_REPLICA_SET_W | 0 | No | Best described in the pymongo docs. Write operations will block until they have been replicated to the specified number or tagged set of servers. w=<int> always includes the replica set primary (e.g. w=3 means write to the primary and wait until replicated to two secondaries). Passing w=0 disables write acknowledgement and all other write concern options. |
| MONGODB_STOP_ON_DUPLICATE | 0 | No | Set this to a value greater than 0 to close the spider when that number of duplicated insertions in MongoDB are detected. If set to 0, this option has no effect. |

Deprecated options

Since scrapy-mongodb 0.5.0

| Parameter | Default | Required? | Description |
| --- | --- | --- | --- |
| MONGODB_HOST | localhost | No | MongoDB host name to connect to. Use MONGODB_URI instead. |
| MONGODB_PORT | 27017 | No | MongoDB port number to connect to. Use MONGODB_URI instead. |
| MONGODB_REPLICA_SET_HOSTS | None | No | Host string to use to connect to the replica set. See the hosts_or_uri option in the pymongo docs. Use MONGODB_URI instead. |

PUBLISHING TO PYPI

make release

CHANGELOG

Read more here.

AUTHOR

This project is maintained by: Sebastian Dahlgren (GitHub | Twitter | LinkedIn).

LICENSE

Read more here.
