

scrapy-djangoitem


scrapy-djangoitem is an extension that allows you to define Scrapy items using existing Django models.

This utility provides a new class, named DjangoItem, that you can use as a regular Scrapy item and link it to a Django model with its django_model attribute. Start using it right away by importing it from this package:

from scrapy_djangoitem import DjangoItem

Installation

Starting with v1.1, both Python 2.7 and Python 3.4/3.5 are supported. For Python 3 you need Scrapy v1.1 or above.

The latest tested Django version is Django 1.9.

Install from PyPI using:

pip install scrapy-djangoitem
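
To pin a specific release (v1.1.1 is the latest version listed in the changelog below):

pip install scrapy-djangoitem==1.1.1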

Introduction

DjangoItem is an item class that gets its field definitions from a Django model: you simply create a DjangoItem subclass and specify which Django model it relates to.

Besides getting the model fields defined on your item, DjangoItem provides a method to create and populate a Django model instance with the item data.

Usage

DjangoItem works much like ModelForms in Django: you create a subclass and define its django_model attribute to be a valid Django model. With this you get an item with a field for each field of the Django model.

In addition, you can define fields that aren't present in the model, and even override fields that are present in the model, by defining them in the item.

Let's see some examples:

Creating a Django model for the examples:

from django.db import models

class Person(models.Model):
    name = models.CharField(max_length=255)
    age = models.IntegerField()

Defining a basic DjangoItem:

from scrapy_djangoitem import DjangoItem

class PersonItem(DjangoItem):
    django_model = Person

DjangoItem works just like Scrapy items:

>>> p = PersonItem()
>>> p['name'] = 'John'
>>> p['age'] = '22'

To obtain the Django model instance from the item, we call the extra DjangoItem.save() method:

>>> person = p.save()
>>> person.name
'John'
>>> person.age
'22'
>>> person.id
1

The model is saved to the database as soon as we call DjangoItem.save(). To obtain an unsaved model instance instead, call it with commit=False:

>>> person = p.save(commit=False)
>>> person.name
'John'
>>> person.age
'22'
>>> person.id
None
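
In a Scrapy project, the save() call typically lives in an item pipeline rather than in the spider. A minimal sketch (the DjangoSavePipeline name is hypothetical, not part of this package):

class DjangoSavePipeline(object):
    """Hypothetical pipeline that persists each scraped DjangoItem."""

    def process_item(self, item, spider):
        # DjangoItem.save() creates, saves and returns the model instance
        item.save()
        return item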

As mentioned before, we can add fields to the item that aren't present in the model:

import scrapy
from scrapy_djangoitem import DjangoItem

class PersonItem(DjangoItem):
    django_model = Person
    sex = scrapy.Field()

>>> p = PersonItem()
>>> p['name'] = 'John'
>>> p['age'] = '22'
>>> p['sex'] = 'M'

And we can override the model's fields with our own:

class PersonItem(DjangoItem):
    django_model = Person
    name = scrapy.Field(default='No Name')

This is useful for providing properties to the field, like a default or any other property that your project uses. Fields of the item that are not present in the model are not taken into account when doing a DjangoItem.save().
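
For example, with the PersonItem variant that adds the extra sex field (shown earlier), the value stays on the item but never reaches the model. A sketch, assuming the earlier definitions:

>>> p = PersonItem(name='John', age='22', sex='M')
>>> person = p.save(commit=False)
>>> person.name
'John'
>>> hasattr(person, 'sex')  # the extra item field is not copied to the model
False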

Caveats

DjangoItem is a rather convenient way to integrate Scrapy projects with Django models, but bear in mind that the Django ORM may not scale well if you scrape a lot of items (i.e. millions) with Scrapy. This is because a relational backend is often not a good choice for write-intensive applications (such as a web crawler), especially if the database is highly normalized and has many indices.
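
If write volume does become a problem, one common mitigation (not specific to this package) is to combine the save(commit=False) call shown above with Django's bulk_create(), flushing unsaved instances in batches. A sketch with a hypothetical pipeline, assuming every item maps to the Person model:

from myapp.models import Person

class BulkSavePipeline(object):
    """Hypothetical pipeline that batches INSERTs to reduce write load."""
    batch_size = 1000

    def __init__(self):
        self.buffer = []

    def process_item(self, item, spider):
        # commit=False returns an unsaved model instance (see Usage above)
        self.buffer.append(item.save(commit=False))
        if len(self.buffer) >= self.batch_size:
            self.flush()
        return item

    def flush(self):
        # One INSERT per batch instead of one per item
        Person.objects.bulk_create(self.buffer)
        self.buffer = []

    def close_spider(self, spider):
        if self.buffer:
            self.flush()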

Setup

To use the Django models outside the Django application you need to set the DJANGO_SETTINGS_MODULE environment variable and, in most cases, modify the PYTHONPATH environment variable so that the settings module can be imported.

There are many ways to do this depending on your use case and preferences; one of the simplest ways is detailed below.

Suppose your Django project is named mysite, is located in the path /home/projects/mysite, and you have created an app myapp with the model Person. That means your directory structure is something like this:

/home/projects/mysite
├── manage.py
├── myapp
│   ├── __init__.py
│   ├── models.py
│   ├── tests.py
│   └── views.py
└── mysite
    ├── __init__.py
    ├── settings.py
    ├── urls.py
    └── wsgi.py

Then you need to add /home/projects/mysite to the PYTHONPATH environment variable and set DJANGO_SETTINGS_MODULE to mysite.settings. That can be done in your Scrapy project's settings file by adding the lines below:

import sys
sys.path.append('/home/projects/mysite')

import os
os.environ['DJANGO_SETTINGS_MODULE'] = 'mysite.settings'

Notice that we modify the sys.path variable instead of the PYTHONPATH environment variable because we are already within the Python runtime. If everything is set up correctly, you should be able to start the scrapy shell command and import the model Person (i.e. from myapp.models import Person).

Starting with Django 1.8, you also have to explicitly set up Django if you use it outside a manage.py context (see the Django docs):

import django
django.setup()
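
Putting the pieces together, the top of your Scrapy project's settings.py might look like this (a sketch, assuming the mysite layout above):

import os
import sys

# Make the Django project importable and point Django at its settings
sys.path.append('/home/projects/mysite')
os.environ['DJANGO_SETTINGS_MODULE'] = 'mysite.settings'

# Required outside a manage.py context starting with Django 1.8
import django
django.setup()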

Development

The test suite in the tests directory can be run using tox:

tox

...using the configuration in tox.ini. The Python interpreters used have to be installed locally on the system.
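
To run the suite against a single interpreter instead of the whole matrix, you can use tox's -e flag (the environment name depends on what tox.ini defines, so py35 here is an assumption):

tox -e py35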

Changelog

v1.1.1 (2016-05-04)

  • Distribute as universal wheel
  • Fix README's markup

v1.1 (2016-05-04)

  • Python 3.4/3.5 support
  • Making tests work with Django 1.9 again

v1.0 (2015-04-29)

  • Initial version

More Repositories

1. scrapy-splash: Scrapy+Splash for JavaScript integration (Python, 3,115 stars)
2. scrapy-playwright: 🎭 Playwright integration for Scrapy (Python, 944 stars)
3. scrapy-zyte-smartproxy: Zyte Smart Proxy Manager (formerly Crawlera) middleware for Scrapy (Python, 354 stars)
4. scrapy-jsonrpc: Scrapy extension to control spiders using JSON-RPC (Python, 296 stars)
5. scrapy-deltafetch: Scrapy spider middleware to ignore requests to pages containing items seen in previous crawls (Python, 264 stars)
6. scrapy-magicfields: Scrapy middleware to add extra fields to items, like timestamps, response fields, spider attributes, etc. (Python, 55 stars)
7. scrapy-jsonschema: Scrapy schema validation pipeline and Item builder using JSON Schema (Python, 44 stars)
8. scrapy-monkeylearn: A Scrapy pipeline to categorize items using MonkeyLearn (Python, 37 stars)
9. scrapy-zyte-api: Zyte API integration for Scrapy (Python, 33 stars)
10. scrapy-headless (Python, 29 stars)
11. scrapy-pagestorage: A Scrapy extension to store request and response information in a storage service (Python, 26 stars)
12. scrapy-querycleaner: Scrapy spider middleware to clean up query parameters in request URLs (Python, 25 stars)
13. scrapy-splitvariants: Scrapy spider middleware to split an item into multiple items using a multi-valued key (Python, 20 stars)
14. scrapy-streaming (Python, 17 stars)
15. scrapy-dotpersistence: A Scrapy extension to sync the `.scrapy` folder to an S3 bucket (Python, 16 stars)
16. scrapy-streamitem: Scrapy support for working with streamcorpus Stream Items (Python, 11 stars)
17. scrapy-crawlera-fetch: Scrapy downloader middleware for the Crawlera Fetch API (Python, 8 stars)
18. scrapy-feedexporter-sftp (Python, 6 stars)
19. scrapy-statsd (Python, 6 stars)
20. scrapy-bigml: Scrapy pipeline for writing items to BigML datasets (Python, 4 stars)
21. scrapy-spider-metadata (Python, 4 stars)
22. scrapy-hcf: Scrapy spider middleware to use Scrapinghub's Hub Crawl Frontier as a backend for URLs (Python, 4 stars)
23. scrapy-snowflake-stage-exporter: Snowflake database loading utility with Scrapy integration (Python, 4 stars)
24. scrapy-feedexporter-google-drive (Python, 3 stars)
25. scrapy-feedexporter-azure-storage (Python, 2 stars)
26. scrapy-feedexporter-onedrive: Export to OneDrive (Python, 1 star)
27. scrapy-incremental (Python, 1 star)
28. scrapy-feedexporter-dropbox: Scrapy feed exporter for Dropbox (Python, 1 star)
29. scrapy-feedexporter-google-sheets (Python, 1 star)