  • Stars: 117
  • Rank: 301,828 (Top 6 %)
  • Language: Python
  • Created over 14 years ago
  • Updated about 11 years ago

Repository Details

Scrapy project based on dirbot to show how to use Twisted's adbapi to store the scraped data in MySQL.

dirbot

This is a Scrapy project to scrape websites from public web directories.

This project is only meant for educational purposes.

Items

The items scraped by this project are websites, and the item is defined in the class:

dirbot.items.Website

See the source code for more details.

Spiders

This project contains one spider called dmoz that you can see by running:

scrapy list

Spider: dmoz

The dmoz spider scrapes the Open Directory Project (dmoz.org) and is based on the dmoz spider described in the Scrapy tutorial.

This spider doesn't crawl the entire dmoz.org site. By default it visits only two pages, which are defined in the start_pages attribute.

So, if you run the spider normally (with scrapy crawl dmoz), it will scrape only those two pages.

Pipelines

Filtering by words

A pipeline to filter out websites containing certain forbidden words in their description. This pipeline is defined in the class:

dirbot.pipelines.FilterWordsPipeline
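
As a rough illustration of the idea (not the project's actual source), a word-filtering pipeline could look like the sketch below. The word list and the "description" field name are assumptions, and DropItem is a local stand-in for scrapy.exceptions.DropItem so the snippet runs without Scrapy installed.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem (assumption for this sketch)."""


class FilterWordsPipeline:
    """Drop items whose description contains a forbidden word."""

    # Hypothetical word list; the project's real list may differ.
    words_to_filter = ["politics", "religion"]

    def process_item(self, item, spider):
        # Compare case-insensitively; a missing description counts as empty.
        description = str(item.get("description") or "").lower()
        for word in self.words_to_filter:
            if word in description:
                raise DropItem("Contains forbidden word: %s" % word)
        return item
```

In a real Scrapy project the class would raise Scrapy's own DropItem and be enabled through the ITEM_PIPELINES setting.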

Requiring certain item fields

A pipeline to discard items that lack certain fields. This pipeline is defined in the class:

dirbot.pipelines.RequiredFieldsPipeline
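
A minimal sketch of how such a check might be written follows. The field names are hypothetical (see dirbot.items.Website for the real ones), and DropItem is again a local stand-in for scrapy.exceptions.DropItem so the snippet runs on its own.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem (assumption for this sketch)."""


class RequiredFieldsPipeline:
    """Discard items that are missing any required field."""

    # Hypothetical field names, chosen only for illustration.
    required_fields = ("name", "description", "url")

    def process_item(self, item, spider):
        for field in self.required_fields:
            # Treat absent, None, and empty values as missing.
            if not item.get(field):
                raise DropItem("Field '%s' missing" % field)
        return item
```

Treating empty values as missing is a design choice of this sketch; a stricter or looser rule may suit other projects better.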

Storing items in a MySQL database

A pipeline to store (insert or update) scraped items in a MySQL database. This pipeline is defined in the class:

dirbot.pipelines.MySQLStorePipeline

The database schema is defined in db/mysql.sql and the settings file contains the default MYSQL_* settings values. The scraped items will be stored in the website database table.

Note

The database schema must be set up before running the spider.
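
The storage step can be sketched roughly as below. This is not the project's code: the column names, the md5-of-URL key, the individual MYSQL_* setting names, and the use of MySQL's INSERT ... ON DUPLICATE KEY UPDATE are all assumptions made for illustration. Only the website table name comes from the description above. Twisted is imported lazily so the SQL helper can be tried without Twisted or a MySQL driver installed.

```python
import hashlib
import logging

logger = logging.getLogger(__name__)


def build_upsert(item):
    """Return (sql, params) for an insert-or-update on the website table.

    Column names are hypothetical; the real schema lives in db/mysql.sql.
    """
    sql = (
        "INSERT INTO website (guid, name, description, url) "
        "VALUES (%s, %s, %s, %s) "
        "ON DUPLICATE KEY UPDATE name=VALUES(name), "
        "description=VALUES(description), url=VALUES(url)"
    )
    # Assumption: a stable key derived from the URL identifies a website.
    guid = hashlib.md5(item["url"].encode("utf-8")).hexdigest()
    return sql, (guid, item["name"], item["description"], item["url"])


class MySQLStorePipeline:
    """Store items through a Twisted adbapi connection pool (sketch)."""

    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_crawler(cls, crawler):
        # Lazy import: only needed when running inside Scrapy/Twisted.
        from twisted.enterprise import adbapi

        settings = crawler.settings
        # Hypothetical setting names following the MYSQL_* convention.
        dbpool = adbapi.ConnectionPool(
            "MySQLdb",
            host=settings["MYSQL_HOST"],
            db=settings["MYSQL_DBNAME"],
            user=settings["MYSQL_USER"],
            passwd=settings["MYSQL_PASSWD"],
            charset="utf8",
            use_unicode=True,
        )
        return cls(dbpool)

    def process_item(self, item, spider):
        # runInteraction runs _upsert in a pool thread inside a transaction.
        d = self.dbpool.runInteraction(self._upsert, item)
        d.addErrback(self._log_failure, item)
        # Always hand the item on to the next pipeline stage.
        d.addBoth(lambda _: item)
        return d

    def _upsert(self, tx, item):
        sql, params = build_upsert(item)
        tx.execute(sql, params)

    def _log_failure(self, failure, item):
        logger.error("Failed to store %r: %s", item, failure.getErrorMessage())
```

Returning the Deferred from process_item lets Scrapy wait for the database write without blocking the reactor, which is the main reason to use adbapi here.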
