rmax/dirbot-mysql

Stars: 117
Rank: 301,828 (Top 6 %)
Language: Python
Created over 14 years ago
Updated about 11 years ago


Scrapy project based on dirbot to show how to use Twisted's adbapi to store the scraped data in MySQL.

dirbot

This is a Scrapy project to scrape websites from public web directories.

This project is only meant for educational purposes.

Items

The items scraped by this project are websites, and the item is defined in the class:

dirbot.items.Website

See the source code for more details.

Spiders

This project contains one spider called dmoz that you can see by running:

scrapy list

Spider: dmoz

The dmoz spider scrapes the Open Directory Project (dmoz.org) and is based on the dmoz spider described in the Scrapy tutorial.

This spider doesn't crawl the entire dmoz.org site but only two pages by default (defined in the start_pages attribute).

So, if you run the spider normally (with scrapy crawl dmoz), it will scrape only those two pages.

Pipelines

Filtering by words

A pipeline to filter out websites containing certain forbidden words in their description. This pipeline is defined in the class:

dirbot.pipelines.FilterWordsPipeline
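
As a rough illustration of the idea, a word-filtering pipeline could look like the sketch below. It is not the project's actual code: the class name FilterWordsSketch, the word list, and the use of plain dicts as items are assumptions made for the example. The local DropItem class stands in for scrapy.exceptions.DropItem so the snippet runs without Scrapy installed.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem (assumption for this sketch)."""


class FilterWordsSketch:
    """Hypothetical pipeline that drops items whose description has a forbidden word."""

    # Hypothetical word list; the real project's list is not shown in this README.
    forbidden_words = ("politics", "religion")

    def process_item(self, item, spider=None):
        # Compare case-insensitively against the item's description text.
        description = str(item.get("description", "")).lower()
        for word in self.forbidden_words:
            if word in description:
                raise DropItem(f"Contains forbidden word: {word}")
        # Items that pass the check continue to the next pipeline.
        return item
```

In a Scrapy project, a class of this shape would be enabled through the ITEM_PIPELINES setting, and raising DropItem stops the item from reaching later pipelines.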

Requiring certain item fields

A pipeline to discard items that lack certain fields. This pipeline is defined in the class:

dirbot.pipelines.RequiredFieldsPipeline
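
A minimal sketch of this kind of check is shown below. The class name RequiredFieldsSketch and the field names (name, url, description) are assumptions for illustration, not taken from the project's source, and DropItem is again a local stand-in for scrapy.exceptions.DropItem.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem (assumption for this sketch)."""


class RequiredFieldsSketch:
    """Hypothetical pipeline that drops items with missing or empty required fields."""

    # Assumed field names; check dirbot.items.Website for the real ones.
    required_fields = ("name", "url", "description")

    def process_item(self, item, spider=None):
        for field in self.required_fields:
            # Treat absent keys and empty values alike as "missing".
            if not item.get(field):
                raise DropItem(f"Field '{field}' missing")
        return item
```

Treating empty values as missing is a design choice of this sketch; a stricter or looser rule may suit other projects.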

Storing items in a MySQL database

A pipeline to store (insert or update) scraped items in a MySQL database. This pipeline is defined in the class:

dirbot.pipelines.MySQLStorePipeline

The database schema is defined in db/mysql.sql and the settings file contains the default MYSQL_* settings values. The scraped items will be stored in the website database table.

Note

The database schema must be set up before running the spider.
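
The "insert or update" step can be sketched as follows. This is one possible approach, not necessarily what MySQLStorePipeline does: it assumes columns named name, url, and description, assumes url carries a unique key, and relies on MySQL's INSERT ... ON DUPLICATE KEY UPDATE. The function names are hypothetical. Only the website table name comes from this README.

```python
def build_upsert(item, table="website"):
    """Build a parameterized MySQL upsert for one item (column names are assumed)."""
    columns = ("name", "url", "description")
    col_list = ", ".join(columns)
    placeholders = ", ".join(["%s"] * len(columns))
    # On a duplicate unique key (assumed: url), refresh the other columns.
    updates = ", ".join(f"{c} = VALUES({c})" for c in columns if c != "url")
    sql = (
        f"INSERT INTO {table} ({col_list}) VALUES ({placeholders}) "
        f"ON DUPLICATE KEY UPDATE {updates}"
    )
    params = tuple(item.get(c) for c in columns)
    return sql, params


def upsert_item(tx, item):
    """Run the upsert on a cursor-like object that exposes execute(sql, params)."""
    sql, params = build_upsert(item)
    tx.execute(sql, params)
```

With Twisted, a function shaped like upsert_item would typically be passed to an adbapi ConnectionPool's runInteraction method, which calls it with a transaction object in a worker thread so the database write does not block the crawl.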

scrapy-redis

Redis-based components for Scrapy.

scrapy-inline-requests

A decorator to write coroutine-like spider callbacks.

django-dummyimage

Dynamic Dummy Image Generator For Django!

scrapy-boilerplate

Small set of utilities to simplify writing Scrapy spiders.

scrapydo

Crochet-based blocking API for Scrapy.

databrewer

The missing datasets manager. Like Homebrew but for datasets. A CLI tool to search and discover datasets!

databrewer-recipes

DataBrewer Recipes Repository.

django-on-tornado

Run django on tornado webserver

webfaction-stuff

random stuff to manage your own webfaction hosting

parsel-cli

Parsel Command Line Interface

leveldict

LevelDB dict-like wrappers.

cookiecutter-scrapycloud

A bare minimum Scrapy project template ready for Scrapinghub's Scrapy Cloud service.

Facebook-Hacker-Cup-Results

txrho

misc stuff on top of twisted/cyclone

Django-Dash-2010

Repository for Django Dash 2010

awesome-codename

Generate awesome codenames

Random-Code

dask-avro

Avro reader for Dask.

mit-ocw-crawler

MIT's OCW Crawler

anaconda-manylinux-builder

Scripts to build manylinux wheels in Travis CI and upload them in Anaconda.org

persistent-homology-examples

Examples of computing the persistent homology of miscellaneous data sets.

yatiri

programming-challenges

My attempt to improve my algorithm skills. Starting from basic.

dask-kafka

Dask-Kafka reader

dotfiles

My dot files. DEPRECATED. Go -> https://github.com/rmax/dotfiles-ng

dockerfiles

Collection of dockerfiles.

scrapy-slidebot

A collection of Spiders to download slides as PDFs from popular sites like slideshare and speakerdeck.

gyst

A pythonic tool to post gists

haanga-benchs

Haanga's benchmarks port over Tornado Framework

scrapyorg-infinit-crawler

rmax.github.io

code-katas

fastavro-codecs

login_signup

friendly login+signup form

lmbot

cookiecutter-datapackage

rmax

ipynb

Assorted collection of iPython notebooks.

django-ipcountry

dask-elasticsearch

An Elasticsearch reader for Dask

python-benchmarks

Assorted python-based benchmarks

binary-repr

Converts integers to binary representation.

django_inline_example

django dynamic inline example

yammh3

Yet another Murmurhash3 bindings.

my-django-project-template

pmwiki-authelgg

omp-thread-count

A small Python module to get the actual number of threads used by OMP via Cython bindings.

zend-ajax-form-test

rho-blogs-crawler

A Scrapy project to export my legacy blogs

dotfiles-ng

YADM-managed dot files