  • Stars: 185
  • Rank: 208,271 (Top 5%)
  • Language: Python
  • Created over 13 years ago
  • Updated over 1 year ago

Repository Details

Pythonic Crawling / Scraping Framework based on Non Blocking I/O operations.

Pythonic Crawling / Scraping Framework Built on Eventlet


Features

  • High-speed web crawler built on Eventlet.
  • Supports relational database engines such as PostgreSQL, MySQL, Oracle, and SQLite.
  • Supports NoSQL databases such as MongoDB and CouchDB. New!
  • Export your data into JSON, XML, or CSV formats. New!
  • Command line tools.
  • Extract data using your favourite tool: XPath or PyQuery (a jQuery-like library for Python).
  • Cookie Handlers.
  • Very easy to use (see the example).
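
As a rough picture of what exporting scraped data to JSON or CSV involves, here is a minimal standard-library sketch. It does not use crawley's own exporters: the `records` list and the `to_json` / `to_csv` helpers are hypothetical, and the field names are borrowed from the Package model in the example further down.

```python
import csv
import io
import json

# Hypothetical scraped records; field names mirror the example Package model.
records = [
    {"updated": "2012-01-01", "package": "crawley", "description": "Crawling framework"},
    {"updated": "2012-01-02", "package": "pyfb", "description": "Facebook Graph API client"},
]

def to_json(rows):
    """Serialize the rows as a JSON array."""
    return json.dumps(rows, indent=2)

def to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["updated", "package", "description"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

json_text = to_json(records)
csv_text = to_csv(records)
```

Both helpers return strings, so the caller decides whether to write them to a file or send them elsewhere.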

Documentation

http://packages.python.org/crawley/

Project WebSite

http://project.crawley-cloud.com/


To install crawley run

~$ python setup.py install

or from pip

~$ pip install crawley

To start a new project run

~$ crawley startproject [project_name]
~$ cd [project_name]

Write your Models

""" models.py """

from crawley.persistance import Entity, UrlEntity, Field, Unicode

class Package(Entity):
    
    #add your table fields here
    updated = Field(Unicode(255))    
    package = Field(Unicode(255))
    description = Field(Unicode(255))
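
To make the model more concrete, the sketch below shows the kind of table such an Entity could correspond to, written with the standard library's sqlite3 module rather than crawley's persistence layer. The table name `package`, the `id` column, the in-memory database, and the sample row are all assumptions made for illustration.

```python
import sqlite3

# In-memory database so the sketch leaves no files behind.
connection = sqlite3.connect(":memory:")

# Assumed layout: one VARCHAR(255) column per Field(Unicode(255)),
# plus an assumed integer primary key.
connection.execute(
    """
    CREATE TABLE package (
        id INTEGER PRIMARY KEY,
        updated VARCHAR(255),
        package VARCHAR(255),
        description VARCHAR(255)
    )
    """
)

# Hypothetical row, comparable to creating one Package instance.
connection.execute(
    "INSERT INTO package (updated, package, description) VALUES (?, ?, ?)",
    ("2012-01-01", "crawley", "Crawling framework"),
)
connection.commit()

rows = connection.execute(
    "SELECT updated, package, description FROM package"
).fetchall()
```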

Write your Scrapers

""" crawlers.py """

from crawley.crawlers import BaseCrawler
from crawley.scrapers import BaseScraper
from crawley.extractors import XPathExtractor
from models import *

class pypiScraper(BaseScraper):
    
    #specify the urls that can be scraped by this class
    matching_urls = ["%"]
    
    def scrape(self, response):
                        
        #getting the current document's url.
        current_url = response.url        
        #getting the html table.
        table = response.html.xpath("/html/body/div[5]/div/div/div[3]/table")[0]
        
        #for rows 1 to n-1
        for tr in table[1:-1]:
                        
            #obtaining the searched html inside the rows
            td_updated = tr[0]
            td_package = tr[1]
            package_link = td_package[0]
            td_description = tr[2]
            
            #storing data in Packages table
            Package(updated=td_updated.text, package=package_link.text, description=td_description.text)


class pypiCrawler(BaseCrawler):
    
    #add your starting urls here
    start_urls = ["http://pypi.python.org/pypi"]
    
    #add your scraper classes here    
    scrapers = [pypiScraper]
    
    #specify your maximum crawling depth level    
    max_depth = 0
    
    #select your favourite HTML parsing tool
    extractor = XPathExtractor
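
The row-walking logic in `scrape` relies on element indexing: `table[1:-1]` skips the first and last rows, `tr[n]` picks a cell, and `td_package[0]` is the link inside that cell. The sketch below reproduces the same idea with the standard library's xml.etree.ElementTree instead of crawley's response object. The sample markup and the `extract_rows` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup standing in for the scraped page's table.
SAMPLE = """
<table>
  <tr><th>Updated</th><th>Package</th><th>Description</th></tr>
  <tr><td>2012-01-01</td><td><a href="/pypi/crawley">crawley</a></td><td>Crawling framework</td></tr>
  <tr><td>2012-01-02</td><td><a href="/pypi/pyfb">pyfb</a></td><td>Facebook client</td></tr>
  <tr><td colspan="3">footer</td></tr>
</table>
"""

def extract_rows(table):
    """Return (updated, package, description) tuples for rows 1 to n-1."""
    results = []
    # Slicing an element yields its child elements; [1:-1] drops the
    # header row and the trailing footer row.
    for tr in table[1:-1]:
        td_updated, td_package, td_description = tr[0], tr[1], tr[2]
        package_link = td_package[0]  # the <a> element inside the cell
        results.append((td_updated.text, package_link.text, td_description.text))
    return results

rows = extract_rows(ET.fromstring(SAMPLE.strip()))
```

Absolute paths such as `/html/body/div[5]/...` break when the page layout changes, so a real scraper may prefer a selector tied to an id or class if the page offers one.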

Configure your settings

""" settings.py """

import os 
PATH = os.path.dirname(os.path.abspath(__file__))

#Don't change this unless you have renamed the project
PROJECT_NAME = "pypi"
PROJECT_ROOT = os.path.join(PATH, PROJECT_NAME)

DATABASE_ENGINE = 'sqlite'     
DATABASE_NAME = 'pypi'  
DATABASE_USER = ''             
DATABASE_PASSWORD = ''         
DATABASE_HOST = ''             
DATABASE_PORT = ''     

SHOW_DEBUG_INFO = True
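
As a rough illustration of how DATABASE_* values like these are commonly combined, the sketch below builds a SQLAlchemy-style connection URL from them. The `build_database_url` helper is hypothetical and not part of crawley, and the PostgreSQL credentials are placeholders. How crawley itself consumes these settings is not shown in this README.

```python
def build_database_url(engine, name, user="", password="", host="", port=""):
    """Build a SQLAlchemy-style URL (hypothetical helper, not crawley's API)."""
    if engine == "sqlite":
        # SQLite needs no credentials or host: sqlite:///<file path>
        return "sqlite:///%s" % name
    credentials = user
    if password:
        credentials += ":" + password
    location = host
    if port:
        location += ":" + str(port)
    if credentials:
        location = credentials + "@" + location
    # Note: special characters in user or password are not escaped here.
    return "%s://%s/%s" % (engine, location, name)

# Values matching the settings above.
sqlite_url = build_database_url("sqlite", "pypi")

# Placeholder values for a server-based engine.
postgres_url = build_database_url(
    "postgresql", "pypi", user="crawler", password="secret",
    host="localhost", port="5432",
)
```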

Finally, just run the crawler

~$ crawley run

More Repositories

  • pyfb (Python, 72 stars): A Python Interface for the Facebook Graph API
  • data-layer-generator (Python, 53 stars): A data layer generator for Python domain objects
  • node-simple-chat (JavaScript, 26 stars): A Real Time Chat Built on Node.js and Socket.io
  • filenergy (JavaScript, 10 stars): File sharing tool written in Python
  • simple-web-browser (Python, 9 stars): A Simple Fully Functional Web Browser implemented over Qt and QtWebKit.
  • py2nsis (Python, 8 stars): An NSIS installer generator for Python projects
  • PyMusic (Python, 6 stars): A 100% Python Open Source Music Player!
  • django_deployment (Python, 6 stars): Easy deploy for your Django apps
  • hackersprojects (JavaScript, 5 stars): http://www.hackersprojects.com/ website code
  • elixir (Python, 5 stars): Elixir's git mirror from http://elixir.ematia.de/svn
  • django_conventions (Python, 4 stars): Django Convention Over Configuration Routing Plugin
  • pyTateti (Python, 3 stars): A Classic 3-in-Line game with an Invincible AI
  • scrabbly-cloud (Python, 3 stars): Scrabble online platform built on top of the scrabbly engine
  • erequests (Python, 3 stars): Requests + Eventlet
  • scrabbly (Python, 3 stars): Scrabble game engine implemented in pure Python
  • backbone-tastypie-example (JavaScript, 3 stars): Django Tastypie and Backbone app example
  • crawley-ruby (Ruby, 3 stars): A Ruby implementation of the crawley framework
  • potion (Python, 2 stars): Elixir-like declarative layer for SQLAlchemy
  • rq_stop_job (Python, 2 stars): RQ stoppable job for Django-RQ
  • node-proxy-server (JavaScript, 2 stars): An HTTP Proxy Server with customizable request handlers running over Node.js.
  • sciencecombinator (CSS, 2 stars): The Science Videos Combinator
  • Tweet-Bot (Python, 2 stars): A bot capable of randomly choosing from a list and tweeting periodically
  • crawley3-toolbox (Python, 2 stars): Crawley Toolbox for Python 3
  • bitcoin_android (Python, 1 star): Convert Bitcoin on Android
  • nttp-server (C, 1 star): NTTP/HTTP Distributed Server/Clients
  • math-chat (JavaScript, 1 star): A web chat application that allows users to write LaTeX commands to draw complex mathematical symbols.
  • hackathonBB (Python, 1 star): BB Jam Sessions
  • coffeefb (CoffeeScript, 1 star): Facebook Graph API wrapper for CoffeeScript
  • jcaching (Java, 1 star): Java caching framework
  • mandelbrot.jl (Julia, 1 star): Flask/Sinatra-like Micro Framework
  • crypto_currencies_conversor (Java, 1 star): Crypto Currencies Conversor for Android
  • flickr_scripts (Python, 1 star): Scripts to upload and manage photos on Flickr
  • pycaching (Python, 1 star): JCaching framework implementation in Python
  • django_content_types_example (Python, 1 star): Content types framework [https://docs.djangoproject.com/en/dev/ref/contrib/contenttypes/] example
  • RbFb (Ruby, 1 star): Ruby wrapper for the Facebook Graph API
  • jmg.github.com (HTML, 1 star): My Web Site on GitHub
  • py-mini-orm (Python, 1 star): Another Python ORM
  • builder (JavaScript, 1 star): A social network for building projects