  • Stars: 3,061
  • Rank 14,722 (Top 0.3 %)
  • Language: Python
  • Created almost 11 years ago
  • Updated about 1 year ago


Repository Details

Multifarious Scrapy examples. Spiders for alexa / amazon / douban / douyu / github / linkedin etc.

scrapy-examples

A varied collection of Scrapy examples with integrated proxies and user agents, which make writing a spider more comfortable.

Don't use it to do anything illegal!


Real spider example: doubanbook

Tutorial

git clone https://github.com/geekan/scrapy-examples
cd scrapy-examples/doubanbook
scrapy crawl doubanbook

Depth

The spider crawls several depths and extracts the actual book data at depth 2.

  • Depth 0: The entrance is http://book.douban.com/tag/
  • Depth 1: URLs like http://book.douban.com/tag/外国文学, collected from depth 0
  • Depth 2: URLs like http://book.douban.com/subject/1770782/, collected from depth 1
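
The three depths above can be told apart by URL shape alone. The following is a minimal sketch using only the standard library; the regular expressions and the url_depth helper are assumptions inferred from the example URLs, not code from the doubanbook spider.

```python
import re

# Hypothetical URL patterns, inferred from the three example URLs above.
DEPTH_PATTERNS = [
    (0, re.compile(r"^https?://book\.douban\.com/tag/?$")),
    (1, re.compile(r"^https?://book\.douban\.com/tag/[^/?#]+/?$")),
    (2, re.compile(r"^https?://book\.douban\.com/subject/\d+/?$")),
]


def url_depth(url):
    """Return the crawl depth for a URL, or None if it matches no pattern."""
    for depth, pattern in DEPTH_PATTERNS:
        if pattern.match(url):
            return depth
    return None


if __name__ == "__main__":
    for url in (
        "http://book.douban.com/tag/",
        "http://book.douban.com/tag/外国文学",
        "http://book.douban.com/subject/1770782/",
    ):
        print(url_depth(url), url)
```

In a real spider the same patterns would more likely appear as link-extractor rules; here they only make the depth structure explicit.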



Available Spiders

  • tutorial
    • dmoz_item
    • douban_book
    • page_recorder
    • douban_tag_book
  • doubanbook
  • linkedin
  • hrtencent
  • sis
  • zhihu
  • alexa
    • alexa
    • alexa.cn

Advanced

  • Use parse_with_rules to write a spider quickly.
    See dmoz spider for more details.

  • Proxies

    • If you don't want to use a proxy, just comment out the proxy middleware in settings.
    • If you want to customize it, edit misc/proxy.py yourself.
  • Notice

    • Don't use parse as the name of your callback method; CrawlSpider uses parse internally to apply its rules.
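
The proxy bullet above could look roughly like the settings.py fragment below. DOWNLOADER_MIDDLEWARES is a standard Scrapy setting, but the two middleware paths are assumptions made for illustration; check the project's own settings.py for the real names.

```python
# settings.py (sketch). The middleware paths are hypothetical placeholders.
DOWNLOADER_MIDDLEWARES = {
    # Commented out, so the crawl runs without a proxy.
    # Uncomment this line to route requests through the proxy middleware.
    # "misc.middleware.CustomHttpProxyMiddleware": 400,
    "misc.middleware.CustomUserAgentMiddleware": 401,
}
```

With the proxy line commented out, only the remaining middleware is registered, so requests go out directly.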

Advanced Usage

  • Run ./startproject.sh <PROJECT> to start a new project.
    It generates most of the project automatically; the only files left to write are:
    • PROJECT/PROJECT/items.py
    • PROJECT/PROJECT/spider/spider.py

Example: hacking items.py and spider.py

Hacked items.py with additional fields url and description:

from scrapy.item import Item, Field

class exampleItem(Item):
    url = Field()
    name = Field()
    description = Field()

Hacked spider.py with start rules and CSS rules (only the class exampleSpider is shown here):

class exampleSpider(CommonSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/",
    ]
    # The crawler starts from start_urls and follows the URLs allowed by the rules below.
    rules = [
        # The callback must not be named 'parse': CrawlSpider uses parse() internally.
        Rule(sle(allow=["/Arts/", "/Games/"]), callback='parse_page', follow=True),
    ]

    css_rules = {
        '.directory-url li': {
            '__use': 'dump',  # dump data directly
            '__list': True,  # it's a list
            'url': 'li > a::attr(href)',
            'name': 'a::text',
            'description': 'li::text',
        }
    }

    def parse_page(self, response):
        info('Parse ' + response.url)
        # parse_with_rules is implemented here:
        #   https://github.com/geekan/scrapy-examples/blob/master/misc/spider.py
        # Returning the result assumes parse_with_rules returns the extracted items.
        return self.parse_with_rules(response, self.css_rules, exampleItem)
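
The comments in css_rules suggest a convention: keys starting with a double underscore are options, and every other key maps an item field to a CSS selector. The sketch below only illustrates that reading of the dictionary. split_rule is a hypothetical helper, not part of the repository, and this is not how parse_with_rules itself is implemented.

```python
def split_rule(rule):
    """Split one css_rules entry into (options, field_selectors).

    Assumption: keys prefixed with "__" are options, the rest are fields.
    """
    options = {k: v for k, v in rule.items() if k.startswith("__")}
    fields = {k: v for k, v in rule.items() if not k.startswith("__")}
    return options, fields


css_rules = {
    ".directory-url li": {
        "__use": "dump",
        "__list": True,
        "url": "li > a::attr(href)",
        "name": "a::text",
        "description": "li::text",
    }
}

for container, rule in css_rules.items():
    options, fields = split_rule(rule)
    # The container selector scopes the field selectors listed under it.
    print(container, options, sorted(fields))
```

Read this way, the field names url, name and description line up with the fields declared in exampleItem.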
