• Stars: 150
• Rank: 242,390 (Top 5%)
• Language: Python
• License: MIT License
• Created over 12 years ago
• Updated about 11 years ago

Repository Details

A scrapy-based Hacker News crawler.

HNCrawl

Introduction

HNCrawl is a tiny, simple Scrapy-based crawler that grabs the HTML content of the pages linked from the front page of Hacker News.

Examples

Installation

$ pip install scrapy
$ git clone git@github.com:mvanveen/hncrawl.git

Scraping

Note: The Crawl-Delay value in the HN robots.txt file is 30 seconds. Please avoid running the scraper more than once every 30 seconds!
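
As a rough illustration of the note above, the delay can be read from a robots.txt file with the Python standard library. This is only a sketch: SAMPLE_ROBOTS_TXT is a hypothetical stand-in modeled on the 30-second value mentioned here, not a copy of the real HN file, and crawl_delay_seconds is an illustrative helper that is not part of HNCrawl.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, modeled on the 30-second Crawl-Delay
# described in the note above (not fetched from the real site).
SAMPLE_ROBOTS_TXT = """\
User-Agent: *
Crawl-delay: 30
"""


def crawl_delay_seconds(robots_txt, user_agent="*"):
    """Return the Crawl-delay that applies to user_agent, or None if unset."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


delay = crawl_delay_seconds(SAMPLE_ROBOTS_TXT)
print(delay)
```

In a Scrapy project, the usual place to apply such a value is the DOWNLOAD_DELAY setting in settings.py. Whether HNCrawl sets it is not stated here.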

Scrape the links from the front page of HN

$ scrapy crawl hnspider
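
To give a feel for what "scraping the links from the front page" involves, here is a minimal standard-library sketch that collects off-site links from a page. It is not the hnspider implementation: LinkCollector, external_links, and SAMPLE_HTML are hypothetical names and data made up for illustration, and the real front page markup will differ.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the absolute URL of every <a href=...> tag that is fed in."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the page URL.
            self.links.append(urljoin(self.base_url, href))


def external_links(html, base_url):
    """Return links that point to a different host than base_url."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    base_host = urlparse(base_url).netloc
    return [u for u in collector.links if urlparse(u).netloc != base_host]


# Hypothetical snippet standing in for a front page.
SAMPLE_HTML = (
    '<a href="https://www.eff.org/press/releases/'
    'eff-wins-protection-time-zone-database">EFF Wins</a> '
    '<a href="item?id=1">comments</a>'
)

links = external_links(SAMPLE_HTML, "https://news.ycombinator.com/")
print(links)
```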

Scrape items and write a JSON summary of the scraped items to items.json

$ scrapy crawl alias_scrape -o items.json -t json

Output

Here is an example file hierarchy. Each folder is named with the hex digest of the SHA-1 hash of the Hacker News item URL.

 ├── out
 │   ├── 000f86c7547b47a700dee0879a0fe08b4597360f
 │   │   └── index.html
 │   ├── 0190cbad182ab3bc9a92482d169f38e363ca3c57
 │   │   └── index.html
 │   ├── 02bae9642c8dd4b75a593c1c42beff62824ee8fc
 │   │   └── index.html
 │   ├── 05c1460571f0ac45f77bf2ecbd3cba8b85c20621
 │   │   └── index.html
 │   ├── 0b1587a3dbe9996d10a0fd3250f75462ebd59a0b
 │   │   └── index.html
 │   ├── 0c5c67585004e03341e6a87d2db5257b93337b86
 │   │   └──
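
The folder naming described above can be sketched in a few lines of Python. This assumes the URL is encoded as UTF-8 before hashing, which the README does not state. output_path is a hypothetical helper, and the "out" root and index.html file name are taken from the example hierarchy.

```python
import hashlib
from pathlib import Path


def output_path(url, root="out"):
    """Map an item URL to <root>/<sha1 hex digest>/index.html."""
    # Assumption: the URL is hashed as UTF-8 bytes.
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return Path(root) / digest / "index.html"


path = output_path(
    "https://www.eff.org/press/releases/eff-wins-protection-time-zone-database"
)
print(path)
```

Because the digest depends only on the URL, crawling the same item twice maps to the same folder.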

The JSON summary of a news item looks like this:

{'title': u'EFF Wins Protection for Time Zone Database',
 'url': u'https://www.eff.org/press/releases/eff-wins-protection-time-zone-database'}
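
The sample above is shown as a Python dict. As a sketch of how the summary might be consumed, the snippet below parses a JSON array of such items. It assumes items.json holds a list of objects with title and url keys, which is an inference from the example and not confirmed here. SAMPLE_ITEMS_JSON and load_items are illustrative names.

```python
import json

# Hypothetical contents of items.json: a JSON array of scraped items.
SAMPLE_ITEMS_JSON = """
[
  {"title": "EFF Wins Protection for Time Zone Database",
   "url": "https://www.eff.org/press/releases/eff-wins-protection-time-zone-database"}
]
"""


def load_items(text):
    """Parse the JSON summary and keep only entries with both expected keys."""
    items = json.loads(text)
    return [item for item in items if "title" in item and "url" in item]


items = load_items(SAMPLE_ITEMS_JSON)
for item in items:
    print(item["title"], "->", item["url"])
```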

Dependencies

License

HNCrawl is MIT licensed.

More Repositories

1. freedom.txt (Python, 21 stars): Publicly Support an Open Internet
2. mcut (Python, 12 stars): Python Implementation of Median Cut Color Quantization
3. Dropblog (Python, 5 stars): Dropbox powered, App Engine Hosted
4. alias_stats (JavaScript, 4 stars): A Markdown and Javascript notebook for GitHub UNIX alias stats.
5. Decantor (Python, 4 stars): JSON to CSV conversion tool
6. alias-scrape (Python, 4 stars): Scrape Github Aliases For Great Justice
7. al (Python, 3 stars): Call me al
8. lgtm (3 stars): Looks good to me
9. cargo (Python, 3 stars): Docker Containers for Humans™
10. Python-101 (Shell, 2 stars): A workspace for the Getaround Python 101 sessions
11. lfs (Shell, 2 stars): This my linux from scratch. There are many like it- this one is mine
12. dot_files (Vim Script, 2 stars): dot files go here!
13. Tacco (Python, 2 stars): Talk is cheap. Tacco is free. Plain & simple Python documenter.
14. UCD-ECS193-HandprintDetection (Python, 1 star): UCD ECS193 Handprint Detection Project
15. postbase (Python, 1 star): SQLite-hosted HTTP Request Logging for Whoever™
16. vim-tips (JavaScript, 1 star): A UI for "Best of Vim Tips"
17. Resume (TeX, 1 star): My Resume
18. yufollowme (JavaScript, 1 star)
19. kanjibrowse (Haskell, 1 star): Browse kanji by structure.
20. Python-Hacks (1 star)
21. canvasflood (Go, 1 star): display all the packets!
22. google-api-python-client (Python, 1 star): Automatically exported from code.google.com/p/google-api-python-client
23. musicDB (Python, 1 star): A music management toolset written in python
24. lightswitch (JavaScript, 1 star): How many lines of code does it take to turn on a lightbulb?
25. glitchaday (HTML, 1 star)
26. majorMajor (Python, 1 star): A distributed web crawling platform written in python
27. unblockr (JavaScript, 1 star): Unblock me!