mvanveen/hncrawl

Stars: 150
Rank: 247,323 (Top 5 %)
Language: Python
License: MIT License
Created over 12 years ago
Updated over 11 years ago

A scrapy-based Hacker News crawler.

HNCrawl

Introduction

HNCrawl is a tiny, simple Scrapy-based crawler that grabs the HTML content of the pages linked from the front page of Hacker News.
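
To make the idea concrete, here is a minimal sketch of the link-gathering step. It is not HNCrawl's actual spider: it swaps Scrapy for the standard library's `html.parser`, and the names `LinkCollector` and `external_links`, the sample markup, and the rule "keep only off-site http(s) links" are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Relative links such as "item?id=1" become absolute URLs.
            self.links.append(urljoin(self.base_url, href))


def external_links(html, base_url="https://news.ycombinator.com/"):
    """Return absolute http(s) links that point away from base_url's host."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    host = urlparse(base_url).netloc
    return [
        link
        for link in collector.links
        if urlparse(link).scheme in ("http", "https")
        and urlparse(link).netloc != host
    ]


# Hypothetical front-page fragment: one story link, one comments link.
SAMPLE_HTML = (
    '<a href="https://www.eff.org/press">EFF story</a>'
    '<a href="item?id=1">comments</a>'
)
```

Calling `external_links(SAMPLE_HTML)` keeps the story link and drops the on-site comments link. A real crawler would then fetch each remaining URL and store its HTML.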

Examples

Installation

$ pip install scrapy
$ git clone git@github.com:mvanveen/hncrawl.git

Scraping

Note: The Crawl-Delay value in the HN robots.txt file is 30 seconds. Please avoid running the scraper more than once every 30 seconds!
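
One way to honor that delay is to read it from robots.txt and hand it to Scrapy. The sketch below parses an inline sample with the standard library's `urllib.robotparser` and builds a settings dictionary using Scrapy's `DOWNLOAD_DELAY` and `ROBOTSTXT_OBEY` settings. The sample robots.txt text and the helper name `crawl_delay_seconds` are illustrative assumptions; whether HNCrawl configures these settings itself is not stated here.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content (an assumption, not fetched from HN).
SAMPLE_ROBOTS_TXT = """\
User-Agent: *
Crawl-delay: 30
"""


def crawl_delay_seconds(robots_txt, user_agent="*"):
    """Return the Crawl-delay for user_agent, or None if none is declared."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


# Settings a Scrapy project could place in settings.py (hypothetical usage).
settings = {
    "ROBOTSTXT_OBEY": True,  # skip URLs that robots.txt disallows
    "DOWNLOAD_DELAY": crawl_delay_seconds(SAMPLE_ROBOTS_TXT),  # seconds
}
```

`DOWNLOAD_DELAY` spaces out requests within a single run. It does not stop you from launching the scraper again right away, so the 30-second gap between runs is still up to you.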

Scrape the links from the front page of HN

$ scrapy crawl hnspider

Scrape items and write a JSON summary of the scraped items to items.json

$ scrapy crawl alias_scrape -o items.json -t json

Output

Here is an example file hierarchy. Each folder is named with the hex digest of the SHA-1 hash of the Hacker News item URL.

 ├── out
 │   ├── 000f86c7547b47a700dee0879a0fe08b4597360f
 │   │   └── index.html
 │   ├── 0190cbad182ab3bc9a92482d169f38e363ca3c57
 │   │   └── index.html
 │   ├── 02bae9642c8dd4b75a593c1c42beff62824ee8fc
 │   │   └── index.html
 │   ├── 05c1460571f0ac45f77bf2ecbd3cba8b85c20621
 │   │   └── index.html
 │   ├── 0b1587a3dbe9996d10a0fd3250f75462ebd59a0b
 │   │   └── index.html
 │   ├── 0c5c67585004e03341e6a87d2db5257b93337b86
 │   │   └──
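
The folder naming described above can be sketched as follows. The helper name `output_path`, the UTF-8 encoding of the URL before hashing, and hashing the URL string exactly as scraped are assumptions. HNCrawl may normalize the URL differently, so digests from this sketch are not guaranteed to match the folders shown.

```python
import hashlib
from pathlib import Path


def output_path(url, root="out"):
    """Map an item URL to <root>/<sha1 hex digest>/index.html."""
    # Assumption: the URL is hashed as UTF-8 bytes, with no normalization.
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return Path(root) / digest / "index.html"


path = output_path(
    "https://www.eff.org/press/releases/eff-wins-protection-time-zone-database"
)
```

A SHA-1 hex digest is always 40 characters long. Every folder name therefore has the same length and is safe to use on any filesystem, whatever characters the original URL contains.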

The JSON summary of a news item looks like this:

{'title': u'EFF Wins Protection for Time Zone Database',
 'url': u'https://www.eff.org/press/releases/eff-wins-protection-time-zone-database'}
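
As a rough sketch of how the summary might be consumed: Scrapy's JSON feed export writes a single JSON array of item objects, so the file can be read with the standard `json` module. The inline sample string and the helper name `load_items` are illustrative. The `title` and `url` keys are taken from the example above.

```python
import json

# Inline stand-in for the contents of items.json (illustrative sample).
SAMPLE_ITEMS_JSON = """
[
  {"title": "EFF Wins Protection for Time Zone Database",
   "url": "https://www.eff.org/press/releases/eff-wins-protection-time-zone-database"}
]
"""


def load_items(text):
    """Parse a JSON array of items into (title, url) tuples."""
    return [(item["title"], item["url"]) for item in json.loads(text)]


items = load_items(SAMPLE_ITEMS_JSON)
```

To read a real export, pass the file's text instead, for example `load_items(open("items.json").read())`.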

Dependencies

License

HNCrawl is MIT licensed.
