zyearn/zhihuCrawler

Stars
166
Rank 227,748 (Top 5 %)
Language
C++
Created over 10 years ago
Updated over 5 years ago

zyearn


An event-driven crawler implemented in C++

Zhihu Crawler

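Introduction

ZhihuCrawler is an efficient, event-driven Zhihu crawler written in C++. Its goal is to collect data such as the most upvoted answers and the most followed questions. It runs on platforms that support epoll.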
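To illustrate what "event-driven on epoll" means in practice, here is a minimal sketch of registering a file descriptor with epoll and reading from it only once the kernel reports it readable. It is not taken from the crawler's source: the function names `wait_for_data` and `pipe_demo` are hypothetical, and a pipe stands in for the network sockets a real crawler would watch.

```cpp
#include <sys/epoll.h>
#include <unistd.h>
#include <cassert>
#include <string>

// Hypothetical helper: blocks until `read_fd` is readable (or the timeout
// expires), then returns the bytes that arrived. Returns "" on error/timeout.
std::string wait_for_data(int read_fd, int timeout_ms) {
    int epfd = epoll_create1(0);
    if (epfd < 0) return "";

    epoll_event ev{};
    ev.events = EPOLLIN;   // ask to be notified when data can be read
    ev.data.fd = read_fd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, read_fd, &ev) < 0) {
        close(epfd);
        return "";
    }

    std::string result;
    epoll_event ready[8];
    int n = epoll_wait(epfd, ready, 8, timeout_ms);
    for (int i = 0; i < n; ++i) {
        if (ready[i].events & EPOLLIN) {
            char buf[256];
            ssize_t len = read(ready[i].data.fd, buf, sizeof(buf));
            if (len > 0) result.append(buf, static_cast<std::string::size_type>(len));
        }
    }
    close(epfd);
    return result;
}

// Hypothetical demo: a pipe plays the role of a socket with pending data.
std::string pipe_demo() {
    int fds[2];
    if (pipe(fds) < 0) return "";
    const char msg[] = "ping";
    std::string got;
    if (write(fds[1], msg, 4) == 4) {
        got = wait_for_data(fds[0], 1000);
    }
    close(fds[0]);
    close(fds[1]);
    return got;
}
```

A real crawler would keep a single epoll instance alive, register many non-blocking sockets with it, and dispatch each ready event to a per-connection handler instead of creating one instance per wait.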

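Usage

First, find the cookie your browser sends when visiting Zhihu and copy it into the cookie variable in src/confic.cc.

Edit ./startfile/seeds.txt; the crawler starts from the user URLs listed in this file.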

make
./zhihuCrawler

You can visit http://localhost:8080 to check the crawler's status.
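As a rough illustration of the seed-file step, the sketch below reads start URLs from a stream. The format of seeds.txt is not documented here, so this assumes one URL per line and skips blank lines; `load_seeds` is a hypothetical name, not a function from the crawler.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: returns the non-empty lines of `in`, trimmed of
// surrounding whitespace. Assumes one seed URL per line.
std::vector<std::string> load_seeds(std::istream& in) {
    std::vector<std::string> seeds;
    std::string line;
    const char* whitespace = " \t\r\n";
    while (std::getline(in, line)) {
        std::string::size_type begin = line.find_first_not_of(whitespace);
        if (begin == std::string::npos) continue;  // blank line
        std::string::size_type end = line.find_last_not_of(whitespace);
        seeds.push_back(line.substr(begin, end - begin + 1));
    }
    return seeds;
}
```

To read the actual file, one could pass a `std::ifstream` opened on ./startfile/seeds.txt to the same function.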

Output

The crawled data is stored in ./datafile/rawData.raw. Run

./sort.sh

to view the results sorted by vote count.
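The idea of ranking results by votes can be sketched in C++ as follows. Neither the layout of rawData.raw nor the contents of sort.sh are described here, so the `Answer` record and the `sort_by_votes` function are purely illustrative assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical record; the real raw data format is not documented here.
struct Answer {
    std::string url;
    long votes;
};

// Returns the answers ordered from most to fewest votes.
// stable_sort keeps the original order among answers with equal votes.
std::vector<Answer> sort_by_votes(std::vector<Answer> answers) {
    std::stable_sort(answers.begin(), answers.end(),
                     [](const Answer& a, const Answer& b) {
                         return a.votes > b.votes;
                     });
    return answers;
}
```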

TODO

~~Use ajax to fetch a user's full lists of followees and followers~~
Reduce coupling between modules
Use proxy IPs to handle 429 errors / IP bans
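One possible shape for the proxy item above is a small pool that switches to the next proxy whenever a response comes back with HTTP 429 (Too Many Requests). This is only a sketch of the idea, not something the project implements: `ProxyPool` and its methods are hypothetical, and the proxy addresses in the usage below are placeholders.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical round-robin proxy pool.
class ProxyPool {
public:
    explicit ProxyPool(std::vector<std::string> proxies)
        : proxies_(std::move(proxies)) {}

    // Proxy to use for the next request; empty string if the pool is empty.
    std::string current() const {
        return proxies_.empty() ? std::string() : proxies_[index_];
    }

    // Feed in the HTTP status of each response. On 429 the pool advances
    // to the next proxy (wrapping around) and returns true.
    bool on_response(int status) {
        if (status != 429 || proxies_.empty()) return false;
        index_ = (index_ + 1) % proxies_.size();
        return true;
    }

private:
    std::vector<std::string> proxies_;
    std::size_t index_ = 0;
};
```

A caller would construct the pool from a list such as `{"proxy-a.example:8080", "proxy-b.example:8080"}`, send each request through `current()`, and retry the request after `on_response` returns true.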

// Writing a crawler in C/C++ is really asking for trouble

zaver

Yet another fast and efficient HTTP server

850

6.828-labs

My lab answers for https://pdos.csail.mit.edu/6.828/2014/

csapp-lab-2e

eleme-hackathon

6.828-hw

Homework and exercise repo for 6.828 (https://pdos.csail.mit.edu/6.828/2014/index.html)

algs4

exercise repo for http://algs4.cs.princeton.edu

Java

TCNVMalloc

TCNVMalloc is an efficient wear-aware allocator for Non-Volatile Memory

TeX

vsbnat

visit server behind NAT

JavaScript

tbus

easy-math-calculator

Python

algo_training

C++

SmallC

Yet another compiler for SmallC (a subset of C)

csapp-lab-3e

zyearn.github.io

ml-ang
