  • Stars: 118
  • Rank 299,923 (Top 6 %)
  • Language: Python
  • Created over 8 years ago
  • Updated almost 7 years ago

Repository Details

Sample of using proxies to crawl baidu search results.

BaiduCrawler

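Crawls the data inside the c-abstract elements of Baidu search results, and gets around Baidu's anti-crawler measures by continually switching proxy IPs, which makes it possible to crawl the Baidu search results for word lists on the order of 100,000 entries without interruption.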
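As a rough illustration of the extraction step, the sketch below collects the text of every element whose class list contains c-abstract. It is not the project's own code: the project installs beautifulsoup4 and lxml, while this sketch uses only the standard-library html.parser so that it runs without extra packages. The names AbstractParser and extract_abstracts are hypothetical, and real result pages may be structured differently.

```python
from html.parser import HTMLParser


class AbstractParser(HTMLParser):
    """Collects the text of every element whose class list contains 'c-abstract'.

    Hypothetical helper for illustration; not part of BaiduCrawler.
    """

    # Void elements never get a closing tag, so they must not change the depth.
    VOID_TAGS = {"br", "img", "hr", "input", "meta", "link", "wbr"}

    def __init__(self):
        super().__init__()
        self.depth = 0       # > 0 while inside a c-abstract element
        self.parts = []      # text fragments of the abstract being read
        self.abstracts = []  # finished abstracts, in page order

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0:
            self.depth += 1          # nested tag inside the abstract
        elif "c-abstract" in classes:
            self.depth = 1           # entering a new abstract
            self.parts = []

    def handle_endtag(self, tag):
        if tag in self.VOID_TAGS or self.depth == 0:
            return
        self.depth -= 1
        if self.depth == 0:          # the abstract element just closed
            self.abstracts.append("".join(self.parts).strip())

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)


def extract_abstracts(html):
    """Return the text of all c-abstract elements found in `html`."""
    parser = AbstractParser()
    parser.feed(html)
    parser.close()
    return parser.abstracts
```

For example, extract_abstracts('<div class="c-abstract">Hello <em>world</em></div>') returns a one-element list containing "Hello world".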

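Proxy IP acquisition strategy

    1. Scrape every [ip:port] pair from the page and check that each one is usable (some proxy IPs cannot be connected to).
    2. Use a "multi-round check" strategy: each IP goes through N rounds of connection tests spaced duration apart, and every round discards the IPs whose connection time exceeds timeout. The IPs that survive all N rounds connected within timeout every single time, which avoids the "15 minutes of fame" effect.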
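The multi-round check can be sketched roughly as follows. This is a minimal illustration under assumptions, not the code in ip_pool.py: multi_round_filter and its probe argument are hypothetical names, and probe is assumed to return the connection time in seconds, or None when the proxy cannot be reached.

```python
import time


def multi_round_filter(proxies, probe, rounds=3, duration=0.0, timeout=2.0):
    """Keep only the proxies that connect within `timeout` in every round.

    `probe(proxy)` is a hypothetical callable that returns the connection
    time in seconds, or None if the proxy could not be reached.
    """
    alive = list(proxies)
    for round_no in range(rounds):
        survivors = []
        for proxy in alive:
            elapsed = probe(proxy)
            # Drop proxies that failed or were slower than the limit.
            if elapsed is not None and elapsed <= timeout:
                survivors.append(proxy)
        alive = survivors
        if not alive:
            break
        if round_no < rounds - 1:
            time.sleep(duration)  # wait before the next round
    return alive
```

A real probe could, for instance, time a requests.get call made through the proxy with the same timeout and return None on any exception. Passing the probe in as an argument keeps the filtering logic testable without network access.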

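Crawling strategy

There are 3 strategies:

    1. Whenever a download_error occurs, switch to another IP
    2. After every 200 texts crawled, switch to another IP
    3. After every 20,000 crawls, refresh the IP resource pool

All of the parameters above can be adjusted manually. At present each IP pool is used only once; if you need more high-quality IPs, see my other project, Proxy, an all-in-one tool for scraping, testing, evaluating and storing proxy IPs, which may help.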
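The three rules can be expressed as a small decision helper. This is a sketch under stated assumptions rather than the logic in baidu_crawler.py: the class RotationPolicy and its method record are hypothetical, the 200-text counter is assumed to count successful fetches since the last switch, and the 20,000 counter is assumed to count every attempt.

```python
class RotationPolicy:
    """Decides after each crawl attempt whether to keep, switch, or refresh.

    Hypothetical helper; the thresholds mirror the defaults named above.
    """

    def __init__(self, switch_every=200, refresh_every=20000):
        self.switch_every = switch_every    # texts per IP before switching
        self.refresh_every = refresh_every  # attempts before refreshing the pool
        self.since_switch = 0               # successful fetches on the current IP
        self.total = 0                      # all attempts so far

    def record(self, download_error=False):
        """Register one attempt and return 'refresh', 'switch', or 'keep'."""
        self.total += 1
        if self.total % self.refresh_every == 0:
            self.since_switch = 0
            return "refresh"                # rule 3: rebuild the IP pool
        if download_error:
            self.since_switch = 0
            return "switch"                 # rule 1: the current IP failed
        self.since_switch += 1
        if self.since_switch >= self.switch_every:
            self.since_switch = 0
            return "switch"                 # rule 2: the IP has served its quota
        return "keep"
```

The crawl loop would call record once per request and act on the returned string, so the thresholds stay in one place and are easy to tune.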

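TODO

    1. Re-crawl the words that were missed because of network problems, until the crawl rate specified by the user is reached
    2. Give more weight to fast, high-quality IPs, so as to form a prioritized IP pool
    3. Rewrite the IP evaluation to be multithreaded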
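One possible shape for the prioritized IP pool in the second TODO item is a min-heap keyed on measured connection time. This is only a sketch of an unimplemented idea: PriorityProxyPool, add and pop_fastest are hypothetical names, and using the average connection time as the weight is an assumption.

```python
import heapq


class PriorityProxyPool:
    """Hands out proxies fastest-first; each proxy is used once.

    Hypothetical sketch of the TODO item, not existing project code.
    """

    def __init__(self):
        self._heap = []    # entries are (avg_seconds, insertion_order, proxy)
        self._counter = 0  # tie-breaker so equal times keep insertion order

    def add(self, proxy, avg_seconds):
        heapq.heappush(self._heap, (avg_seconds, self._counter, proxy))
        self._counter += 1

    def pop_fastest(self):
        """Remove and return the fastest remaining proxy, or None if empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

Popping removes the proxy from the pool, which matches the one-time use of IPs described earlier; a reusable pool would push the proxy back with an updated time instead.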

Usage

Preparation

pip install requests
pip install lxml
pip install beautifulsoup4

git clone https://github.com/fancoo/BaiduCrawler
cd BaiduCrawler

Python 2.7

python baidu_crawler.py

Python 3

This program has only been tested on the Windows build of Python 3.6.

cd Py3
python baidu_crawler.py

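Update 2017/5/4

  • The site originally used to check whether an IP is valid stopped working and has been replaced.
  • Added more proxy IP websites.
  • Improved configurability.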

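Update 2017/6/13

  • Scraped proxy IPs are now stored in MySQL; on the next run they are read from the database first and then scraped from the websites.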
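The "database first, then websites" order can be sketched as below. To stay self-contained the sketch swaps in the standard-library sqlite3 module in place of MySQL, and every name in it (open_store, save_proxies, load_proxies, get_proxies, the proxy table, the scrape callable) is hypothetical. It also assumes the websites are scraped only when the database holds fewer than minimum proxies, which the project may handle differently.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the proxy store (SQLite here, standing in for MySQL)."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS proxy (addr TEXT PRIMARY KEY)")
    return conn


def save_proxies(conn, proxies):
    """Persist proxies, silently skipping addresses already stored."""
    conn.executemany(
        "INSERT OR IGNORE INTO proxy (addr) VALUES (?)",
        [(p,) for p in proxies],
    )
    conn.commit()


def load_proxies(conn):
    """Return all stored proxy addresses, sorted."""
    return [row[0] for row in conn.execute("SELECT addr FROM proxy ORDER BY addr")]


def get_proxies(conn, scrape, minimum=1):
    """Read from the database first; scrape only if too few are stored.

    `scrape` is a hypothetical callable returning a list of 'ip:port' strings.
    """
    stored = load_proxies(conn)
    if len(stored) >= minimum:
        return stored
    fresh = [p for p in scrape() if p not in stored]
    save_proxies(conn, fresh)
    return stored + fresh
```

On the first run the store is empty, so get_proxies falls through to scrape and saves what it finds; later runs are served from the database until it drops below minimum.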

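Update 2017/6/18

  • Revised part of the PR submitted by BoBoGithub and refactored the code in ip_pool.py.
  • This version in fact only saves valid IPs to the database; IP quality ranking and multithreaded crawling are not implemented. Time and energy are limited, so they may be added in the future.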

Update 2017/7/25

  • Added support for Python 3.6.

More Repositories

    1. Queryable (Swift, 2,430 stars): Run OpenAI's CLIP model on iOS to search photos.
    2. tinymind (TypeScript, 617 stars): Tinymind - Write and sync your blog & thoughts with GitHub
    3. disco-diffusion-wrapper (Jupyter Notebook, 571 stars): Implementation of a disco-diffusion wrapper that can run on your own GPU with batch text input.
    4. randomCNN-voice-transfer (Python, 375 stars): Audio style transfer with shallow random parameters CNN.
    5. PodFind (JavaScript, 149 stars): Find what podcasters think of new things: GPT-4, SVB, etc.
    6. Proxy (Python, 125 stars): A simple tool for fetching usable proxies from several websites.
    7. api-usage (HTML, 58 stars): Track your OpenAI API token usage & cost.
    8. WaveGAN-pytorch (Python, 57 stars): PyTorch implementation of "Synthesizing Audio with Generative Adversarial Networks"
    9. teach-show-consult (Python, 47 stars): Teach ChatGPT the Alda music programming language, show it some superb code, and consult with it to compose a melody.
    10. QLearningMouse (Python, 24 stars): Cat-and-Mouse game with Reinforcement Learning (Q-Learning).
    11. make-CelebA-HQ (Python, 15 stars): Suppose you've downloaded the CelebA & CelebA-HQ datasets and want to get HQ images from them.
    12. Manzarek (Python, 11 stars): A tiny bot that reposts blind date information from the website fanfou.
    13. Disentangled-Sequential-Autoencoder (Jupyter Notebook, 8 stars): PyTorch Implementation of Disentangled Sequential Autoencoder
    14. Focus (JavaScript, 8 stars): Chrome Extension: One-click to batch open websites, double-click to close them.
    15. N-Grams-novel (Python, 4 stars): An English & Chinese novel generator based on N-Grams.
    16. DrQAChinese (Python, 3 stars)
    17. mazz.github.io (HTML, 1 star)
    18. mazzzystar.github.io (HTML, 1 star)
    19. MusicGAN (Python, 1 star): Generate raw piano audio with long-term "structure" dependency; result: https://soundcloud.com/mazzzystar/sets/only-1-discriminator-to-control-both-local-long-term