mazzzystar/BaiduCrawler

Stars
118
Rank 299,923 (Top 6 %)
Language
Python
Created over 8 years ago
Updated almost 7 years ago

A sample of using proxies to crawl Baidu search results.

BaiduCrawler

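Crawls the data inside c-abstract in Baidu search results, and bypasses Baidu's anti-crawler strategy by continually switching proxy IPs, so that Baidu search results for hundreds of thousands of query terms can be crawled continuously.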
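
As an illustration of what "the data inside c-abstract" means, here is a minimal sketch that collects the text of every element whose class list contains c-abstract. It uses the standard library's html.parser so that it runs without extra packages; the project itself lists beautifulsoup4 and lxml as dependencies, so this is a stand-in and not the project's code. The names AbstractExtractor and extract_abstracts are hypothetical.

```python
from html.parser import HTMLParser


class AbstractExtractor(HTMLParser):
    """Collect the text of every element whose class list contains 'c-abstract'."""

    # Void elements never get a closing tag, so they must not affect nesting depth.
    VOID = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0       # nesting depth inside the current c-abstract element
        self.parts = []      # text fragments of the current abstract
        self.abstracts = []  # finished abstracts, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        if self.depth:
            self.depth += 1
        else:
            classes = (dict(attrs).get("class") or "").split()
            if "c-abstract" in classes:
                self.depth = 1
                self.parts = []

    def handle_endtag(self, tag):
        if tag in self.VOID or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.abstracts.append("".join(self.parts).strip())

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def extract_abstracts(html):
    """Return the text of each c-abstract element found in an HTML string."""
    parser = AbstractExtractor()
    parser.feed(html)
    parser.close()
    return parser.abstracts
```

The input is assumed to be the already-downloaded HTML of one result page; fetching it through a proxy is a separate step.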

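Proxy IP acquisition strategy

1. Scrape every [ip:port] pair on the page and check its availability (some proxy IPs cannot be reached at all).
1. Use a "multi-round check" strategy: each IP goes through N rounds of connection tests, spaced duration apart, and every round discards IPs whose connection time exceeds timeout. The IPs that survive all N rounds connected within timeout every single time, which avoids the "15 minutes of fame" effect, that is, proxies that work only briefly.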
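
The multi-round check above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in ip_pool.py: the function name multi_round_filter, the probe callback, and the default values are all hypothetical. probe(proxy) is expected to return the connection time in seconds, or None when the proxy cannot be reached; in practice it could wrap a timed requests.get call made through the proxy.

```python
import time


def multi_round_filter(proxies, probe, rounds=3, duration=60.0, timeout=2.0,
                       sleep=time.sleep):
    """Keep only the proxies that answer within `timeout` in every round.

    probe(proxy) -> seconds taken to connect, or None if the proxy failed.
    `sleep` is injectable so the waiting between rounds can be skipped in tests.
    """
    alive = list(proxies)
    for round_no in range(rounds):
        survivors = []
        for proxy in alive:
            elapsed = probe(proxy)
            if elapsed is not None and elapsed <= timeout:
                survivors.append(proxy)
        alive = survivors
        if not alive:
            break  # nothing left to test
        if round_no < rounds - 1:
            sleep(duration)  # wait before the next round
    return alive
```

A proxy that is fast in one round but slow or dead in another is dropped, which is the point of testing over several spaced rounds instead of once.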

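Crawling strategy

There are three strategies:

1. Whenever a download_error occurs, switch to another IP.
1. After every 200 texts crawled, switch to another IP.
1. After every 20,000 crawls, refresh the IP resource pool.

All of the parameters above can be adjusted manually. Currently each IP pool is used only once. If you need more high-quality IPs, see my other project Proxy, an all-in-one tool for scraping, testing, evaluating, and storing proxy IPs, which may help you.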
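
The three rules above can be sketched as a small counter object. This is an illustration only: the class name RotationPolicy and the action strings are hypothetical, and it assumes that the 200 threshold counts successfully crawled texts while the 20,000 threshold counts every attempt.

```python
class RotationPolicy:
    """Decide when to switch proxy IPs or refresh the whole pool."""

    def __init__(self, switch_every=200, refresh_every=20000):
        self.switch_every = switch_every    # successful texts per IP
        self.refresh_every = refresh_every  # attempts per pool refresh
        self.since_switch = 0
        self.since_refresh = 0

    def record(self, ok):
        """Record one crawl attempt and return the next action:
        'refresh_pool', 'switch_ip', or 'continue'."""
        self.since_refresh += 1
        if ok:
            self.since_switch += 1
        if self.since_refresh >= self.refresh_every:
            # A refreshed pool also means starting over with a new IP.
            self.since_refresh = 0
            self.since_switch = 0
            return "refresh_pool"
        if not ok or self.since_switch >= self.switch_every:
            # Either a download error or the per-IP quota was reached.
            self.since_switch = 0
            return "switch_ip"
        return "continue"
```

The crawl loop would call record(ok) after each request and act on the returned string; keeping the thresholds as constructor arguments mirrors the note that the parameters are adjustable.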

TODO

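1. Re-crawl the terms that were missed because of network problems, until the crawl rate specified by the user is reached.
1. Give more weight to high-quality IPs that crawl quickly, so as to build a prioritized IP pool.
1. Rewrite IP evaluation to be multithreaded.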
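
The first TODO item is not implemented in the project; one possible shape for it is sketched below. Everything here is hypothetical: the function name crawl_until_rate, the fetch callback (which returns the crawled text, or None on failure), and the max_passes safety limit. The terms are assumed to be unique.

```python
def crawl_until_rate(words, fetch, target_rate=0.95, max_passes=5):
    """Re-crawl failed terms until the share of crawled terms reaches target_rate.

    fetch(word) -> crawled text, or None if the download failed.
    Returns (results, pending): the crawled texts and the terms still missing.
    """
    results = {}
    pending = list(words)
    total = len(pending)
    for _ in range(max_passes):
        if total == 0 or len(results) / total >= target_rate:
            break
        failed = []
        for word in pending:
            text = fetch(word)
            if text is None:
                failed.append(word)   # try again in the next pass
            else:
                results[word] = text
        pending = failed
    return results, pending
```

The max_passes limit keeps the loop from running forever when some terms can never be fetched.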

Usage

Preparation

pip install requests
pip install lxml
pip install beautifulsoup4

git clone https://github.com/fancoo/BaiduCrawler
cd BaiduCrawler

Python 2.7

python baidu_crawler.py

Python 3

This program has only been tested on the Windows build of Python 3.6.

cd Py3
python baidu_crawler.py

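2017/5/4 update

The website previously used to check whether an IP is valid stopped working and has been replaced.
Added more proxy IP websites.
Improved configurability.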

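2017/6/13 update

New: scraped proxy IP data is now stored in MySQL. On the next run, proxies are read from the database first and then scraped from the websites.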
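
The "read from the database first, then scrape" flow can be sketched as below. To stay self-contained, this sketch swaps in the standard library's sqlite3 instead of MySQL; the table layout, the function names, and the rule of scraping only when the store holds fewer than wanted proxies are all assumptions made for illustration.

```python
import sqlite3


def init_store(conn):
    """Create the (hypothetical) proxies table if it does not exist yet."""
    conn.execute("CREATE TABLE IF NOT EXISTS proxies (addr TEXT PRIMARY KEY)")


def save_proxies(conn, addrs):
    """Store ip:port strings, silently skipping ones already present."""
    conn.executemany("INSERT OR IGNORE INTO proxies (addr) VALUES (?)",
                     [(a,) for a in addrs])
    conn.commit()


def load_proxies(conn, fetch_from_sites, wanted=10):
    """Return up to `wanted` proxies, preferring those already stored.

    fetch_from_sites() is a caller-supplied function that scrapes the proxy
    websites and returns a list of ip:port strings.
    """
    rows = conn.execute("SELECT addr FROM proxies ORDER BY rowid").fetchall()
    stored = [row[0] for row in rows]
    if len(stored) >= wanted:
        return stored[:wanted]  # the database alone is enough
    # Scrape only when the store runs short, and keep new finds for next time.
    fresh = [a for a in dict.fromkeys(fetch_from_sites()) if a not in stored]
    save_proxies(conn, fresh)
    return (stored + fresh)[:wanted]
```

With a MySQL driver the structure would be the same, but the SQL dialect differs (for example, INSERT OR IGNORE is SQLite syntax).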

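2017/6/18 update

Revised parts of the PR submitted by BoBoGithub and refactored the code in ip_pool.py.
The current version only saves valid IPs to the database; it does not yet rank IPs by quality or crawl with multiple threads. Because time and energy are limited, these may be added in the future.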

2017/7/25 update

Added support for Python 3.6.

Queryable

Run OpenAI's CLIP model on iOS to search photos.

tinymind

Tinymind - Write and sync your blog & thoughts with GitHub

disco-diffusion-wrapper

Implementation of disco-diffusion wrapper that could run on your own GPU with batch text input.

Jupyter Notebook

randomCNN-voice-transfer

Audio style transfer with shallow random parameters CNN.

PodFind

Find what podcasters think of new things: GPT-4, SVB, etc.

Proxy

A simple tool for fetching usable proxies from several websites.

api-usage

Track your OpenAI API token usage & cost.

WaveGAN-pytorch

PyTorch implementation of "Synthesizing Audio with Generative Adversarial Networks"

teach-show-consult

Teach ChatGPT the Alda music programming language, show it some superb code, and consult with it to compose a melody.

QLearningMouse

Cat-and-Mouse game with Reinforcement Learning (Q-Learning).

make-CelebA-HQ

Suppose you've downloaded the CelebA & CelebA-HQ datasets and want to get HQ images from them.

Manzarek

A tiny bot that reposts blind date information from the website fanfou.

Disentangled-Sequential-Autoencoder

PyTorch Implementation of Disentangled Sequential Autoencoder

Jupyter Notebook

Focus

Chrome Extension: One-click to batch open websites, double-click to close them.

N-Grams-novel

An English & Chinese novel generator based on N-Grams.

DrQAChinese

mazz.github.io

mazzzystar.github.io

MusicGAN

Generate long-term "structure" dependency raw piano audio, result: https://soundcloud.com/mazzzystar/sets/only-1-discriminator-to-control-both-local-long-term