• This repository has been archived on 21/Jan/2022
  • Stars
    star
    108
  • Rank 319,433 (Top 7 %)
  • Language
    Python
  • Created almost 7 years ago
  • Updated over 2 years ago

Repository Details

wechat spiders: a WeChat official account crawler

wechatPubSpider (no longer maintained)

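A crawler for WeChat official accounts. Limitation 1: the Sogou WeChat search engine only exposes the 10 most recent articles of each official account. To crawl every article of the accounts you care about, this project takes the following approach. Crawling an official account requires two main parameters: 1. __biz (the official account ID), and 2. wap_sid2 (roughly, the permission value for fetching an account's articles).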
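
A minimal sketch of how the `__biz` value and the article links might be read out of a Sogou profile page, assuming the page embeds them in an inline script as `var biz = "..."` and `var msgList = {...}` (the form shown under step 1). The function names and the shortened `SAMPLE` string are illustrative and are not taken from the project.

```python
import html
import json
import re

# Matches `var biz = "..."` in the page's inline script.
BIZ_RE = re.compile(r'var\s+biz\s*=\s*"([^"]*)"')
# Matches the start of the `msgList` assignment; the JSON value follows it.
MSG_LIST_RE = re.compile(r'var\s+msgList\s*=\s*')


def extract_biz(page_source):
    """Return the biz value from the page source, or None if absent/empty."""
    match = BIZ_RE.search(page_source)
    if match is None or not match.group(1):
        return None
    return match.group(1)


def extract_content_urls(page_source):
    """Return the article content_url values listed in msgList."""
    match = MSG_LIST_RE.search(page_source)
    if match is None:
        return []
    # raw_decode parses one JSON value and ignores whatever follows it,
    # so the trailing `;seajs.use(...)` does not need to be stripped.
    msg_list, _ = json.JSONDecoder().raw_decode(page_source, match.end())
    urls = []
    for item in msg_list.get("list", []):
        url = item.get("app_msg_ext_info", {}).get("content_url", "")
        if url:
            urls.append(html.unescape(url))  # "&amp;" becomes "&"
    return urls


# Shortened, hypothetical page fragment in the same shape as the real one.
SAMPLE = '''
var biz = "MzI3MTA2OTkxNQ==" || "";
var msgList = {"list":[{"app_msg_ext_info":{"content_url":"/s?timestamp=1507726847&amp;src=3&amp;ver=1","title":"t"}}]};seajs.use("sougou/profile.js");
'''

if __name__ == "__main__":
    print(extract_biz(SAMPLE))
    print(extract_content_urls(SAMPLE))
```

Parsing the JSON with `raw_decode` rather than a regular expression avoids guessing where the object ends, which matters because the signatures inside `content_url` contain many punctuation characters.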

  • 1. Getting the official account ID:

       <script type="text/javascript">
       document.domain="qq.com";
       var biz = "MzI3MTA2OTkxNQ==" || "";
       var src = "3" ; 
       var ver = "1" ; 
       var timestamp = "1507726402" ; 
       var signature = "MgPVC3IPsaxxEYqtBq2IOampNVHLxE*D2-f9b*rLZKzoGtUHRNbDczmbZCSVr2xU0so04b-YJ5*pnPENrnDsMg==" ; 
       var name="java--demon"||"xxxx";
       var msgList = {"list":[{"app_msg_ext_info":{"author":"","content":"","content_url":"/s?timestamp=1507726847&amp;src=3&amp;ver=1&amp;signature=7m*l*VL7N2rmoUqDTJ0cU8HGgyZ6W6vz6lCZESAIKyM0FoT7uPgVZghVou*eg9godSOwuIuNLi3tpwgBVaJEIUtJJTebhtJ*I9ld*q8au3PdmTGHiPtNiNqD1RqpDdG25J7*-jP2pSUJIlp8Ygf90XvblaJQpv9RHGCv-Urxljc=","copyright_stat":100,"cover":"http://mmbiz.qpic.cn/mmbiz/xP5fTfMpdGWskgFKqK158QFLCRtvEAqzD5K97yKF7Hd3Gp34JFR0bFrGahRblIfh6eQxcEpDCAnia1I7UIyrL7w/0?wx_fmt=jpeg","del_flag":1,"digest":"111","fileid":403730075,"is_multi":0,"item_show_type":0,"multi_app_msg_item_list":[],"source_url":"","subtype":9,"title":"新年活动"},"comm_msg_info":{"content":"","datetime":1463107392,"fakeid":"3271069915","id":421070031,"status":2,"type":49}},{"app_msg_ext_info":{"author":"","content":"","content_url":"/s?timestamp=1507726847&amp;src=3&amp;ver=1&amp;signature=7m*l*VL7N2rmoUqDTJ0cU8HGgyZ6W6vz6lCZESAIKyM0FoT7uPgVZghVou*eg9godSOwuIuNLi3tpwgBVaJEIT-v9GhM*P30y4ABR7qZCplkPTZPR8fUsx38LmdErF7aPGHE6cTvHYCllVIQS6-rn6VNpzVAO53lVxwGp9*KyME=","copyright_stat":100,"cover":"http://mmbiz.qpic.cn/mmbiz/xP5fTfMpdGWskgFKqK158QFLCRtvEAqzD5K97yKF7Hd3Gp34JFR0bFrGahRblIfh6eQxcEpDCAnia1I7UIyrL7w/0?wx_fmt=jpeg","del_flag":1,"digest":"111","fileid":403730075,"is_multi":0,"item_show_type":0,"multi_app_msg_item_list":[],"source_url":"","subtype":9,"title":"新年活动"},"comm_msg_info":{"content":"","datetime":1453978263,"fakeid":"3271069915","id":403730105,"status":2,"type":49}}]};seajs.use("sougou/profile.js");
    
    
  • 2. wap_sid2: the permission value for fetching all articles of a single official account.

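    • The wap_sid2 value has to be obtained by capturing and analyzing the traffic of the WeChat mobile app. Observed in Fiddler, when the WeChat client opens a specific official account and loads its history message list, it passes through several URLs before the permission value is issued.
  • 3. Usage: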

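    • In the settings file under wechatPubSpider/wechatSpider -/wechatSpider/, set MONGO_URI, MONGO_DATABASE, MONGO_USER, and MONGO_PASS to the IP address, database, user, and password of the MongoDB instance where the project stores its data.
    • In the settings file under wechatPubSpider/wechatSpider -/wechatSpider/spiders/, set MONGO_URI, MONGO_DATABASE, MONGO_USER, MONGO_PASS, ARTICLE, WECHATID, and RESPONSE. The first four are the same as above; the remaining three are MongoDB collection names, which you can choose freely.
  • 4. Run order: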

    • Run wechat.py, then enter a keyword to look up the official account IDs, which are stored in MongoDB.
     $ scrapy crawl wechat
    
    • Run getSession.py to obtain the permission value wap_sid2 and store each biz together with its wap_sid2 in the database.
     $ scrapy crawl getSession
    
    • Run data.py to crawl the data and store it in the database.
     $ scrapy crawl data
    
    • crawlResult
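
The settings named in step 3 could look roughly like the sketch below. All values are placeholders, and the `build_mongo_url` helper is hypothetical: the README does not show how the project turns these settings into a connection, so this only illustrates one common way to assemble a `mongodb://` connection string from them.

```python
from urllib.parse import quote_plus

# Connection settings (placeholder values, replace with your own).
MONGO_URI = "127.0.0.1"      # IP address of the MongoDB server
MONGO_DATABASE = "wechat"    # database that stores the crawl results
MONGO_USER = "spider"
MONGO_PASS = "p@ss/word"

# Collection names used by the spiders; any names you like.
ARTICLE = "article"
WECHATID = "wechat_id"
RESPONSE = "response"


def build_mongo_url(host, database, user, password, port=27017):
    """Assemble a mongodb:// connection string (hypothetical helper).

    The user name and password are percent-encoded so that characters
    such as "@" or "/" do not break the URL. 27017 is MongoDB's default
    port.
    """
    return "mongodb://{}:{}@{}:{}/{}".format(
        quote_plus(user), quote_plus(password), host, port, database
    )


if __name__ == "__main__":
    print(build_mongo_url(MONGO_URI, MONGO_DATABASE, MONGO_USER, MONGO_PASS))
```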

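If you run into errors, you can capture the traffic and read the X-wechat-key value from the request headers. This value is time-limited and is refreshed roughly every few minutes, so you may need to update it by hand. To learn more, see "WeChat Official Account Article Crawler".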
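
Because the X-wechat-key value expires, it can help to record when it was captured and refuse to send requests with a stale one. The sketch below is one way to do that. The class and function names are illustrative, and the 180-second lifetime is an assumption: the text above only says the value changes every few minutes.

```python
import time

# Assumed lifetime; the real value is only known to be "a few minutes".
KEY_LIFETIME_SECONDS = 180


class WechatKey:
    """A captured X-wechat-key value plus the time it was captured."""

    def __init__(self, value, captured_at=None):
        self.value = value
        self.captured_at = time.time() if captured_at is None else captured_at

    def is_stale(self, now=None, lifetime=KEY_LIFETIME_SECONDS):
        """Return True once the key is at least `lifetime` seconds old."""
        now = time.time() if now is None else now
        return now - self.captured_at >= lifetime


def build_headers(key, now=None):
    """Return request headers carrying the key, or raise if it looks expired."""
    if key.is_stale(now):
        raise RuntimeError("X-wechat-key looks expired; capture a fresh one")
    return {"X-wechat-key": key.value}


if __name__ == "__main__":
    key = WechatKey("value-copied-from-the-capture-tool")
    print(build_headers(key))
```

Failing early with a clear message makes it obvious that the fix is to capture a new value, rather than leaving the cause buried in a failed response.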

More Repositories

1

miniProgram

Maoyan Movies / Taro / WeChat mini program / React
JavaScript
191
star
2

toutiao

Crawler for the Toutiao technology news API
Python
17
star
3

react-admin-system

Admin dashboard template built with React (Ant Design)
JavaScript
14
star
4

QQMusicPlayerWebApp

High-fidelity clone of the QQ Music mobile site (web app)
Vue
11
star
5

QQMusicCrawler

Crawls all QQ Music artists and the song URLs listed under each artist
Python
9
star
6

simple-redux

A simple Redux built with React hooks and context
JavaScript
8
star
7

zhihu-server

A RESTful service similar to the Zhihu community, built on the koa2, MongoDB, and Redis stack
TypeScript
7
star
8

learnNote

Front-end study notes, organized
7
star
9

duSheCommunity

Uses Fiddler packet capture to analyze the app API of the Dushe film review community. A single-machine Scrapy crawler based on scrapy-redis.
Python
6
star
10

distributedCrawler

Message sharing between processes through queues, based on the Managers module of multiprocessing
Python
5
star
11

vill-directive

JavaScript
4
star
12

vill-Loading

JavaScript
2
star
13

electron-vite-react-starter

Starter using Vite2 + React17 + Typescript4 + Electron12.
TypeScript
2
star
14

bilibili-flutter

A Flutter-based clone of the Bilibili Android app: bilibili-flutter
Dart
2
star
15

vill-messgae

A message notification box based on Vue2.52
JavaScript
2
star
16

electron-music-desktop

A cross-platform QQ Music desktop client built with electron-vue; compatible with linux, windows, and darwin
Vue
2
star
17

auto-pilot-deploy

auto-pilot-deploy is a front-end deployment platform for deploying front-end static assets and Node.js services
TypeScript
1
star
18

hat-cli

JavaScript
1
star
19

Harhao

1
star
20

redux-example

JavaScript
1
star
21

mobx-demo

mobx todoList
JavaScript
1
star
22

slide

JavaScript
1
star
23

dva-admin

Admin dashboard built with dva2 and ant-design
JavaScript
1
star
24

simple-vue

JavaScript
1
star
25

micro-video

A short-video mobile app built with React + TypeScript
TypeScript
1
star
26

Harhao.github.io

HTML
1
star
27

juejin-markdown-theme-hey-rain

Juejin Markdown theme
SCSS
1
star