  • Stars
    258
  • Rank 154,844 (Top 4 %)
  • Language
    Python
  • Created about 7 years ago
  • Updated about 7 years ago

Repository Details

Updated version of weibo_terminator. This is the workflow version, aimed at getting the job done!

Weibo Terminator Work Flow

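This project is a reboot of an earlier project, which will continue to be updated. This is the working version of weibo terminator, with some optimizations over the previous version. The final goal here is to crawl corpora together, for applications including sentiment analysis, dialogue corpora, public-opinion risk control, and big-data analysis.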

UPDATE 2017-5-16

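Updates:

  • Adjusted the logic for obtaining cookies on first run: if the program does not detect cookies it exits, which prevents a crash later when no further content can be crawled;
  • Added the WeiBoScraperM class, which is still under construction (PRs implementing it are welcome). This class crawls from another Weibo domain, namely the mobile domain.

Please pull to get the update.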
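
The cookie check described above could be sketched roughly as follows. This is only an illustration, not the project's actual code: the helper name `require_cookies` and the assumption that cookies are held in a dict are both hypothetical.

```python
import sys


def require_cookies(cookies):
    """Exit early when no cookies are available (hypothetical helper).

    `cookies` is assumed to be a dict of cookie name -> value; the real
    project may store cookies differently.
    """
    if not cookies:
        # Without cookies the later requests would fail, so stop right away
        # instead of crashing in the middle of a crawl.
        sys.exit("No cookies found; please obtain cookies before crawling.")
    return cookies
```

Failing fast at startup keeps a missing-cookie problem from surfacing as a harder-to-read error deep inside the crawl loop.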

UPDATE 2017-5-15

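After some small changes and PRs from several contributors, the code has changed slightly. Most of the changes fix bugs and tidy up logic:

  1. Fixed the error when saving; anyone who cloned the code at the first push should pull;
  2. The "WeiboScraper has not attribute weibo_content" error is fixed in the new code.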
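@Fence submitted a PR with the following changes:

  1. The fixed 30 s pause is replaced with a random duration; the parameters can be set yourself
  2. Added big_v_ids_file, which records the ids of celebrities whose fans have already been saved; it uses txt format so that contributors can add or remove entries by hand
  3. Both crawl functions now start from page+1, so resuming from a checkpoint does not re-crawl the last page that was already crawled
  4. Changed "save after crawling all weibos and comments of one id" to "save after crawling one weibo and all of its comments"
  5. (Optional) Factored the file-saving code into separate functions, because saving is needed in 2 and 3 places respectively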
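
Two of the changes above, the random pause and resuming from page+1, can be sketched as follows. The function names, the default bounds, and the injectable `sleep` parameter are assumptions made for illustration; the PR's real parameter names are not shown in this README.

```python
import random
import time


def random_pause(min_seconds=10.0, max_seconds=40.0, sleep=time.sleep):
    """Sleep for a random duration instead of a fixed 30 s.

    The bounds are illustrative defaults, not the project's values.
    """
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


def pages_to_crawl(last_saved_page, total_pages):
    """Resume from the page after the last saved one, so it is not fetched twice."""
    return range(last_saved_page + 1, total_pages + 1)
```

For example, `list(pages_to_crawl(3, 6))` gives `[4, 5, 6]`: page 3 was already saved, so the crawl picks up at page 4.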

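You can run git pull origin master to get the newly updated version. You are also welcome to keep asking me for a uuid; I will periodically publish the list in contirbutor.txt. I am currently working on merging the data, as well as cleaning and classifying it. Once the merge is done I will distribute the large dataset to everyone.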

Improve

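The following improvements were made over the previous version:

  • Fewer distractions and straight to the point: given an id, fetch that user's weibos, weibo count, fan count, all weibo content, and comment content;
  • Unlike the previous version, the idea this time is to save all data in three pickle files, stored as dictionaries, which makes it easy to resume an interrupted crawl;
  • Ids that have already been crawled are not crawled again: the crawler remembers crawled ids, and each id is marked as crawled once all of its content has been fetched;
  • In addition, weibo content and weibo comments are kept separate. If crawling weibo content is interrupted, the next run does not start over but continues from the page where it stopped;
  • More importantly, each id is crawled independently, so you can pull the weibo content for any id you want straight from the pickle files and process it however you like;
  • Finally, a new strategy against anti-crawling measures was tested. The delay mechanism works well, although the crawler still cannot run entirely unattended.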
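
A minimal sketch of the dictionary-in-pickle checkpointing idea described above is shown below. The README does not give the names or layout of the three pickle files, so the single file, the `last_page` / `posts` / `done` fields, and all function names here are assumptions.

```python
import os
import pickle


def load_store(path):
    """Load a pickled dict, or start empty if the file does not exist yet."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {}


def save_store(path, store):
    """Write the dict back to disk so progress survives an interruption."""
    with open(path, "wb") as f:
        pickle.dump(store, f)


def record_page(path, user_id, page, posts):
    """Append one crawled page for `user_id` and persist it immediately."""
    store = load_store(path)
    entry = store.setdefault(user_id, {"last_page": 0, "posts": [], "done": False})
    entry["posts"].extend(posts)
    entry["last_page"] = page
    save_store(path, store)
    return entry


def mark_done(path, user_id):
    """Flag an id as fully crawled so that later runs can skip it."""
    store = load_store(path)
    entry = store.setdefault(user_id, {"last_page": 0, "posts": [], "done": False})
    entry["done"] = True
    save_store(path, store)
```

Because each id is its own dictionary key, one id's data can be read back with `load_store(path)[user_id]` without touching the others, and `last_page` tells a resumed run where to continue.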

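Even more importantly, the crawler is much smarter in this version: while crawling each id, it automatically collects the ids of all of that id's fans. In other words, the ids I hand out are seed ids (celebrities, companies, or media big-V accounts), and from these seeds you can obtain thousands upon thousands of further seed ids. Suppose a celebrity has 34,000 fans: the first pass yields 34,000 ids. Crawling on from those child ids, with each child id having 100 fans, the second pass yields 3.4 million ids. Is that enough? Of course not!

This project will never stop! It will keep going until we have collected enough corpus data!

(Of course, in practice we cannot obtain all fans, but this is still plenty.)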
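
The growth estimate above is simple multiplication, which can be checked with a few lines. The function name is hypothetical, and the calculation assumes every child id has the same number of fans and that no ids overlap, so it is an upper bound rather than a prediction.

```python
def ids_per_generation(seed_fans, fans_per_child, generations):
    """Upper-bound id counts per crawl pass, assuming a uniform fan count."""
    counts = [seed_fans]
    for _ in range(generations - 1):
        # Every id from the previous pass contributes `fans_per_child` new ids.
        counts.append(counts[-1] * fans_per_child)
    return counts


print(ids_per_generation(34_000, 100, 2))  # [34000, 3400000]
```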

Work Flow

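This version is aimed at contributors, and our workflow is very simple:

  1. Get a uuid. The uuid maps to 2-3 ids in distribute_ids.pkl; these are our seed ids. You could also fetch all ids directly, but to avoid duplicated work I suggest requesting a uuid from me and handling only your own. When you finish crawling, send me the final files; after I collate and deduplicate them I will distribute the final large corpus to everyone.
  2. Run python3 main.py uuid. Note that fan ids are crawled only after the ids assigned to the uuid have finished.
  3. Done!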
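
As a rough sketch of step 1, looking up the seed ids for a uuid might look like the code below. The internal layout of distribute_ids.pkl is not documented here, so this assumes it holds a dict mapping each uuid to a list of ids; the function name `seed_ids_for` is also hypothetical.

```python
import pickle


def seed_ids_for(uuid, path="distribute_ids.pkl"):
    """Return the seed ids assigned to `uuid`.

    Assumes the pickle file holds a dict of uuid -> list of ids, which
    may not match the project's real file layout.
    """
    with open(path, "rb") as f:
        distribution = pickle.load(f)
    if uuid not in distribution:
        raise KeyError(f"unknown uuid: {uuid}")
    return distribution[uuid]
```

Only load pickle files from a source you trust, since unpickling can execute arbitrary code.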

Discuss

The discussion groups are posted again here; everyone is welcome to join:

QQ
AI智能自然语言处理 (AI Natural Language Processing): 476464663
Tensorflow智能聊天Bot (Tensorflow Chat Bot): 621970965
GitHub深度学习开源交流 (GitHub Deep Learning Open Source Exchange): 263018023

On WeChat you can add me as a friend: jintianiloveu

Copyright

(c) 2017 Jin Fagang & Tianmu Inc. & weibo_terminator authors LICENSE Apache 2.0

More Repositories

1

tensorflow_poems

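A robot that automatically composes classical Chinese poetry. Awesome, based on the tensorflow1.10 api, actively maintained and upgraded; star it to stay updated!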
Python
3,595
star
2

yolov7_d2

🔥🔥🔥🔥 (Earlier YOLOv7 not official one) YOLO with Transformers and Instance Segmentation, with TensorRT acceleration! 🔥🔥🔥
Python
3,116
star
3

weibo_terminater

Final Weibo Crawler Scrap Anything From Weibo, comments, weibo contents, followers, anything. The Terminator
Python
2,303
star
4

alfred

alfred-py: A deep learning utility library for **human**, more detail about the usage of lib to: https://zhuanlan.zhihu.com/p/341446046
Python
854
star
5

DCNv2_latest

DCNv2 supports decent pytorch such as torch 1.5+ (now 1.8+)
C++
564
star
6

keras_frcnn

Keras Implementation of faster-rcnn
Python
521
star
7

thor

thor: C++ helper library, for deep learning purpose
C++
264
star
8

tensorflow_novelist

Imitates Shakespeare to write plays! Even more awesome, it can also write Jin Yong wuxia novels! Star it to stay updated!!
Python
259
star
9

faceswap_pytorch

Deep fake ready to train on any 2 pair dataset with higher resolution
Python
242
star
10

nb

Neural Network Blocks - Collect all kinds of fancy model blocks for you to build more powerful neural network model.
Python
231
star
11

pytorch_chatbot

A Marvelous ChatBot implement using PyTorch.
Python
226
star
12

CenterNet_Pro_Max

Experiments based on CenterNet (more backbones, TensorRT deployment and mask head)
221
star
13

LSTM_learn

a implement of LSTM using Keras for time series prediction regression problem
Python
214
star
14

AI-Infer-Engine-From-Zero

A handbook on building your own AI inference engine: everything you need to know, starting from zero
206
star
15

movenet

Google's Next Gen Pose Estimation in PyTorch
Python
122
star
16

Spider12306

A Python3-based 12306 ticket-grabbing crawler that uses 10 threads and smartly filters out trains departing between 12:00 midnight and 7:00.
Python
106
star
17

pytorch_image_classifier

Minimal But Practical Image Classifier Pipline Using Pytorch, Finetune on ResNet18, Got 99% Accuracy on Own Small Datasets.
Python
106
star
18

kitti-ssd

Train your own data using SSD in a more clear and simple way(not include source code)
Python
101
star
19

TrafficLightsDetection

using SSD and caffe detect traffic lights on LISA dataset
Python
99
star
20

nosmpl

Accelerated SMPL operation, commonly used in generate 3D human mesh, STAR included.
Python
93
star
21

ssds_pytorch

Multiple basenet MobileNet v1,v2, ResNet combined with SSD detection method and it's variants such as RFB, FSSD etc.
Python
80
star
22

simpleocv

Make a minimal OpenCV runable on any where, WIP
C++
72
star
23

yolov3_tf2

Yolov3 implemented with brand new TensorFlow 2.0 API (both train and prediction)
Python
67
star
24

yolov7-face

Next Gen Face detection based on YOLOv7
Python
55
star
25

FruitsNutsSeg

detectron2 support self-define data train
Python
50
star
26

onnxexplorer

Explorer for ONNX, this tool will help you take a deep inside look of any ONNX model.
Python
44
star
27

Q-Learning

An C++ Version of Q-Learning, to Train Robot Play with Flappybird!!
C++
40
star
28

cityscapestococo

This repo contains usable code convert cityscapes to coco format (Detectron and maskrcnn-benchmark were all broken)
Python
36
star
29

tfboys

TensorFlow and Pytorch practice codes with purity and simplicity.
Python
34
star
30

OpenHandMocap

Python
33
star
31

textfrontend

Separately maintained Chinese TTS
Python
31
star
32

bboxer

Pure, Simple yet Powerful Image Bound Box Making Tool, already cross platform, welcome star and keep updating.
C++
31
star
33

fpn_rssd

Rotated Box SSD detection Framework with FPN support, next generation object detection framework
Python
29
star
34

aural

A Tiny Project For ASR model training and Deployment
Python
28
star
35

Shadowless

A Fast and Open Source Autonomous Perception System.
C++
27
star
36

awesome_transformer

A curated list of transformer learning materials, shared blogs, technical reviews.
26
star
37

pt_mobilenetv2_deeplabv3

Fast accurate realtime segmentation with DeepLabV3 and MobileNetV2 backbone
Python
26
star
38

pytorch_cycle_gan

CycleGAN with Productive Generate APIs. Generate Any Image from Your Transfer Model.
Python
26
star
39

yolovn

Just another yolo variant.
25
star
40

wanwu_release

Wanwu models release, code will be released soon
23
star
41

spconv

Pytorch layer needed by Second Lidar detector.
C++
23
star
42

3d_detection_kit

Toolkit to Explore 3D data for 3D object detection, point cloud visualization, bev map gen etc. Using KITTI as dummy data
Python
22
star
43

pytorch_image_caption

Image Caption, Show and Tell.
Python
20
star
44

datasets

A Collection of Datasets.
19
star
45

pilgrim_torch2trt

Pilgrim Project: torch2trt, quick convert your pytorch model to TensorRT engine.
C++
19
star
46

yolov5_mask

Try add Instance Segmentation upon YoloV5
Python
18
star
47

libnms

libnms.so for object detection, can be use in libtorch or caffe or nccn or onnx or TensorRT
Cuda
17
star
48

pt_enet

Realtime segmentation with ENet, the fast and accurate segmentation net.
Python
14
star
49

GreatDarkNet

An Edit Version of darknet, and this version you can train and predict on your own datasets! more easily!
C
14
star
50

VIBE_yolov5

Using YOLOv5 as detection on VIBE
Python
13
star
51

daybreak_release

Daybreak APP release
12
star
52

gofind

gofind - your personal find helper
Go
12
star
53

cabinet

Cabinet, The Ultimate Tool Box.
Rust
12
star
54

tensorflow_yolov3

A Detailed and Optimized Implementation of Yolo-V3 in Original TensorFlow.
Python
12
star
55

pytorch_name_net

A NetWork Generate Names, Based On Conditional RNN, Set Condition And Generate Different Names.
Python
11
star
56

tensorflow_extractor

State-of-art and Reliable Text-summary and Information Extraction
Python
11
star
57

RetinaNet

Pytorch Implementation of RetinaNet with CUDA accelerate nms operation.
Python
10
star
58

gluon_ssd

Implement SSD using Gluon in only 300 lines of codes!
Python
10
star
59

m

m editor is a modern, easy to use, fast terminal editor then vim or emacs. written in pure Rust.
Rust
10
star
60

wnnx_models

Various test models in WNNX format. It can view with `pip install wnetron && wnetron`
10
star
61

seg_icnet

ICNet in TensorFlow, Real-Time Segmentation
Python
10
star
62

fusion

Fusion package with transformation between camera and lidar, IMU etc. Autonomous and robot helper.
Python
9
star
63

scraper_toolbox

Python3.6 Scraper Toolbox, You can almost Scrap Anything in this Repo, Welcome Pull Request
Python
9
star
64

blackpearl

The Black Pearl in Golang. Personal Assistant.
Go
9
star
65

mxnet_tiny5

Train a classifier on your own dataset with mxnet; supports resuming training from a checkpoint and predicting on a single image
Python
9
star
66

TTS_CN

A Chinese TTS System!
Python
9
star
67

arxiv_action

Custom service for WeCom (企业微信) or DingTalk bots that automatically pushes the latest arxiv papers
Python
9
star
68

papers

Contains many papers with categories in CV, NLP, RL Quantum etc.
8
star
69

pytorch_style_transfer

A Simple Implementation of Neural Style Transfer using Pytorch. You can generate your own art pictures now!
Python
8
star
70

tacotron

TensorFlow implementation of Google Tacotron. Train on Audio and Generate Speech using Text. Which can be Called TTS.
Python
8
star
71

yolov8

7
star
72

numgo

NumPy library in Go.
Go
7
star
73

LLaVA-Magvit2

Python
7
star
74

PoseAILiveLink

PoseAI LiveLink Compatible on macOS
C++
7
star
75

gooooup

Upload load images(files) to cloud, generate permanent link.
Go
7
star
76

mjolnir

Light weighted replacement of original thor C++ library. More simpler, more clean, more light.
C++
7
star
77

UbuntuScripts

Shell
7
star
78

MLLM_Factory

A Dead Simple and Modularized Multi-Modal Training and Finetune Framework. Compatible to any LLaVA/Flamingo/QwenVL/MiniGemini etc series models.
7
star
79

visiontransformers

Vision Transformers that you need.
Python
6
star
80

sherpa_ort

ONNXRuntime ASR C++
C++
6
star
81

minitr

Exploration on Micro Transformers, Unleash the power of mini-transformers!
Python
6
star
82

tensorflow_wgan

A Tensorfow Version of the state-of-art Wasserstein GAN, image super resolution, black image colorful, more function are applying...just star!
Python
6
star
83

tensorflow_classifier

Simple and over-through process for Tensorflow classify images, using own dataset
Python
6
star
84

mxnet_ssd

Another maintained mxnet ssd version
Python
6
star
85

caffe_tiny5

Caffe tutorial for train own data and predict using python
Python
6
star
86

mmc

Next Gen MMD runs on all platforms, Windows, Linux, Mac. Will support exchange between vmd and fbx format.
C++
6
star
87

realrender

3D mesh render without pain.
C++
6
star
88

squeezeseg_pytorch

Realtime Point Cloud Segmentation
Python
6
star
89

vits_cpp

C++ and ONNXRuntime based VITS voice synthesis
C++
6
star
90

tf_pose_realtime

Realtime Openpose with MobileNetV2 backend
PureBasic
6
star
91

AwesomeLLM

6
star
92

mono_odometry

Visual Odometry Using Mono Camera
C++
6
star
93

person_tracking

person tracking in ros
C++
5
star
94

sparrow

The message server in Golang, like WeChat.
JavaScript
5
star
95

algorithm

Contains all kinds of algorithm write in Python and C++, some with Rust.
Python
5
star
96

CaffeHandsOn

This is a Caffe hands on tutorial.
Python
5
star
97

sak

Swiss Army Knife for secret hacking and sniffering
Go
5
star
98

instance_seg_tf

Instance Segmentation with discriminate loss
Python
5
star
99

efficientformers

Collection of efficient transformers.
Python
5
star
100

mumoda

Library to lean big models combined with Text and Image. And then Diffusion!
Python
5
star