  • Stars: 322
  • Rank 129,670 (Top 3 %)
  • Language: Python
  • Created over 2 years ago
  • Updated about 1 year ago

Repository Details

Chinese text classification implemented in PyTorch (TextCNN, TextRNN, FastText, TextRCNN, BiLSTM_Attention, DPCNN, Transformer, Bert, ERNIE), ready to use out of the box!

Chinese-Text-Classification

Chinese text classification based on PyTorch, ready to use out of the box.

  • Neural network models: TextCNN, TextRNN, FastText, TextRCNN, BiLSTM_Attention, DPCNN, Transformer

  • Pretrained models: Bert, ERNIE

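Introduction

Neural network models

Model descriptions and data flow: see the reference.

Text is fed to the models character by character. The pretrained embeddings are the Sogou News Word+Character 300d vectors (click here to download).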
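Character-level input can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the `<PAD>` and `<UNK>` tokens, the function names, and the default length of 32 are all assumptions made for the example.

```python
from collections import Counter

PAD, UNK = "<PAD>", "<UNK>"  # hypothetical special tokens


def build_vocab(texts, max_size=10000, min_freq=1):
    """Map each character to an integer id, most frequent characters first."""
    counts = Counter(ch for text in texts for ch in text)
    kept = [ch for ch, n in counts.most_common(max_size) if n >= min_freq]
    vocab = {ch: idx for idx, ch in enumerate(kept)}
    vocab[UNK] = len(vocab)
    vocab[PAD] = len(vocab)
    return vocab


def encode(text, vocab, pad_size=32):
    """Split a headline into characters, then truncate or pad to pad_size ids."""
    ids = [vocab.get(ch, vocab[UNK]) for ch in text[:pad_size]]
    ids += [vocab[PAD]] * (pad_size - len(ids))
    return ids
```

Because the headlines are short, a fixed length slightly above the longest headline keeps every batch rectangular without discarding much text.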

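| Model | Description |
| --- | --- |
| TextCNN | Kim 2014, the classic CNN for text classification |
| TextRNN | BiLSTM |
| TextRNN_Att | BiLSTM + Attention |
| TextRCNN | BiLSTM + pooling |
| FastText | bow + bigram + trigram; surprisingly strong results |
| DPCNN | Deep pyramid CNN |
| Transformer | Relatively poor results |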
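The "bow + bigram + trigram" features used by FastText-style models are commonly built with the hashing trick: each n-gram is hashed into a fixed number of buckets, and each bucket has its own embedding. The sketch below illustrates the idea only. The hash function, the bucket count, and the use of id 0 for positions before the start of the sequence are assumptions for the example, not the repository's implementation.

```python
def ngram_bucket_ids(char_ids, n, num_buckets):
    """Return one bucket id per position for the n-gram ending at that position."""
    bucket_ids = []
    for t in range(len(char_ids)):
        # Positions before the start of the sequence are treated as id 0.
        window = [char_ids[i] if i >= 0 else 0 for i in range(t - n + 1, t + 1)]
        key = 0
        for cid in window:
            key = key * 1000003 + cid  # simple polynomial hash (illustrative)
        bucket_ids.append(key % num_buckets)
    return bucket_ids


def fasttext_features(char_ids, num_buckets=250499):
    """Build the three parallel id sequences fed to the three embedding tables."""
    return {
        "unigram": list(char_ids),
        "bigram": ngram_bucket_ids(char_ids, 2, num_buckets),
        "trigram": ngram_bucket_ids(char_ids, 3, num_buckets),
    }
```

A model of this kind would typically look up all three sequences, concatenate or sum the embeddings, average over positions, and pass the result to a small classifier.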

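Pretrained models

| Model | Description | Notes |
| --- | --- | --- |
| bert | The original BERT | |
| ERNIE | ERNIE | |
| bert_CNN | BERT as the embedding layer, followed by a CNN with three kernel sizes | bert + CNN |
| bert_RNN | BERT as the embedding layer, followed by an LSTM | bert + RNN |
| bert_RCNN | BERT as the embedding layer; the LSTM output is concatenated with the BERT output and passed through one max-pooling layer | bert + RCNN |
| bert_DPCNN | BERT as the embedding layer, followed by a region embedding layer made of three different convolutional feature extractors, whose output can itself be viewed as an embedding. Two equal-length convolution layers then widen the receptive field for later feature extraction (enriching the embedding). After that, a residual block with 1/2 pooling is applied repeatedly; each 1/2 pooling step raises the semantic level of the token positions while the number of feature maps stays fixed. The residual connections are introduced to address vanishing and exploding gradients during training. | bert + DPCNN |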
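The effect of the repeated 1/2 pooling can be seen with a little arithmetic. The sketch below assumes a pooling window of 3, a stride of 2, and one position of padding on one side, and assumes the blocks repeat until the sequence length is at most 2. These values are illustrative assumptions, not confirmed settings from the repository.

```python
def pooled_length(length, kernel=3, stride=2, pad=1):
    """Output length of 1-D max pooling with `pad` extra positions on one side."""
    return (length + pad - kernel) // stride + 1


def pyramid_lengths(seq_len):
    """Sequence length after each 1/2-pooling block, until it shrinks to <= 2."""
    lengths = [seq_len]
    while lengths[-1] > 2:
        lengths.append(pooled_length(lengths[-1]))
    return lengths
```

For a length-32 input this gives 32, 16, 8, 4, 2: the "pyramid" shape. Since the number of feature maps is fixed, each block costs about half as much as the previous one.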

Reference:

Environment

python 3.7
pytorch 1.1
tqdm
sklearn
tensorboardX
pytorch_pretrained_bert (the pretraining code is now included in the repository, so this library is no longer needed)

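Chinese dataset

I sampled 200,000 news headlines from THUCNews and uploaded them to GitHub. Headlines are between 20 and 30 characters long. There are 10 categories with 20,000 headlines each. Text is fed to the models character by character.

Categories: finance, real estate, stocks, education, science and technology, society, politics, sports, games, entertainment.

Dataset split:

| Split | Size |
| --- | --- |
| Training set | 180,000 |
| Validation set | 10,000 |
| Test set | 10,000 |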
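A split with these proportions could be produced as sketched below. The README does not say how the split was made, so the per-class (stratified) sampling, the function name, and the fixed seed are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_split(samples, val_per_class, test_per_class, seed=0):
    """Split (text, label) pairs so every class contributes equally to each set."""
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append((text, label))

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, val, test = [], [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        val.extend(items[:val_per_class])
        test.extend(items[val_per_class:val_per_class + test_per_class])
        train.extend(items[val_per_class + test_per_class:])
    return train, val, test
```

With 10 classes of 20,000 headlines each, `val_per_class=1000` and `test_per_class=1000` would yield the 180,000 / 10,000 / 10,000 sizes listed above.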

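Using your own dataset

  • Format your own Chinese dataset in the same way as the THUCNews dataset.
  • For the neural network models:
    • For character-level input, format your data in the same way as the dataset.
    • For word-level input, segment the text in advance and separate words with spaces, then run python run.py --model TextCNN --word True
    • To use pretrained embeddings: the main function in utils.py extracts the pretrained vectors for the words in the vocabulary.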
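Extracting the pretrained vectors for a vocabulary can be sketched as follows. This is not the code in utils.py. It assumes the embedding file is plain text with one token per line followed by space-separated floats, and it assumes tokens missing from the file are given small random vectors. The function name and arguments are hypothetical.

```python
import random


def extract_embeddings(lines, vocab, dim, seed=0):
    """Build a len(vocab) x dim matrix from lines of 'token v1 v2 ... v_dim'."""
    rng = random.Random(seed)
    # Tokens not found in the pretrained file keep a small random vector.
    matrix = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
              for _ in range(len(vocab))]
    found = 0
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip a header line or a malformed row
        token, values = parts[0], parts[1:]
        if token in vocab:
            matrix[vocab[token]] = [float(v) for v in values]
            found += 1
    return matrix, found
```

In practice `lines` would be an open file handle, for example `open(path, encoding="utf-8")`, and the resulting matrix would be saved once and loaded to initialize the embedding layer.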

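Results

Hardware: one 2080Ti GPU. Training time: 30 minutes.

| Model | Accuracy | Notes |
| --- | --- | --- |
| TextCNN | 91.22% | Kim 2014, the classic CNN for text classification |
| TextRNN | 91.12% | BiLSTM |
| TextRNN_Att | 90.90% | BiLSTM + Attention |
| TextRCNN | 91.54% | BiLSTM + pooling |
| FastText | 92.23% | bow + bigram + trigram; surprisingly strong results |
| DPCNN | 91.25% | Deep pyramid CNN |
| Transformer | 89.91% | Relatively poor results |
| bert | 94.83% | Plain BERT |
| ERNIE | 94.61% | So much for crushing BERT on Chinese |
| bert_CNN | 94.44% | bert + CNN |
| bert_RNN | 94.57% | bert + RNN |
| bert_RCNN | 94.51% | bert + RCNN |
| bert_DPCNN | 94.47% | bert + DPCNN |

The original BERT already performs very well. Using BERT as an embedding layer that feeds another model actually lowered accuracy. A comparison on long texts is planned.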

Pretrained language models

Place the BERT model in the bert_pretain directory and the ERNIE model in the ERNIE_pretrain directory. Each directory contains three files:

  • pytorch_model.bin
  • bert_config.json
  • vocab.txt

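Download links for the pretrained models:

bert_Chinese: model https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese.tar.gz
vocabulary https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-vocab.txt

Source: here

Backup: network-drive link for the model: https://pan.baidu.com/s/1qSAD5gwClq7xlgzl_4W3Pw

ERNIE_Chinese: https://pan.baidu.com/s/1lEPdDN1-YQJmKEd_g9rLgw

Source: here

After unpacking, place the files in the corresponding directory as described above and make sure the file names match.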
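A quick way to confirm the file names before training is sketched below. The helper name `missing_files` is hypothetical; the three file names are the ones listed above.

```python
from pathlib import Path

REQUIRED_FILES = ("pytorch_model.bin", "bert_config.json", "vocab.txt")


def missing_files(model_dir, required=REQUIRED_FILES):
    """Return the required file names that are not present in model_dir."""
    model_dir = Path(model_dir)
    return [name for name in required if not (model_dir / name).is_file()]
```

For example, `missing_files("bert_pretain")` returns an empty list when the BERT directory is complete, and otherwise names the files that still need to be added or renamed.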

Usage

Neural network methods

# Train and test:
# TextCNN
python run.py --model TextCNN

# TextRNN
python run.py --model TextRNN

# TextRNN_Att
python run.py --model TextRNN_Att

# TextRCNN
python run.py --model TextRCNN

# FastText; the embedding layer is randomly initialized
python run.py --model FastText --embedding random

# DPCNN
python run.py --model DPCNN

# Transformer
python run.py --model Transformer
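To run several models back to back, the commands above can be assembled in a small driver script. This is a convenience sketch, not part of the repository: `build_command` and `run_all` are hypothetical helpers, and only the `--model`, `--embedding`, and `--word` flags shown in this README are used.

```python
import subprocess
import sys


def build_command(model, embedding=None, word=False, script="run.py"):
    """Assemble the argument list for one training run."""
    cmd = [sys.executable, script, "--model", model]
    if embedding is not None:
        cmd += ["--embedding", embedding]
    if word:
        cmd += ["--word", "True"]
    return cmd


def run_all(models, dry_run=True):
    """Print each command, or execute the runs one after another."""
    for model in models:
        # FastText is run with a randomly initialized embedding layer.
        embedding = "random" if model == "FastText" else None
        cmd = build_command(model, embedding=embedding)
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # stop at the first failing run
```

Calling `run_all(["TextCNN", "FastText"])` with the default `dry_run=True` only prints the commands, which is a cheap way to check them before starting a long run.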

Pretrained methods

Once the pretrained models are downloaded, you can run:

# Train and test the pretrained models:
# bert
python pretrain_run.py --model bert

# bert + other models
python pretrain_run.py --model bert_CNN

# ERNIE
python pretrain_run.py --model ERNIE

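Prediction

Pretrained models:

python pretrain_predict.py

Neural network models:

python predict.py

Parameters

All models are in the models directory. Each model's hyperparameters are defined in the same file as the model itself.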
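The "hyperparameters next to the model" layout can be pictured as a small configuration object defined at the top of each model file. The sketch below is hypothetical: the class name, field names, and default values are assumptions for illustration, except for the 10 classes and 300-dimensional embeddings, which come from the dataset and embedding descriptions in this README.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TextCNNConfig:
    """Hypothetical hyperparameters kept alongside the model definition."""
    model_name: str = "TextCNN"
    num_classes: int = 10        # ten news categories
    embed_dim: int = 300         # matches the 300d pretrained vectors
    pad_size: int = 32           # assumed fixed input length
    batch_size: int = 128        # assumed value
    learning_rate: float = 1e-3  # assumed value
    num_epochs: int = 20         # assumed value
    dropout: float = 0.5         # assumed value


def with_overrides(config, **changes):
    """Return a copy of the config with selected hyperparameters changed."""
    return replace(config, **changes)
```

Keeping the configuration immutable and deriving variants with `with_overrides(TextCNNConfig(), batch_size=64)` makes it easy to see exactly which settings a given experiment changed.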

References

Papers

[1] Convolutional Neural Networks for Sentence Classification

[2] Recurrent Neural Network for Text Classification with Multi-Task Learning

[3] Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification

[4] Recurrent Convolutional Neural Networks for Text Classification

[5] Bag of Tricks for Efficient Text Classification

[6] Deep Pyramid Convolutional Neural Networks for Text Categorization

[7] Attention Is All You Need

[8] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

[9] ERNIE: Enhanced Representation through Knowledge Integration

Repositories

This project builds on the following repositories:
