• Stars
    star
    6,526
  • Rank 6,068 (Top 0.2 %)
  • Language
    Python
  • License
    MIT License
  • Created over 6 years ago
  • Updated about 2 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

pkuseg多领域中文分词工具; The pkuseg toolkit for multi-domain Chinese word segmentation

pkuseg:一个多领域中文分词工具包 (English Version)

pkuseg 是基于论文[Luo et. al, 2019]的工具包。其简单易用,支持细分领域分词,有效提升了分词准确度。

目录

主要亮点

pkuseg具有如下几个特点:

  1. 多领域分词。不同于以往的通用中文分词工具,此工具包同时致力于为不同领域的数据提供个性化的预训练模型。根据待分词文本的领域特点,用户可以自由地选择不同的模型。 我们目前支持了新闻领域,网络领域,医药领域,旅游领域,以及混合领域的分词预训练模型。在使用中,如果用户明确待分词的领域,可加载对应的模型进行分词。如果用户无法确定具体领域,推荐使用在混合领域上训练的通用模型。各领域分词样例可参考 example.txt
  2. 更高的分词准确率。相比于其他的分词工具包,当使用相同的训练数据和测试数据,pkuseg可以取得更高的分词准确率。
  3. 支持用户自训练模型。支持用户使用全新的标注数据进行训练。
  4. 支持词性标注。

编译和安装

  • 目前仅支持python3
  • 为了获得好的效果和速度,强烈建议大家通过pip install更新到目前的最新版本
  1. 通过PyPI安装(自带模型文件):

    pip3 install pkuseg
    之后通过import pkuseg来引用
    

    建议更新到最新版本以获得更好的开箱体验:

    pip3 install -U pkuseg
    
  2. 如果PyPI官方源下载速度不理想,建议使用镜像源,比如:
    初次安装:

    pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple pkuseg
    

    更新:

    pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple -U pkuseg
    
  3. 如果不使用pip安装方式,选择从GitHub下载,可运行以下命令安装:

    python setup.py build_ext -i
    

    GitHub的代码并不包括预训练模型,因此需要用户自行下载或训练模型,预训练模型可详见release。使用时需设定"model_name"为模型文件。

注意:安装方式1和2目前仅支持linux(ubuntu)、mac、windows 64 位的python3版本。如果非以上系统,请使用安装方式3进行本地编译安装。

各类分词工具包的性能对比

我们选择jieba、THULAC等国内代表分词工具包与pkuseg做性能比较,详细设置可参考实验环境

细领域训练及测试结果

以下是在不同数据集上的对比结果:

MSRA Precision Recall F-score
jieba 87.01 89.88 88.42
THULAC 95.60 95.91 95.71
pkuseg 96.94 96.81 96.88
WEIBO Precision Recall F-score
jieba 87.79 87.54 87.66
THULAC 93.40 92.40 92.87
pkuseg 93.78 94.65 94.21

默认模型在不同领域的测试效果

考虑到很多用户在尝试分词工具的时候,大多数时候会使用工具包自带模型测试。为了直接对比“初始”性能,我们也比较了各个工具包的默认模型在不同领域的测试效果。请注意,这样的比较只是为了说明默认情况下的效果,并不一定是公平的。

Default MSRA CTB8 PKU WEIBO All Average
jieba 81.45 79.58 81.83 83.56 81.61
THULAC 85.55 87.84 92.29 86.65 88.08
pkuseg 87.29 91.77 92.68 93.43 91.29

其中,All Average显示的是在所有测试集上F-score的平均。

更多详细比较可参见和现有工具包的比较

使用方式

代码示例

以下代码示例适用于python交互式环境。

代码示例1:使用默认配置进行分词(如果用户无法确定分词领域,推荐使用默认模型分词

import pkuseg

seg = pkuseg.pkuseg()           # 以默认配置加载模型
text = seg.cut('我爱北京天安门')  # 进行分词
print(text)

代码示例2:细领域分词(如果用户明确分词领域,推荐使用细领域模型分词

import pkuseg

seg = pkuseg.pkuseg(model_name='medicine')  # 程序会自动下载所对应的细领域模型
text = seg.cut('我爱北京天安门')              # 进行分词
print(text)

代码示例3:分词同时进行词性标注,各词性标签的详细含义可参考 tags.txt

import pkuseg

seg = pkuseg.pkuseg(postag=True)  # 开启词性标注功能
text = seg.cut('我爱北京天安门')    # 进行分词和词性标注
print(text)

代码示例4:对文件分词

import pkuseg

# 对input.txt的文件分词输出到output.txt中
# 开20个进程
pkuseg.test('input.txt', 'output.txt', nthread=20)     

其他使用示例可参见详细代码示例

参数说明

模型配置

pkuseg.pkuseg(model_name = "default", user_dict = "default", postag = False)
	model_name		模型路径。
			        "default",默认参数,表示使用我们预训练好的混合领域模型(仅对pip下载的用户)。
				"news", 使用新闻领域模型。
				"web", 使用网络领域模型。
				"medicine", 使用医药领域模型。
				"tourism", 使用旅游领域模型。
			        model_path, 从用户指定路径加载模型。
	user_dict		设置用户词典。
				"default", 默认参数,使用我们提供的词典。
				None, 不使用词典。
				dict_path, 在使用默认词典的同时会额外使用用户自定义词典,可以填自己的用户词典的路径,词典格式为一行一个词(如果选择进行词性标注并且已知该词的词性,则在该行写下词和词性,中间用tab字符隔开)。
	postag		        是否进行词性分析。
				False, 默认参数,只进行分词,不进行词性标注。
				True, 会在分词的同时进行词性标注。

对文件进行分词

pkuseg.test(readFile, outputFile, model_name = "default", user_dict = "default", postag = False, nthread = 10)
	readFile		输入文件路径。
	outputFile		输出文件路径。
	model_name		模型路径。同pkuseg.pkuseg
	user_dict		设置用户词典。同pkuseg.pkuseg
	postag			设置是否开启词性分析功能。同pkuseg.pkuseg
	nthread			测试时开的进程数。

模型训练

pkuseg.train(trainFile, testFile, savedir, train_iter = 20, init_model = None)
	trainFile		训练文件路径。
	testFile		测试文件路径。
	savedir			训练模型的保存路径。
	train_iter		训练轮数。
	init_model		初始化模型,默认为None表示使用默认初始化,用户可以填自己想要初始化的模型的路径如init_model='./models/'。

多进程分词

当将以上代码示例置于文件中运行时,如涉及多进程功能,请务必使用if __name__ == '__main__'保护全局语句,详见多进程分词

预训练模型

从pip安装的用户在使用细领域分词功能时,只需要设置model_name字段为对应的领域即可,会自动下载对应的细领域模型。

从github下载的用户则需要自己下载对应的预训练模型,并设置model_name字段为预训练模型路径。预训练模型可以在release部分下载。以下是对预训练模型的说明:

  • news: 在MSRA(新闻语料)上训练的模型。

  • web: 在微博(网络文本语料)上训练的模型。

  • medicine: 在医药领域上训练的模型。

  • tourism: 在旅游领域上训练的模型。

  • mixed: 混合数据集训练的通用模型。随pip包附带的是此模型。

我们还通过领域自适应的方法,利用维基百科的未标注数据实现了几个细领域预训练模型的自动构建以及通用模型的优化,这些模型目前仅可以在release中下载:

  • art: 在艺术与文化领域上训练的模型。

  • entertainment: 在娱乐与体育领域上训练的模型。

  • science: 在科学领域上训练的模型。

  • default_v2: 使用领域自适应方法得到的优化后的通用模型,相较于默认模型规模更大,但泛化性能更好。

欢迎更多用户可以分享自己训练好的细分领域模型。

版本历史

详见版本历史

开源协议

  1. 本代码采用MIT许可证。
  2. 欢迎对该工具包提出任何宝贵意见和建议,请发邮件至[email protected]

论文引用

该代码包主要基于以下科研论文,如使用了本工具,请引用以下论文:


@article{pkuseg,
  author = {Luo, Ruixuan and Xu, Jingjing and Zhang, Yi and Zhang, Zhiyuan and Ren, Xuancheng and Sun, Xu},
  journal = {CoRR},
  title = {PKUSEG: A Toolkit for Multi-Domain Chinese Word Segmentation.},
  url = {https://arxiv.org/abs/1906.11455},
  volume = {abs/1906.11455},
  year = 2019
}

其他相关论文

  • Xu Sun, Houfeng Wang, Wenjie Li. Fast Online Training with Frequency-Adaptive Learning Rates for Chinese Word Segmentation and New Word Detection. ACL. 2012.
  • Jingjing Xu and Xu Sun. Dependency-based gated recursive neural network for chinese word segmentation. ACL. 2016.
  • Jingjing Xu and Xu Sun. Transfer learning for low-resource chinese word segmentation with a novel neural network. NLPCC. 2017.

常见问题及解答

  1. 为什么要发布pkuseg?
  2. pkuseg使用了哪些技术?
  3. 无法使用多进程分词和训练功能,提示RuntimeError和BrokenPipeError。
  4. 是如何跟其它工具包在细领域数据上进行比较的?
  5. 在黑盒测试集上进行比较的话,效果如何?
  6. 如果我不了解待分词语料的所属领域呢?
  7. 如何看待在一些特定样例上的分词结果?
  8. 关于运行速度问题?
  9. 关于多进程速度问题?

致谢

感谢俞士汶教授(北京大学计算语言所)与邱立坤博士提供的训练数据集!

作者

Ruixuan Luo (罗睿轩), Jingjing Xu(许晶晶), Xuancheng Ren(任宣丞), Yi Zhang(张艺), Zhiyuan Zhang(张之远), Bingzhen Wei(位冰镇), Xu Sun (孙栩)

北京大学 语言计算与机器学习研究组

More Repositories

1

SGM

Sequence Generation Model for Multi-label Classification (COLING 2018)
Python
432
star
2

Chinese-Literature-NER-RE-Dataset

A Discourse-Level Named Entity Recognition and Relation Extraction Dataset for Chinese Literature Text
406
star
3

Global-Encoding

Global Encoding for Abstractive Summarization (ACL 2018)
Python
275
star
4

Graph-to-seq-comment-generation

Code for the paper ``Coherent Comments Generation for Chinese Articles with a Graph-to-Sequence Model''
Python
175
star
5

SU4MLC

Code for the article "Semantic-Unit-Based Dilated Convolution for Multi-Label Text Classification" (EMNLP 2018)
Python
154
star
6

label-words-are-anchors

Repository for Label Words are Anchors: An Information Flow Perspective for Understanding In-Context Learning
Python
148
star
7

DPGAN

Diversity-Promoting Generative Adversarial Network for Generating Informative and Diversified Text (EMNLP2018)
Python
146
star
8

superAE

Code for "Autoencoder as Assistant Supervisor: Improving Text Representation for Chinese Social Media Text Summarization"
Python
137
star
9

AdaMod

Adaptive and Momental Bounds for Adaptive Learning Rate Methods.
Python
126
star
10

text-autoaugment

[EMNLP 2021] Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification
Python
125
star
11

livebot

LiveBot: Generating Live Video Comments Based on Visual and Textual Contexts (AAAI 2019)
Python
122
star
12

meProp

meProp: Sparsified Back Propagation for Accelerated Deep Learning (ICML 2017)
C#
110
star
13

Unpaired-Sentiment-Translation

Code for "Unpaired Sentiment-to-Sentiment Translation: A Cycled Reinforcement Learning Approach" (ACL 2018)
Python
109
star
14

WEAN

Code for "Query and Output: Generating Words by Querying Distributed Word Representations for Paraphrase Generation" (NAACL 2018)
Python
93
star
15

label-embedding-network

Label Embedding Network
Python
90
star
16

Prime

A simple module consistently outperforms self-attention and Transformer model on main NMT datasets with SoTA performance.
Python
86
star
17

AAPR

Automatic Academic Paper Rating: Data and Model (ACL 2018)
Python
72
star
18

Skeleton-Based-Generation-Model

Code for "A Skeleton-Based Model for Promoting Coherence Among Sentences in Narrative Story Generation" (EMNLP 2018)
Python
65
star
19

Explicit-Sparse-Transformer

code for Explicit Sparse Transformer
Python
57
star
20

SMAE

This is the code for "Learning Sentiment Memories for Sentiment Modification without Parallel Data".
Python
55
star
21

LancoSum

A toolkit for abstractive summarization, which is easy to implement the baseline and our proposed models, which can achieve the SOTA performance.
Python
50
star
22

Seq2Set

Code for the paper "A Deep Reinforced Sequence-to-Set Model for Multi-Label Classification"
Python
50
star
23

AMM

The code for "An Auto-Encoder Matching Model for Learning Utterance-Level Semantic Dependency in Dialogue Generation" (EMNLP 2018)
Python
48
star
24

AdaNorm

Code for "Understanding and Improving Layer Normalization"
Python
45
star
25

bag-of-words

Code for "Bag-of-Words as Target for Neural Machine Translation"
Python
45
star
26

well-classified-examples-are-underestimated

Code for the AAAI 2022 publication "Well-classified Examples are Underestimated in Classification with Deep Neural Networks"
Jupyter Notebook
42
star
27

SRB

Code for "Improving Semantic Relevance for Sequence-to-Sequence Learning of Chinese Social Media Text Summarization"
Python
41
star
28

DynamicKD

Code for EMNLP 2021 main conference paper "Dynamic Knowledge Distillation for Pre-trained Language Models"
Python
40
star
29

agent-backdoor-attacks

Code&Data for the paper "Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents" [NeurIPS 2024]
Python
39
star
30

simNet

Code for "simNet: Stepwise Image-Topic Merging Network for Generating Detailed and Comprehensive Image Captions" (EMNLP 2018)
Python
37
star
31

Embedding-Poisoning

Code for the paper "Be Careful about Poisoned Word Embeddings: Exploring the Vulnerability of the Embedding Layers in NLP Models" (NAACL-HLT 2021)
Python
36
star
32

IAIS

[ACL 2021] Learning Relation Alignment for Calibrated Cross-modal Retrieval
Python
30
star
33

codable-watermarking-for-llm

Repository for Towards Codable Watermarking for Large Language Models
Python
27
star
34

Chinese-Dependency-Treebank-with-Ellipsis

An Ellipsis-aware Chinese Dependency Treebank for Web Text
Python
26
star
35

DeconvDec

Code for "Deconvolution-Based Global Decoding for Neural Machine Translation" (COLING 2018).
Python
26
star
36

tcm_prescription_generation

Code for "Exploration on Generating Traditional Chinese Medicine Prescriptions from Symptoms with an End-to-End Approach"
Python
26
star
37

CGM

Code for IJCAI 2021 main conference paper "Long-term, Short-term and Sudden Event: Trading Volume Movement Prediction with Graph-based Multi-view Modeling"
Python
23
star
38

HSSC

Code for "A Hierarchical End-to-End Model for Jointly Improving Text Summarization and Sentiment Classification" (IJCAI 2018)
Python
23
star
39

RAP

Code for the paper "RAP: Robustness-Aware Perturbations for Defending against Backdoor Attacks on NLP Models" (EMNLP 2021)
Python
22
star
40

clip-openness

[ACL 2023] Delving into the Openness of CLIP
Python
22
star
41

SOS

Code for the paper "Rethinking Stealthiness of Backdoor Attack against NLP Models" (ACL-IJCNLP 2021)
Jupyter Notebook
21
star
42

CMAC

The dataset and code for the paper "Cross-Modal Commentator: Automatic Machine Commenting Based on Cross-Modal Information"
Python
20
star
43

MUKI

[Findings of EMNLP22] From Mimicking to Integrating: Knowledge Integration for Pre-Trained Language Models
Python
19
star
44

ChineseNER

Code for "Cross-Domain and Semi-Supervised Named Entity Recognition in Chinese Social Media: A Unified Model"
Python
18
star
45

meSimp

Codes for "Training Simplification and Model Simplification for Deep Learning: A Minimal Effort Back Propagation Method"
C#
18
star
46

Avg-Avg

[Findings of EMNLP 2022] Holistic Sentence Embeddings for Better Out-of-Distribution Detection
Python
18
star
47

Pivot

Code for "Key Fact as Pivot: A Two-Stage Model for Low Resource Table-to-Text Generation" (ACL 2019)
Python
17
star
48

LexicalAT

Codes for paper "LexicalAT: Lexical-Based Adversarial Reinforcement Training for Robust Sentiment Classification"
Python
16
star
49

RMSC

Data and code for paper "Review-Driven Multi-Label Music Style Classification by Exploiting Style Correlations"
Python
14
star
50

Decode-CRF

Conditional Random Fields with Decode-based Learning
C#
14
star
51

nndep

Transition-based Dependency Parser with neural networks and hybrid oracle
C#
13
star
52

SACT

Code for the article "Automatic Temperature Control for Neural Machine Translation" (EMNLP 2018)
Python
13
star
53

SAPO

C# code for "Towards Easier and Faster Sequence Labeling for Natural Language Processing: A Search-based Probabilistic Online Learning Framework (SAPO)" (Information Sciences)
C#
13
star
54

Augmented_Data_for_FST

The augmented data of the paper "Parallel Data Augmentation for Formality Style Transfer" (ACL 2020).
12
star
55

ACA4NMT

Code of a novel model for NMT
Python
11
star
56

DCKD

Code and data for Distributional Correlation–Aware Knowledge Distillation for Stock Trading Volume Prediction (ECML-PKDD 22)
Python
11
star
57

CascadeBERT

Code for CascadeBERT, Findings of EMNLP 2021
Python
11
star
58

FedMNMT

[Findings of ACL 2023] Communication Efficient Federated Learning for Multilingual Machine Translation with Adapter
Python
10
star
59

Multi-Order-LSTM

Code for "Does Higher Order LSTM Have Better Accuracy for Segmenting and Labeling Sequence Data?"
Python
9
star
60

SemPre

Towards Semantics-Enhanced Pre-Training: Can Lexicon Definitions Help Learning Sentence Meanings? (AAAI 2021)
Python
9
star
61

DAN

[Findings of EMNLP 2022] Expose Backdoors on the Way: A Feature-Based Efficient Defense against Textual Backdoor Attacks
Python
9
star
62

Early-Exit

Code for the paper: A Global Past-Future Early Exit Method for Accelerating Inference of Pre-trained Language Models.
Python
8
star
63

CVST

Code for paper "Knowledgeable Storyteller: A Commonsense-Driven Generative Model for Visual Storytelling"
7
star
64

Multi-Task-Learning

Online Multi-Task Learning Toolkit based on C#; code for "Large-Scale Personalized Human Activity Recognition using Online Multi-Task Learning" (TKDE)
C#
6
star
65

NLP_Code_Index

codes and papers from @lancopku
5
star
66

GKD

Python
5
star
67

CRF-ADF

CRF Toolkit based on C#; support ADF (Adaptive stochastic gradient Decent based on Feature-frequency information, ACL 2012)
C#
4
star
68

Sememe_prediction

Code for paper "Sememe Prediction: Learning Semantic Knowledge from Unstructured Textual Wiki Descriptions"
Python
3
star
69

LPVDN

Python code for paper - Learning Robust Representation for Clustering through Locality Preserving Variational Discriminative Network
Python
3
star
70

Attention-Augmentation

Python
2
star
71

GNOME

Code of the EACL 2023 Paper: Fine-Tuning Deteriorates General Textual Out-of-Distribution Detection by Distorting Task-Agnostic Features
1
star
72

MR-VPC

Towards Multimodal Video Paragraph Captioning Models Robust to Missing Modality
1
star