• Stars
    star
    325
  • Rank 125,310 (Top 3 %)
  • Language
    Java
  • License
    MIT License
  • Created about 8 years ago
  • Updated over 6 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

An Efficient Lexical Analyzer for Chinese

THULAC:一个高效的中文词法分析工具包

本文档只针对THULAC的java版本,其他版本的使用方式请查阅对应的README文件。

目录

项目介绍

THULAC (THU Lexical Analyzer for Chinese) 是由清华大学自然语言处理与社会人文计算实验室研制推出的一套中文词法分析工具包,具有中文分词和词性标注功能。THULAC具有如下几个特点:

  1. 能力强。利用我们集成的目前世界上规模最大的人工分词和词性标注中文语料库(约含5800万字)训练而成,模型标注能力强大。
  2. 准确率高。该工具包在标准数据集Chinese Treebank (CTB5) 上分词的F1值可达97.3%,词性标注的F1值可达到92.9%,与该数据集上最好方法效果相当。
  3. 速度较快。同时进行分词和词性标注速度为300KB/s,每秒可处理约15万字。只进行分词速度可达到1.3MB/s。(该数据取自本库的c++版本,java版本可能速度略慢)

编译和安装

  • 可执行jar包

本库正在持续开发中,请参阅下文自行编译运行。

  • 下载源代码编译运行

下载编译本库需要计算机上已安装java, gitgradle,以及稳定可靠的网络连接。 之后,运行命令行:

git clone https://github.com/thunlp/THULAC-Java.git

执行完毕后,运行命令行:

gradle check

如果控制台上打出BUILD SUCCESSFUL字样则说明编译成功。

使用方式

1. 分词和词性标注程序

1.1. 命令格式

从命令行输入输出:
java -jar THULAC_lite_java_run.jar [-t2s] [-seg_only] [-deli delimiter] [-user userdict.txt]
从文本文件(UTF-8编码)输入输出:
java -jar THULAC_lite_java_run.jar [-t2s] [-seg_only] [-deli delimiter] [-user userdict.txt] -input input_file -output output_file

1.2. 命令参数

参数名称 含义
-t2s 将句子从繁体转化为简体
-seg_only 只进行分词,不进行词性标注
-deli delimiter 将词与词性间的分隔符设置为delimiter,默认为下划线_
-filter 使用过滤器去除一些没有意义的词语,例如“可以”。
-user userdict.txt 设置用户词典为userdict.txt,词典中的词会被打上uw标签。词典中每一个词一行,UTF8编码
-model_dir dir 设置模型文件所在文件夹为dir,默认为models/
-input input_file 设置输入文件为input_file,默认为命令行输入
-output output_file 设置输出文件为output_file,默认为命令行输出

2. 获取模型

THULAC需要分词和词性标注模型的支持,获取下载好的模型用户可以登录thulac.thunlp.org网站填写个人信息进行下载,并放到THULAC的根目录即可,或者使用参数-model_dir dir指定模型的位置。

代表分词软件的性能对比

我们选择LTP、ICTCLAS、结巴分词等国内代表分词软件与THULAC做性能比较。我们选择Windows作为测试环境,根据第二届国际汉语分词测评发布的国际中文分词测评标准,对不同软件进行了速度和准确率测试。

在第二届国际汉语分词测评中,共有四家单位提供的测试语料 (Academia Sinica, City University, Peking University, Microsoft Research), 在评测提供的资源icwb2-data中包含了来自这四家单位的训练集(training)、测试集(testing), 以及根据各自分词标准而提供的相应测试集的标准答案 (icwb2-data/scripts/gold).在icwb2-data/scripts目录下含有对分词进行自动评分的perl脚本score。

我们在统一测试环境下,对若干流行分词软件和THULAC进行了测试,使用的模型为各分词软件自带模型。THULAC使用的是随软件提供的简单模型Model_1。评测环境为 Intel Core i5 2.4 GHz 评测结果如下:

msr_test(560KB)

Algorithm Time Precision Recall
LTP-3.2.0 3.21s 0.867 0.896
ICTCLAS(2015版) 0.55s 0.869 0.914
jieba 0.26s 0.814 0.809
THULAC 0.62s 0.877 0.899

pku_test(510KB)

Algorithm Time Precision Recall
LTP-3.2.0 3.83s 0.960 0.947
ICTCLAS(2015版) 0.53s 0.939 0.944
jieba 0.23s 0.850 0.784
THULAC 0.51s 0.944 0.908

除了以上在标准测试集上的评测,我们也对各个分词工具在大数据上的速度进行了评测,结果如下:

CNKI_journal.txt(51 MB)

Algorithm Time Speed
LTP-3.2.0 348.624s 149.80KB/s
ICTCLAS(2015版) 106.461s 490.59KB/s
jieba 22.5583s 2314.89KB/s
THULAC 42.625s 1221.05KB/s

词性解释

n/名词 np/人名 ns/地名 ni/机构名 nz/其它专名
m/数词 q/量词 mq/数量词 t/时间词 f/方位词 s/处所词
v/动词 vm/能愿动词 vd/趋向动词 a/形容词 d/副词
h/前接成分 k/后接成分 i/习语 j/简称
r/代词 c/连词 p/介词 u/助词 y/语气助词
e/叹词 o/拟声词 g/语素 w/标点 x/其它

THULAC模型介绍

  1. 我们随THULAC源代码附带了简单的分词模型Model_1,仅支持分词功能。该模型由人民日报分词语料库训练得到。
  2. 我们随THULAC源代码附带了分词和词性标注联合模型Model_2,支持同时分词和词性标注功能。该模型由人民日报分词和词性标注语料库训练得到。
  3. 我们还提供更复杂、完善和精确的分词和词性标注联合模型Model_3和分词词表。该模型是由多语料联合训练训练得到(语料包括来自多文体的标注文本和人民日报标注文本等)。由于模型较大,如有机构或个人需要,请填写“doc/资源申请表.doc”,并发送至 [email protected],通过审核后我们会将相关资源发送给联系人。

注意事项

该工具目前仅处理UTF8编码中文文本,之后会逐渐增加支持其他编码的功能,敬请期待。

其他语言实现

更新历史

更新时间 更新内容
2016-09-29 增加THULAC分词so版本。
2016-03-31 增加THULAC分词python版本。
2016-01-20 增加THULAC分词Java版本。
2016-01-10 开源THULAC分词工具C++版本。

开源协议

  1. THULAC面向国内外大学、研究所、企业以及个人用于研究目的免费开放源代码。
  2. 如有机构或个人拟将THULAC用于商业目的,请发邮件至[email protected]洽谈技术许可协议。
  3. 欢迎对该工具包提出任何宝贵意见和建议。请发邮件至[email protected]
  4. 如果您在THULAC基础上发表论文或取得科研成果,请您在发表论文和申报成果时声明“使用了清华大学THULAC”,并按如下格式引用:

中文:

孙茂松, 陈新雄, 张开旭, 郭志芃, 刘知远. THULAC:一个高效的中文词法分析工具包. 2016.

英文:

Maosong Sun, Xinxiong Chen, Kaixu Zhang, Zhipeng Guo, Zhiyuan Liu. THULAC: An Efficient Lexical Analyzer for Chinese. 2016.

相关论文

  • Zhongguo Li, Maosong Sun. Punctuation as Implicit Annotations for Chinese Word Segmentation. Computational Linguistics, vol. 35, no. 4, pp. 505-512, 2009.

作者

Maosong Sun (孙茂松,导师), Xinxiong Chen(陈新雄,博士生), Kaixu Zhang (张开旭,硕士生), Zhipeng Guo(郭志芃,本科生), Zhiyuan Liu(刘知远,助理教授).

致谢

std4453:对此java版本的代码进行优化和添加注释。

More Repositories

1

GNNPapers

Must-read papers on graph neural networks (GNN)
15,490
star
2

WantWords

An open-source online reverse dictionary.
JavaScript
6,933
star
3

OpenNRE

An Open-Source Package for Neural Relation Extraction (NRE)
Python
4,232
star
4

OpenPrompt

An Open-Source Framework for Prompt-Learning.
Python
4,145
star
5

PromptPapers

Must-read papers on prompt-based tuning for pre-trained language models.
3,912
star
6

OpenKE

An Open-Source Package for Knowledge Embedding (KE)
Python
3,725
star
7

PLMpapers

Must-read Papers on pre-trained language models.
3,161
star
8

NRLPapers

Must-read papers on network representation learning (NRL) / network embedding (NE)
TeX
2,520
star
9

UltraChat

Large-scale, Informative, and Diverse Multi-round Chat Data (and Models)
Python
2,118
star
10

THULAC-Python

An Efficient Lexical Analyzer for Chinese
Python
1,972
star
11

OpenNE

An Open-Source Package for Network Embedding (NE)
Python
1,672
star
12

KRLPapers

Must-read papers on knowledge representation learning (KRL) / knowledge embedding (KE)
TeX
1,528
star
13

TAADpapers

Must-read Papers on Textual Adversarial Attack and Defense
Python
1,459
star
14

ERNIE

Source code and dataset for ACL 2019 paper "ERNIE: Enhanced Language Representation with Informative Entities"
Python
1,403
star
15

KB2E

Knowledge Graph Embeddings including TransE, TransH, TransR and PTransE
C++
1,360
star
16

NREPapers

Must-read papers on neural relation extraction (NRE)
TeX
1,023
star
17

OpenCLaP

Open Chinese Language Pre-trained Model Zoo
971
star
18

WebCPM

Official codes for ACL 2023 paper "WebCPM: Interactive Web Search for Chinese Long-form Question Answering"
HTML
952
star
19

OpenDelta

A plug-and-play library for parameter-efficient-tuning (Delta Tuning)
Python
938
star
20

RCPapers

Must-read papers on Machine Reading Comprehension
890
star
21

NRE

Neural Relation Extraction, including CNN, PCNN, CNN+ATT, PCNN+ATT
C++
812
star
22

ToolLearningPapers

777
star
23

THULAC

An Efficient Lexical Analyzer for Chinese
C++
772
star
24

FewRel

A Large-Scale Few-Shot Relation Extraction Dataset
Python
716
star
25

THUOCL

THUOCL(THU Open Chinese Lexicon)中文词库
697
star
26

Chinese_Rumor_Dataset

中文谣言数据
672
star
27

OpenAttack

An Open-Source Package for Textual Adversarial Attack.
Python
652
star
28

DocRED

Dataset and codes for ACL 2019 DocRED: A Large-Scale Document-Level Relation Extraction Dataset.
Python
605
star
29

OpenHowNet

Core Data of HowNet and OpenHowNet Python API
Python
592
star
30

TensorFlow-TransX

An implementation of TransE and its extended models for Knowledge Representation Learning on TensorFlow
Python
511
star
31

LegalPapers

Must-read Papers on Legal Intelligence
450
star
32

OpenMatch

An Open-Source Package for Information Retrieval.
Python
444
star
33

CAIL

Chinese AI & Law Challenge
439
star
34

BERT-KPE

Python
437
star
35

Fast-TransX

An Efficient implementation of TransE and its extended models for Knowledge Representation Learning
C++
396
star
36

TensorFlow-Summarization

Python
390
star
37

Few-NERD

Code and data of ACL 2021 paper "Few-NERD: A Few-shot Named Entity Recognition Dataset"
Python
376
star
38

SOS4NLP

Survey of Surveys for Natural Language Processing (SOS4NLP)
327
star
39

NSC

Neural Sentiment Classification
Python
287
star
40

BMCourse

The repo for Tsinghua summer course: Interdisciplinary Seminar on Big Models
Python
269
star
41

Chinese_NRE

Source code for ACL 2019 paper "Chinese Relation Extraction with Multi-Grained Information and External Linguistic Knowledge"
Python
264
star
42

DeltaPapers

Must-read Papers of Parameter-Efficient Tuning (Delta Tuning) Methods on Pre-trained Models.
259
star
43

PL-Marker

Source code for "Packed Levitated Marker for Entity and Relation Extraction"
Python
252
star
44

SE-WRL

Improved Word Representation Learning with Sememes
C
197
star
45

THUCTC

An Efficient Chinese Text Classifier
Java
196
star
46

InfLLM

The code of our paper "InfLLM: Unveiling the Intrinsic Capacity of LLMs for Understanding Extremely Long Sequences with Training-Free Memory"
Python
196
star
47

SCPapers

Must-read Papers on Sememe Computation
193
star
48

KnowledgeablePromptTuning

kpt code
Python
192
star
49

CANE

Source code and datasets of "CANE: Context-Aware Network Embedding for Relation Modeling"
Python
190
star
50

JointNRE

Joint Neural Relation Extraction with Text and KGs
Python
185
star
51

HATT-Proto

Code and dataset of AAAI2019 paper Hybrid Attention-Based Prototypical Networks for Noisy Few-Shot Relation Classification
Python
180
star
52

LLaVA-UHD

LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images
Python
169
star
53

NLP-THU

NLP Course Material & QA
164
star
54

KernelGAT

The source codes for Fine-grained Fact Verification with Kernel Graph Attention Network.
Python
161
star
55

LegalPLMs

Source code and checkpoints for legal pre-trained language models.
Python
158
star
56

EntityDuetNeuralRanking

Entity-Duet Neural Ranking Model
Python
153
star
57

PTR

Prompt Tuning with Rules
Python
151
star
58

OOP-THU

OOP Course Material & QA
149
star
59

Auto_CLIWC

Code for Chinese LIWC Lexicon Expansion via Hierarchical Classification of Word Embeddings with Sememe Attention (AAAI18)
Python
136
star
60

OpenBackdoor

An open-source toolkit for textual backdoor attack and defense (NeurIPS 2022 D&B, Spotlight)
Python
135
star
61

attribute_charge

The source code of our COLING'18 paper "Few-Shot Charge Prediction with Discriminative Legal Attributes".
Python
126
star
62

ConceptFlow

Python
119
star
63

THUCKE

THU Chinese Keyphrase Extraction Toolkit
C++
118
star
64

CAIL2018

Python
111
star
65

KR-EAR

Knowledge Representation Learning with Entities, Attributes and Relations
C++
111
star
66

Neural-Snowball

Code and dataset of AAAI2020 Paper Neural Snowball for Few-Shot Relation Learning
Python
111
star
67

ChatEval

Codes for our paper "ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate"
Python
109
star
68

MultiRD

Code and data of the AAAI-20 paper "Multi-channel Reverse Dictionary Model"
Python
106
star
69

TransNet

Source code and datasets of IJCAI2017 paper "TransNet: Translation-Based Network Representation Learning for Social Relation Extraction".
Jupyter Notebook
103
star
70

RE-Context-or-Names

Bert-based models(BERT, MTB, CP) for relation extraction.
Python
100
star
71

AGE

Source code and dataset for KDD 2020 paper "Adaptive Graph Encoder for Attributed Graph Embedding"
Python
99
star
72

GEAR

Source code for ACL 2019 paper "GEAR: Graph-based Evidence Aggregating and Reasoning for Fact Verification"
Python
95
star
73

HNRE

Hierarchical Neural Relation Extraction
Python
95
star
74

LEVEN

Source code and dataset for ACL2022 Findings Paper "LEVEN: A Large-Scale Chinese Legal Event Detection dataset"
Python
94
star
75

TopJudge

Python
93
star
76

Prompt-Transferability

On Transferability of Prompt Tuning for Natural Language Processing
Python
85
star
77

SememePSO-Attack

Code and data of the ACL 2020 paper "Word-level Textual Adversarial Attacking as Combinatorial Optimization"
Python
85
star
78

XQA

Dataset and baseline for ACL 2019 paper "XQA: A Cross-lingual Open-domain Question Answering Dataset"
Python
84
star
79

HMEAE

Source code for EMNLP-IJCNLP 2019 paper "HMEAE: Hierarchical Modular Event Argument Extraction".
Python
84
star
80

ERICA

Source code for ACL 2021 paper "ERICA: Improving Entity and Relation Understanding for Pre-trained Language Models via Contrastive Learning"
Python
82
star
81

CLAIM

77
star
82

TKRL

Representation Learning of Knowledge Graphs with Hierarchical Types (IJCAI-2016)
C++
76
star
83

TLNN

Source code for EMNLP-IJCNLP 2019 paper "Event Detection with Trigger-Aware Lattice Neural Network".
Python
75
star
84

MMDW

Max-margin DeepWalk
Java
71
star
85

KV-PLM

Source code for "A Deep-learning System Bridging Molecule Structure and Biomedical Text with Comprehension Comparable to Human Professionals"
Python
71
star
86

KNET

Neural Entity Typing with Knowledge Attention
Python
69
star
87

SelectiveMasking

Source code for "Train No Evil: Selective Masking for Task-Guided Pre-Training"
Python
68
star
88

NeuIRPapers

Must-read Papers on Neural Information Retrieval
68
star
89

MoEfication

Python
66
star
90

Adv-ED

Source code and dataset for NAACL 2019 paper "Adversarial Training for Weakly Supervised Event Detection".
Python
66
star
91

CorefBERT

Source code for EMNLP 2020 paper "Coreferential Reasoning Learning for Language Representation"
Python
65
star
92

ConversationQueryRewriter

Code and Data for SIGIR 2020 Paper "Few-Shot Generative Conversational Query Rewriting"
Roff
63
star
93

MuGNN

Source code for ACL2019 paper "Multi-Channel Graph Neural Network for Entity Alignment".
Python
62
star
94

sememe_prediction

Codes for Lexical Sememe Prediction via Word Embeddings and Matrix Factorization (IJCAI 2017).
Python
60
star
95

DIAG-NRE

Source code for ACL 2019 paper "DIAG-NRE: A Neural Pattern Diagnosis Framework for Distantly Supervised Neural Relation Extraction".
Python
59
star
96

topical_word_embeddings

Topical Word Embeddings
Python
57
star
97

QuoteR

Official code and data of the ACL 2022 paper "QuoteR: A Benchmark of Quote Recommendation for Writing"
Python
57
star
98

paragraph2vec

Paragraph Vector Implementation
Python
56
star
99

DKRL

Representation Learning of Knowledge Graphs with Entity Descriptions (AAAI-2016)
C++
54
star
100

Ouroboros

Ouroboros: Speculative Decoding with Large Model Enhanced Drafting
Python
51
star