hankcs/ID-CNN-CWS

ID-CNN-CWS

Source code and corpora for the paper "Iterated Dilated Convolutions for Chinese Word Segmentation", published in the NNW journal.

It implements the following four models for Chinese word segmentation (CWS):

Bi-LSTM
Bi-LSTM-CRF
ID-CNN
ID-CNN-CRF
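
As a rough illustration of why stacking dilated convolutions is attractive for sequence labeling, the sketch below computes how many input characters one output position can see. The kernel width of 3 and the dilation rates 1, 2, 4 are illustrative assumptions, not settings taken from the paper or from this repository.

```python
def receptive_field(kernel_width, dilations):
    """Input positions visible to one output position after stacking
    stride-1 1-D convolutions with the given dilation rates."""
    field = 1
    for dilation in dilations:
        # Each layer widens the field by (kernel_width - 1) * dilation.
        field += (kernel_width - 1) * dilation
    return field


def iterated_receptive_field(kernel_width, dilations, iterations):
    """Receptive field when the same block of dilated layers is applied
    several times in a row (the "iterated" part of the model name)."""
    return receptive_field(kernel_width, list(dilations) * iterations)


if __name__ == "__main__":
    # Hypothetical configuration: width-3 kernels, dilations 1, 2, 4.
    print(receptive_field(3, [1, 2, 4]))              # dilated stack
    print(receptive_field(3, [1, 1, 1]))              # same depth, no dilation
    print(iterated_receptive_field(3, [1, 2, 4], 2))  # block applied twice
```

With these assumed values, three dilated layers cover 15 characters while three ordinary layers cover only 7, which is the general motivation for dilation: wider context at the same depth.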

Dependencies

Python >= 3.6
TensorFlow >= 1.2

Both CPU and GPU are supported. Training on a GPU is 10 times faster than on a CPU.

Preparation

Run the following script to convert the corpora into TensorFlow datasets.

$ ./scripts/make.sh
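
The README does not describe what the conversion produces. CWS is commonly cast as per-character tagging, so the sketch below shows one plausible first step: turning a whitespace-segmented line into character/tag pairs. The BMES tag set and the function names are assumptions for illustration; the actual script may use a different scheme.

```python
def words_to_bmes(words):
    """Map words to (character, tag) pairs using the BMES scheme:
    B = begin, M = middle, E = end of a multi-character word,
    S = single-character word."""
    pairs = []
    for word in words:
        if len(word) == 1:
            pairs.append((word, "S"))
            continue
        last = len(word) - 1
        for i, ch in enumerate(word):
            if i == 0:
                tag = "B"
            elif i == last:
                tag = "E"
            else:
                tag = "M"
            pairs.append((ch, tag))
    return pairs


def line_to_bmes(line):
    """Tag one training line whose words are separated by whitespace."""
    return words_to_bmes(line.split())


if __name__ == "__main__":
    for ch, tag in line_to_bmes("我 喜欢 自然语言"):
        print(ch, tag)
```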

Train and Test

Quick Start

$ ./scripts/run.sh $dataset $model

$dataset can be pku, msr, asSC or cityuSC.
$model can be cnn or bilstm.

For example:

$ ./scripts/run.sh pku cnn

This trains a cnn model on the pku dataset, then evaluates it on the test set.
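
If you want to launch several runs from Python, a thin wrapper along these lines may help. It only validates the two arguments against the values listed above and shells out to the script; `build_command` and `run` are hypothetical helper names, not part of this repository.

```python
import subprocess

# Values listed in this README for $dataset and $model.
DATASETS = ("pku", "msr", "asSC", "cityuSC")
MODELS = ("cnn", "bilstm")


def build_command(dataset, model, *extra_args):
    """Return the argument list for ./scripts/run.sh, rejecting
    dataset or model names the README does not list."""
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset: {dataset!r}")
    if model not in MODELS:
        raise ValueError(f"unknown model: {model!r}")
    return ["./scripts/run.sh", dataset, model, *extra_args]


def run(dataset, model, *extra_args):
    """Train and evaluate one configuration; raises if the script fails."""
    subprocess.run(build_command(dataset, model, *extra_args), check=True)


if __name__ == "__main__":
    # Print the commands instead of running them; call run(...) from the
    # repository root to actually start training.
    for dataset in DATASETS:
        print(" ".join(build_command(dataset, "cnn")))
```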

CRF Layer

To enable the CRF layer, append --viterbi to your command, e.g.

$ ./scripts/run.sh pku cnn --viterbi
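
The flag name suggests that tags are decoded with the Viterbi algorithm when the CRF layer is on. The sketch below is a generic pure-Python Viterbi decoder over per-character tag scores and a tag transition matrix. It is not the repository's implementation, and the scores in the example are made up.

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring sequence of tag indices.

    emissions[t][k]   -- score of tag k at position t
    transitions[j][k] -- score of moving from tag j to tag k
    """
    if not emissions:
        return []
    num_tags = len(emissions[0])
    score = list(emissions[0])  # best score of any path ending in each tag
    backpointers = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for k in range(num_tags):
            # Best previous tag for ending in tag k at position t.
            best_prev = max(range(num_tags),
                            key=lambda j: score[j] + transitions[j][k])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][k]
                             + emissions[t][k])
        score = new_score
        backpointers.append(pointers)
    # Follow the backpointers from the best final tag.
    best = max(range(num_tags), key=lambda k: score[k])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return path


if __name__ == "__main__":
    emissions = [[1, 0], [0, 1], [1, 0]]   # made-up scores, two tags
    no_penalty = [[0, 0], [0, 0]]
    switch_penalty = [[0, -5], [-5, 0]]    # discourage changing tags
    print(viterbi_decode(emissions, no_penalty))
    print(viterbi_decode(emissions, switch_penalty))
```

The second call shows what the transition scores add over per-character argmax: a strong enough penalty on switching tags overrides the locally best tag in the middle position.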

Accuracy

Speed

Acknowledgments

Corpora are from SIGHAN05, converted to Simplified Chinese via HanLP. Note that the SIGHAN datasets should be used for research purposes only.
Model implementations are adapted from https://github.com/iesl/dilated-cnn-ner by Emma Strubell.
