  • Stars: 155
  • Rank: 240,864 (Top 5%)
  • Language: Python
  • License: Apache License 2.0
  • Created: almost 6 years ago
  • Updated: almost 6 years ago

Repository Details

Sequence labeling based on Universal Transformer (Transformer encoder) and CRF; Chinese word segmentation and part-of-speech tagging based on Universal Transformer + CRF

transformer-word-segmenter

Chinese version

This is a sequence labelling model based on Universal Transformer (encoder) + CRF, which can be used for word segmentation.

Install

Just run setup.sh to install.

Usage

You can simply use the factory method get_or_create to obtain a model.

from tf_segmenter import get_or_create, TFSegmenter

if __name__ == '__main__':
    segmenter: TFSegmenter = get_or_create("../data/default-config.json",
                                           src_dict_path="../data/src_dict.json",
                                           tgt_dict_path="../data/tgt_dict.json",
                                           weights_path="../models/weights.129-0.00.h5")

It accepts four parameters:

  • config: path to the model configuration file.
  • src_dict_path: path to the dictionary file for input texts.
  • tgt_dict_path: path to the dictionary file for tags.
  • weights_path: path to the model weights file.

Then call decode_texts to segment sentences.

texts = [
    "巴纳德星的名字起源于一百多年前一位名叫爱德华·爱默生·巴纳德的天文学家。"
    "他发现有一颗星在夜空中划过的速度很快,这引起了他极大的注意。",
    "印度尼西亚国家抗灾署此前发布消息证实,印尼巽他海峡附近的万丹省当地时间22号晚遭海啸袭击。"
]

for sent, tag in segmenter.decode_texts(texts):
    print(sent)
    print(tag)

Results:

['巴纳德', '星', '的', '名字', '起源于', '一百', '多年前', '一位', '名叫', '爱德华·爱默生·巴纳德', '的', '天文学家', '。', '他', '发现', '有', '一颗', '星', '在', '夜空', '中', '划过', '的', '速度', '很快', ',', '这', '引起', '了', '他', '极大', '的', '注意', '。']
['nrf', 'n', 'ude1', 'n', 'v', 'm', 'd', 'mq', 'v', 'nrf', 'ude1', 'nnd', 'w', 'rr', 'v', 'vyou', 'mq', 'n', 'p', 'n', 'f', 'v', 'ude1', 'n', 'd', 'w', 'rzv', 'v', 'ule', 'rr', 'a', 'ude1', 'vn', 'w']

['印度尼西亚国家抗灾署', '此前', '发布', '消息', '证实', ',', '印尼巽他海峡', '附近', '的', '万丹省', '当地时间', '22号', '晚', '遭', '海啸', '袭击', '。']
['nt', 't', 'v', 'n', 'v', 'w', 'ns', 'f', 'ude1', 'ns', 'nz', 'mq', 'tg', 'v', 'n', 'vn', 'w']

It can also identify people, organizations, and places, such as 印度尼西亚国家抗灾署 (organization) and 万丹省 (place).

Config, weights, and dictionaries can be downloaded from:

https://pan.baidu.com/s/1iHADmnSEywoVqq_-nb0bOA password: v34g

Dataset Processing

Dataset (Baidu): https://pan.baidu.com/s/1EtXdhPR0lGF8c7tT8epn6Q password: yj9j

Convert dataset format

The dataset's original format, shown below, is not what we need:

嫌疑人\n 赵国军\nr 。\w

We convert it with the following command:

python ner_data_preprocess.py <src_dir> 2014_processed -c True

Here <src_dir> is the training dataset directory, e.g. ./2014-people/train.

Now the data in 2014_processed looks like this:

嫌 疑 人 赵 国 军 。 B-N I-N I-N B-NR I-NR I-NR S-W
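
Conceptually, the conversion splits each tagged word into characters and assigns B-/I- prefixes to its characters (or S- for single-character words). Below is a minimal sketch of that idea, using a hypothetical helper that is not the repository's ner_data_preprocess.py:

def to_char_tags(pairs):
    """Convert (word, tag) pairs into character-level B/I/S tags."""
    chars, tags = [], []
    for word, tag in pairs:
        tag = tag.upper()
        if len(word) == 1:
            chars.append(word)
            tags.append("S-" + tag)
        else:
            for i, ch in enumerate(word):
                chars.append(ch)
                tags.append(("B-" if i == 0 else "I-") + tag)
    return chars, tags

chars, tags = to_char_tags([("嫌疑人", "n"), ("赵国军", "nr"), ("。", "w")])
print(" ".join(chars))  # 嫌 疑 人 赵 国 军 。
print(" ".join(tags))   # B-N I-N I-N B-NR I-NR I-NR S-W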

Make dictionaries

After the data format is converted, build the dictionaries:

python tools/make_dicts.py 2014_processed -s src_dict.json -t tgt_dict.json

This generates two files:

  • src_dict.json
  • tgt_dict.json
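
The two files map characters and tags to integer ids. Their exact layout is determined by tools/make_dicts.py; the sketch below only illustrates the idea, and the reserved <PAD>/<UNK> entries are an assumption rather than the repository's actual convention:

import json

def build_dict(tokens, path):
    """Assign an integer id to every distinct token and save the mapping as JSON."""
    vocab = {"<PAD>": 0, "<UNK>": 1}  # reserved ids, assumed here
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    with open(path, "w", encoding="utf-8") as f:
        json.dump(vocab, f, ensure_ascii=False)

# Characters from 2014_processed go into src_dict.json, tags into tgt_dict.json.
build_dict(["嫌", "疑", "人", "赵", "国", "军", "。"], "src_dict.json")
build_dict(["B-N", "I-N", "B-NR", "I-NR", "S-W"], "tgt_dict.json")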

Convert to hdf5

To speed things up, you can convert the plain-text 2014_processed file to an HDF5 file:

python tools/convert_to_h5.py 2014_processed 2014_processed.h5 -s src_dict.json -t tgt_dict.json
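
The HDF5 step simply stores pre-encoded, fixed-length id arrays so they can be read quickly during training. Here is a rough h5py sketch of that idea; the dataset names and padding scheme are assumptions, not necessarily what convert_to_h5.py produces:

import json
import h5py
import numpy as np

def encode(tokens, vocab, max_len=150):
    """Map tokens to ids and pad/truncate to a fixed length (max_seq_len=150)."""
    ids = [vocab.get(t, vocab.get("<UNK>", 1)) for t in tokens][:max_len]
    return ids + [0] * (max_len - len(ids))

with open("src_dict.json", encoding="utf-8") as f:
    src_dict = json.load(f)
with open("tgt_dict.json", encoding="utf-8") as f:
    tgt_dict = json.load(f)

# One example sentence; in practice every line of 2014_processed is encoded.
chars = ["嫌", "疑", "人", "赵", "国", "军", "。"]
tags = ["B-N", "I-N", "I-N", "B-NR", "I-NR", "I-NR", "S-W"]

with h5py.File("2014_processed.h5", "w") as h5:
    h5.create_dataset("src", data=np.array([encode(chars, src_dict)], dtype=np.int32))
    h5.create_dataset("tgt", data=np.array([encode(tags, tgt_dict)], dtype=np.int32))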

Training Result

The configuration used is as follows:

{
    "src_vocab_size": 5649,
    "tgt_vocab_size": 301,
    "max_seq_len": 150,
    "max_depth": 2,
    "model_dim": 256,
    "embedding_size_word": 300,
    "embedding_dropout": 0.0,
    "residual_dropout": 0.1,
    "attention_dropout": 0.1,
    "output_dropout": 0.0,
    "l2_reg_penalty": 1e-6,
    "confidence_penalty_weight": 0.1,
    "compression_window_size": null,
    "num_heads": 2,
    "use_crf": true
}

And with the following training parameters:

param            value
batch_size       32
steps_per_epoch  2000
validation_steps 50
warmup           6000

The data are split into a training set and a validation set at a ratio of 8:2.

See examples/train_example.py for more details.
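
As a rough sketch of how these parameters are typically wired into Keras training; model, train_gen, and val_gen are placeholders for the objects built by examples/train_example.py, and this is not the repository's actual code:

def train(model, train_gen, val_gen):
    """Hypothetical wiring of the parameters above into Keras training."""
    model.fit_generator(
        train_gen,                # yields (char id, tag id) batches of batch_size=32
        steps_per_epoch=2000,
        epochs=50,
        validation_data=val_gen,  # the 20% validation split
        validation_steps=50,
    )
    # warmup=6000 refers to learning-rate warm-up steps of the transformer
    # schedule and is configured separately in the real script.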

After 50 epochs, accuracy on the validation set reached 98%. Convergence time is roughly the same as for BiLSTM+CRF, but the number of parameters is reduced by about 200,000.

Test set (2014-people/test) evaluation results for word segmentation:

result (epoch 50):
Num of words: 20744, accuracy rate: 0.958639, error rate: 0.046712
Num of lines: 317, accuracy rate: 0.406940, error rate: 0.593060
Recall: 0.958639
Precision: 0.953536
F measure: 0.956081
Error rate: 0.046712
====================================
result (epoch 86):
Num of words: 20744, accuracy rate: 0.962784, error rate: 0.039240
Num of lines: 317, accuracy rate: 0.454259, error rate: 0.545741
Recall: 0.962784
Precision: 0.960839
F measure: 0.961811
Error rate: 0.039240

References

  1. Universal Transformer https://github.com/GlassyWing/keras-transformer
  2. Transformer https://github.com/GlassyWing/transformer-keras

More Repositories

  1. bi-lstm-crf: Chinese word segmentation and POS tagging based on Bi-LSTM + CRF, implemented with Keras (Python, 375 stars)
  2. text-detection-ocr: Chinese text detection and recognition based on CTPN + DenseNet, using Keras and TensorFlow (Python, 284 stars)
  3. nvae: An unofficial toy implementation of NVAE, "A Deep Hierarchical Variational Autoencoder" (Python, 108 stars)
  4. fourier-feature-networks: An unofficial PyTorch implementation of "Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains" (Python, 65 stars)
  5. yolo_deepsort: Fast MOT based on YOLO + DeepSORT; supports YOLOv3 and YOLOv4 (Python, 53 stars)
  6. transformer-keras: Implementation of the Transformer model from "Attention Is All You Need" using Keras + TensorFlow (Python, 33 stars)
  7. stomp_ws_py: STOMP over WebSocket for Python (Python, 20 stars)
  8. better-jieba: A better Java version of jieba (Java, 18 stars)
  9. ode_rnn: Fitting differential operators with RNNs (Python, 15 stars)
  10. keras_dataloader: DataLoader for Keras (Python, 12 stars)
  11. ogan-torch: An unofficial PyTorch implementation of OGAN (Python, 9 stars)
  12. graph_gru: Reconstructed GRU used to process graph sequences (Python, 8 stars)
  13. stable-net: Unofficial PyTorch implementation of StableNet (Python, 8 stars)
  14. TorchDiffusion: A diffusion model implementation based on LibTorch (C++, 8 stars)
  15. gon_emb: Using Fourier feature mapping to strengthen the performance of GON ("Gradient Origin Networks") (Python, 7 stars)
  16. searching-recommend: Component retrieval and recommendation system based on Solr and collaborative filtering (Java, 5 stars)
  17. nn_precipitation_forecast: Research on 1-60 day precipitation forecasting based on neural networks (4 stars)
  18. components-recommend: ItemCF-based component recommendation model for items with sequential relationships (Scala, 3 stars)
  19. OnLSTM-torch: ON-LSTM implemented in PyTorch (Python, 2 stars)
  20. learn-spring: A Maven project containing source code for learning Spring (Java, 2 stars)
  21. tobacco_diseases_example: Usage examples for a tobacco disease prediction tool (Python, 1 star)
  22. nn_book_answers: Exercises from "Neural Networks and Deep Learning" (1 star)
  23. stock_crawler: Stock quote crawler (Python, 1 star)
  24. DailyPaper: A news reading client based on Material Design (Java, 1 star)
  25. yolo3_torch: YOLOv3 video detection (Python, 1 star)
  26. srapp: Web application for the Solr-based component retrieval and recommendation system (Vue, 1 star)
  27. tokenizer: Chinese word segmentation based on a Bi-LSTM + CRF model, implemented with ND4J (Java, 1 star)
  28. GlassyWing.github.io: Personal blog publishing page (HTML, 1 star)
  29. sent_embedding: Sentence embedding (Python, 1 star)
  30. spark-runner: Runs Spark GraphX jobs (Java, 1 star)