2019 SOTA spell-checking tool for simplified and traditional Chinese: FASPell Chinese Spell Checker (Chinese spell check / Chinese spelling error detection / Chinese spelling error correction)

README in Chinese (中文版 README)

FASPell

This repository (licensed under GNU General Public License v3.0) contains all the data and code you need to build a state-of-the-art (by early 2019) Chinese spell checker and replicate the experiments in the original paper:

FASPell: A Fast, Adaptable, Simple, Powerful Chinese Spell Checker Based On DAE-Decoder Paradigm (LINK), published in the Proceedings of the 2019 EMNLP Workshop W-NUT: The 5th Workshop on Noisy User-generated Text.

If you use our code or data, please cite our work as:

@inproceedings{hong2019faspell,
    title = "{FASP}ell: A Fast, Adaptable, Simple, Powerful {C}hinese Spell Checker Based On {DAE}-Decoder Paradigm",
    author = "Hong, Yuzhong  and
      Yu, Xianguo  and
      He, Neng  and
      Liu, Nan  and
      Liu, Junhui",
    booktitle = "Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)",
    month = nov,
    year = "2019",
    address = "Hong Kong, China",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/D19-5522",
    pages = "160--169",
}

Overview

The task of Chinese spell checking (CSC) is conventionally reduced to detecting and correcting substitution errors in Chinese texts. Other types of errors such as deletion/insertion errors are relatively rare.

FASPell is a Chinese spell checker which allows you to readily spell-check any kind of Chinese text (simplified Chinese, traditional Chinese, human essays, OCR results, etc.) with state-of-the-art performance.

[Figure: overview of the FASPell model (model_fig.png)]

Performance

The following tables describe the performance of FASPell on the SIGHAN15 test set.

Sentence-level performance:

             Precision   Recall
Detection    67.6%       60.0%
Correction   66.6%       59.1%

Character-level performance:

             Precision   Recall
Detection    76.2%       67.1%
Correction   73.5%       64.8%

This means that approximately 7 out of 10 error detections/corrections are correct, and 6 out of 10 errors can be successfully detected/corrected.

Usage

These are step-by-step instructions for building a Chinese spell checker.

requirements

python == 3.6
tensorflow >= 1.7
matplotlib
tqdm
java (required only if tree edit distance is used)
apted.jar (required only if tree edit distance is used)
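
The Python dependencies can be installed with pip; pinning TensorFlow to the 1.x line is an assumption based on the requirements above (the code predates TensorFlow 2):

$ pip install "tensorflow>=1.7,<2.0" matplotlib tqdm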

data preparation

In this step, you download all the data here. The data include the spell checking data (for both training and testing) and the character features used to compute character similarity.

Since most of the data used by FASPell comes from other providers, note that the downloaded data must be converted to the formats we require.

In the repo, we provide a few sample data files as placeholders. Remember to overwrite them, keeping the same file names, once you have all the data.

After this step, if you are interested, you should be able to use the following script to compute character similarity:

$ python char_sim.py 午 牛 年 千

Note that FASPell adopts only string edit distance to compute similarity. If you are interested in using tree edit distance instead, you need to download (from here) and compile the tree edit distance executable apted.jar into the home directory before running:

$ python char_sim.py 午 牛 年 千 -t
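
For reference, here is a minimal sketch of the string-edit-distance idea behind this similarity computation: visual similarity as one minus the normalized Levenshtein distance between two stroke-level IDS strings. It illustrates the approach rather than reproducing char_sim.py, and the function names are hypothetical.

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via single-row dynamic programming."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            cur = min(dp[j] + 1,          # deletion
                      dp[j - 1] + 1,      # insertion
                      prev + (ca != cb))  # substitution (free if characters match)
            prev, dp[j] = dp[j], cur
    return dp[-1]

def visual_similarity(ids_a: str, ids_b: str) -> float:
    """1 - normalized edit distance between two stroke-level IDS strings."""
    if not ids_a or not ids_b:
        return 0.0
    return 1.0 - edit_distance(ids_a, ids_b) / max(len(ids_a), len(ids_b))

# e.g. visual_similarity('⿰丿㇏', '⿱⿻一丨一')  # IDS of 人 and 土, taken from the char_meta.txt example below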

training and tuning

We highly recommend reading our paper before implementing this step.

There are three training processes, to be performed in order. Click the links for details:

  1. pre-training a masked LM: see here
  2. fine-tuning the masked LM: see here
  3. tuning the filters in CSD: see here

running the spell checker

Check that your working directory now looks like this:

FASPell/
  - bert_modified/
      - create_data.py
      - create_tf_record.py
      - modeling.py
      - tokenization.py
  - data/
      - char_meta.txt
  - model/
      - fine-tuned/
          - model.ckpt-10000.data-00000-of-00001
          - model.ckpt-10000.index
          - model.ckpt-10000.meta
      - pre-trained/
          - bert_config.json
          - bert_model.ckpt.data-00000-of-00001
          - bert_model.ckpt.index
          - bert_model.ckpt.meta
          - vocab.txt
  - plots/
      ...
  - char_sim.py
  - faspell.py
  - faspell_configs.json
  - masked_lm.py
  - plot.py

Now, you should be able to use the following command to spell-check Chinese sentences:

$ python faspell.py 扫吗关注么众号 受奇艺全网首播

You can also check sentences in a file (one sentence per line):

$ python faspell.py -m f -f /path/to/your/file

To test the spell checker on a test set, set "testing_set" in faspell_configs.json to the path of the test set and run:

$ python faspell.py -m e

You can set "round" in faspell_configs.json to different values and run the above commands to find the best number of rounds.
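
For reference, here is a hedged sketch of the faspell_configs.json keys referenced in this README, written as a commented Python dict because JSON does not allow comments. The values are placeholders; the real file contains more options and may nest them differently.

# Illustrative only: keys are the ones mentioned in this README.
config_sketch = {
    "testing_set": "/path/to/test/set",   # test file used by `python faspell.py -m e`
    "round": 1,                           # number of spell-checking rounds
    "visual": 1,                          # 1/0 switch for visual (shape) similarity
    "phonological": 0,                    # 1/0 switch for phonological (sound) similarity
    "union_of_sims": False,               # set to true only after tuning (see post-tuning settings)
    "rank": 2,                            # highest candidate rank the filters were tuned for
    "dump_candidates": "",                # path for saving candidates during tuning; '' disables
    "read_from_dump": False,              # reuse previously dumped candidates
    "fine-tuned": "model/fine-tuned/model.ckpt-10000",  # fine-tuned masked-LM checkpoint
}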

Data

Chinese spell checking data

  1. human generated data:
    • SIGHAN-2013 shared task on CSC: LINK
    • SIGHAN-2014 shared task on CSC: LINK
    • SIGHAN-2015 shared task on CSC: LINK
  2. machine generated data:
    • OCR results used in our paper:

To use our code, the format of spell checking data should be like the following examples:

#(errors)	erroneous sentence	correct sentence
0	你好!我是張愛文。	你好!我是張愛文。
1	下個星期,我跟我朋唷打算去法國玩兒。	下個星期,我跟我朋友打算去法國玩兒。
0	我聽說,你找到新工作,我很高興。	我聽說,你找到新工作,我很高興。
1	對不氣,最近我很忙,所以我不會去妳的。	對不起,最近我很忙,所以我不會去妳的。
1	真麻煩你了。希望你們好好的跳無。	真麻煩你了。希望你們好好的跳舞。
3	我以前想要高訴你,可是我忘了。我真戶禿。	我以前想要告訴你,可是我忘了。我真糊塗。
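
If you need to load data in this format programmatically, a minimal reader could look like the following; the function name is hypothetical, and skipping "#"-prefixed header lines is an assumption based on the example above.

def load_csc_data(path):
    """Read (number of errors, erroneous sentence, correct sentence) triples."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip() or line.startswith("#"):  # skip blanks and the header
                continue
            num_errors, wrong, correct = line.rstrip("\n").split("\t")
            examples.append((int(num_errors), wrong, correct))
    return examples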

Features of Chinese characters:

We use the features from two open databases. Please check their licenses before use.

                        Database Name                                        Data Link   File used in our paper
Visual features         漢字データベースプロジェクト (Kanji Database Project)    LINK        ids.txt
Phonological features   Unihan Database                                      LINK        Unihan_Readings.txt

※ Note that the original ids.txt does not itself provide stroke-level IDS (for compression purposes). However, it is easy to produce stroke-level IDS for all characters yourself by tree recursion, starting from the simple characters that do have stroke-level IDS, as sketched below.
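
A minimal sketch of that recursion, assuming ids maps each character to its (possibly component-level) IDS from ids.txt; the names and the treatment of atomic strokes are assumptions:

IDC = set("⿰⿱⿲⿳⿴⿵⿶⿷⿸⿹⿺⿻")  # ideographic description characters (IDS operators)

def stroke_level_ids(ch, ids, cache=None):
    """Recursively expand an IDS until only strokes and IDS operators remain."""
    if cache is None:
        cache = {}
    if ch in cache:
        return cache[ch]
    decomposition = ids.get(ch, ch)  # characters with no entry are treated as atomic
    if decomposition == ch:          # atomic: a stroke or an undecomposable component
        result = ch
    else:
        result = "".join(c if c in IDC else stroke_level_ids(c, ids, cache)
                         for c in decomposition)
    cache[ch] = result
    return result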

The feature file (char_meta.txt) that can be used with our code should have the following format:

unicode	character	pronunciations of CJKV languages	stroke-level IDS
U+4EBA	人	ren2;jan4;IN;JIN,NIN;nhân	⿰丿㇏
U+571F	土	du4,tu3,cha3,tu2;tou2;TWU,THO;DO,TO;thổ	⿱⿻一丨一
U+7531	由	you2,yao1;jau4;YU;YUU,YUI,YU;do	⿻⿰丨𠃌⿱⿻一丨一
U+9A6C	马	ma3;maa5;null;null;null	⿹⿱𠃍㇉一
U+99AC	馬	ma3;maa5;MA;MA,BA,ME;mã	⿹⿱⿻⿱一⿱一一丨㇉灬

where:

  • the CJKV pronunciation string follows the format: MC;CC;K;JO;V;
  • when a character is polyphonic in a language, the possible sounds are separated with commas;
  • when a character has no pronunciation in a language, null is used as a placeholder.
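
A hedged sketch of parsing one such line (the function name is hypothetical; the field order MC;CC;K;JO;V follows the description above):

def parse_char_meta_line(line):
    """Split a char_meta.txt line into (codepoint, character, pronunciations, IDS)."""
    codepoint, char, pron_str, ids = line.rstrip("\n").split("\t")
    pronunciations = {}
    for lang, field in zip(("MC", "CC", "K", "JO", "V"), pron_str.split(";")):
        # 'null' marks a missing reading; polyphonic readings are comma-separated
        pronunciations[lang] = [] if field == "null" else field.split(",")
    return codepoint, char, pronunciations, ids

# e.g. parse_char_meta_line('U+99AC\t馬\tma3;maa5;MA;MA,BA,ME;mã\t⿹⿱⿻⿱一⿱一一丨㇉灬')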

Masked LM

Pre-training

To replicate the experiments in our paper or to get a pre-trained model instantly, you can download the pre-trained model used in the paper.

To pre-train one yourself, follow the instructions in the GitHub repo for BERT.

Put everything related to the pre-trained model under the model/pre-trained/ directory.

Fine-tuning

To produce the fine-tuning examples as described in the paper, run the following commands:

$ cd bert_modified
$ python create_data.py -f /path/to/training/data/file
$ python create_tf_record.py --input_file correct.txt --wrong_input_file wrong.txt --output_file tf_examples.tfrecord --vocab_file ../model/pre-trained/vocab.txt

Then, all you need to do is continue training the pre-trained model, following the same commands for pre-training described in the GitHub repo for BERT, except using the pre-trained model as the initial checkpoint.

Put the checkpoint files of the fine-tuned model under model/fine-tuned/. Then, set "fine-tuned" in faspell_configs.json to the path of the fine-tuned model.

CSD

Tuning the filters in CSD may take you quite some time because it is relatively complicated.

overall settings for tuning CSD

As described in the paper, we need to manually find a filtering curve for each group of candidates. In this code, we include a small hack: each group is further divided into two sub-groups, and a filtering curve is found for each sub-group. The division criterion is whether the first-rank candidate differs from the original character (i.e., top_difference=True/False). This hack helps because the value of top_difference has a big impact on how candidates are distributed on the confidence-similarity scatter graphs, as you will observe during tuning.

We recommend finding the filtering curves for the sub-groups in the following order:

top_difference=True, sim_type='shape', rank=0
top_difference=True, sim_type='shape', rank=1
top_difference=True, sim_type='shape', rank=2
        ...        ,       ...       ,   ...
top_difference=True, sim_type='sound', rank=0
top_difference=True, sim_type='sound', rank=1
top_difference=True, sim_type='sound', rank=2
        ...        ,       ...       ,   ...
top_difference=False, sim_type='shape', rank=0
top_difference=False, sim_type='shape', rank=1
top_difference=False, sim_type='shape', rank=2
        ...        ,       ...       ,   ...
top_difference=False, sim_type='sound', rank=0
top_difference=False, sim_type='sound', rank=1
top_difference=False, sim_type='sound', rank=2

To have sim_type='shape', set "visual": 1 and "phonological": 0 in faspell_configs.json; to have sim_type='sound', set "visual": 0 and "phonological": 1. (Keep "union_of_sims": false, which allows the curves for visual similarity and those for phonological similarity to be tuned independently.) To have rank=n, set "rank": n.

Before you start tuning, note that using the masked LM to produce the candidates each time may take quite a while. We therefore recommend saving the candidates produced while tuning the first group and then reusing them for the later groups. Setting "dump_candidates" to a save path stores the candidates; for later groups, set "read_from_dump" to true.

workflow for tuning the filter for each group of candidates

For each top_difference=True sub group of candidates, run:

$ python faspell.py -m e -t -d

or for each top_difference=False sub group of candidates, run:

$ python faspell.py -m e -t

then you will see the corresponding .png plots under the plots/ directory. Note that there are also enlarged .png plots for candidates distributed toward the high-confidence, low-similarity (bottom-right) corner, where candidates are very densely packed, to help you find the best curve.

For each sub-group, use the plots to find the curve, then add it to the Curves class as a function (which returns False if a candidate falls below the curve) and call that function in __init__() of the Filter class, as in the sketch below.
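
For illustration, such a function might look like the following; the method name and coefficients are made up, and yours must come from reading your own plots:

class Curves:
    @staticmethod
    def curve_shape_rank0(confidence, similarity):
        """Hypothetical hand-found curve: keep candidates above sim = 0.8 - 0.6 * conf."""
        return similarity > 0.8 - 0.6 * confidence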

post-tuning settings

When tuning is done, remember to change "union_of_sims" to true (which forces the union of the results from the two types of similarity, regardless of the values of "visual" and "phonological"). Set "rank" to the highest rank you tuned the filters for. Also, change "dump_candidates" to '' (the empty string) and "read_from_dump" to false.
