  • Language: Python
  • License: Apache License 2.0
Repository Details

g2pC: A Context-aware Grapheme-to-Phoneme Conversion module for Chinese

There are several open-source libraries for Chinese grapheme-to-phoneme conversion, such as python-pinyin and xpinyin. However, none of them seems to disambiguate Chinese polyphonic words like "行" ("xíng" (go, walk) vs. "háng" (line)) or "了" ("le" (completed action marker) vs. "liǎo" (finish, achieve)). Instead, they pick the most frequent pronunciation. That is a simple and economical strategy, but machine learning techniques can help. We use a conditional random field (CRF) to determine the pronunciation of polyphonic words. The features are the target word and its part-of-speech tag, both provided by pkuseg, together with the neighboring words.
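
The featurization described above can be sketched as follows. This is a minimal illustration, assuming a context window of one token on each side; the function name `token_features`, the feature keys, and the window size are assumptions for illustration, not the features g2pC actually uses. The result is a list of feature dicts, one per token, which is the per-sequence input shape that sklearn_crfsuite expects.

```python
def token_features(tokens, i):
    """Build a feature dict for the i-th (word, pos) token (hypothetical feature set)."""
    word, pos = tokens[i]
    feats = {"bias": 1.0, "word": word, "pos": pos}

    # Left neighbor, or a beginning-of-sentence flag.
    if i > 0:
        prev_word, prev_pos = tokens[i - 1]
        feats["-1:word"] = prev_word
        feats["-1:pos"] = prev_pos
    else:
        feats["BOS"] = True

    # Right neighbor, or an end-of-sentence flag.
    if i < len(tokens) - 1:
        next_word, next_pos = tokens[i + 1]
        feats["+1:word"] = next_word
        feats["+1:pos"] = next_pos
    else:
        feats["EOS"] = True

    return feats


# Segmented and POS-tagged sentence, in the (word, pos) shape pkuseg returns.
tokens = [("我", "r"), ("写", "v"), ("了", "u"), ("几", "m"),
          ("行", "q"), ("代码", "n"), ("。", "w")]
features = [token_features(tokens, i) for i in range(len(tokens))]
print(features[4])  # features for the polyphonic word 行
```

With features like these, the classifier can see that 行 is tagged as a measure word and sits between 几 and 代码, which is the kind of context a frequency-only lookup ignores.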

Requirements

  • python >= 3.6
  • pkuseg
  • sklearn_crfsuite

Installation

pip install g2pc

Main Features

  • Disambiguate polyphonic Chinese characters/words and return the most likely pinyin in the context using CRF implemented with sklearn_crfsuite.
  • By associating the segmentation results provided by pkuseg with the open-source dictionary CC-CEDICT, display the following information for each word.
    • word
    • part-of-speech
    • pinyin
    • descriptive pinyin: where Chinese tone change rules are applied
    • English meaning
    • traditional equivalent
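
As a rough sketch of how CC-CEDICT can be associated with segmented words, the snippet below parses entries in the dictionary's plain-text line format (`Traditional Simplified [pin1 yin1] /gloss/`) and indexes them by simplified form. The helper names `parse_line` and `build_index` are hypothetical, and the sample glosses are abbreviated; g2pC itself ships a pre-built `cedict.pkl`, whose internal layout is not shown here.

```python
import re
from collections import defaultdict

# One CC-CEDICT entry per line: Traditional Simplified [pin1 yin1] /gloss 1/gloss 2/
ENTRY = re.compile(r"^(\S+) (\S+) \[([^\]]*)\] (/.*/)$")


def parse_line(line):
    """Return (simplified, pinyin, meaning, traditional), or None for comments/blank lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = ENTRY.match(line)
    if match is None:
        return None
    traditional, simplified, pinyin, meaning = match.groups()
    return simplified, pinyin, meaning, traditional


def build_index(lines):
    """Map each simplified word to all of its (pinyin, meaning, traditional) candidates."""
    index = defaultdict(list)
    for line in lines:
        entry = parse_line(line)
        if entry is not None:
            simplified, pinyin, meaning, traditional = entry
            index[simplified].append((pinyin, meaning, traditional))
    return index


# Toy sample with abbreviated glosses (illustrative, not the full dictionary).
sample = [
    "# comment line",
    "代碼 代码 [dai4 ma3] /code/",
    "行 行 [hang2] /row/",
    "行 行 [xing2] /to walk/",
]
index = build_index(sample)
print(index["行"])  # a polyphonic word has more than one candidate
```

A word with several entries in the index is exactly the case where a disambiguation step is needed.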

Algorithm (illustrated with an example)

e.g., Input: 我写了几行代码。 (I wrote a few lines of code.)

  • STEP 1. Segment input string using pkuseg.

    • -> [('我', 'r'), ('写', 'v'), ('了', 'u'), ('几', 'm'), ('行', 'q'), ('代码', 'n'), ('。', 'w')]
  • STEP 2. Look up each token in CC-CEDICT. Each resulting tuple consists of the word, its POS, pronunciation candidates, meaning candidates, and traditional character candidates.

    • -> [('我', 'r', ['wo3'], ['/I/me/my/'], ['我']),
      ('写', 'v', ['xie3'], ['/to write/'], ['寫']),
      ('了', 'u', ['le5', 'liao3', 'liao4'], ['/(modal particle ..'], ['了', '了', '瞭']),
      ('几', 'm', ['ji3', 'ji1'], ['/how much/..'], ['幾', '几']),
      ('行', 'q', ['xing2', 'hang2'], ['/to walk/..'], ['行', '行']),
      ('代码', 'n', ['dai4 ma3'], ['/code/'], ['代碼']),
      ('。', 'w', ['。'], [''], ['。'])]
  • STEP 3. Disambiguate polyphonic words using our pre-trained CRF model.

    • -> [('我', 'r', 'wo3', '/I/me/my/', '我'),
      ('写', 'v', 'xie3', '/to write/', '寫'),
      ('了', 'u', 'le5', '/(modal particle ..', '了'),
      ('几', 'm', 'ji3', '/how much/..', '幾'),
      ('行', 'q', 'hang2', "/row/..", '行'),
      ('代码', 'n', 'dai4 ma3', '/code/', '代碼'),
      ('。', 'w', '。', '', '。')]
  • STEP 4. Tone change rules are applied.

    • -> [('我', 'r', 'wo3', 'wo2', '/I/me/my/', '我'),
      ('写', 'v', 'xie3', 'xie3', '/to write/', '寫'),
      ('了', 'u', 'le5', 'le5', '/(modal particle ..', '了'),
      ('几', 'm', 'ji3', 'ji3', '/how much/..', '幾'),
      ('行', 'q', 'hang2', 'hang2', "/row/..", '行'),
      ('代码', 'n', 'dai4 ma3', 'dai4 ma3', '/code/', '代碼'),
      ('。', 'w', '。', '。', '', '。')]
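
To make STEP 4 concrete, here is a simplified sketch of one well-known tone change rule, third-tone sandhi: a third-tone syllable immediately followed by another third-tone syllable is pronounced with the second tone. The function name `apply_third_tone_sandhi` is hypothetical, and this is not g2pC's actual rule set, which covers more cases (for example the changes on 一). Runs of three or more third tones are handled naively here.

```python
def apply_third_tone_sandhi(word_pinyins):
    """Apply a simplified third-tone sandhi rule across a sentence.

    word_pinyins: one pinyin string per word, syllables separated by spaces,
    e.g. ["wo3", "xie3", "dai4 ma3"]. Returns a list of the same shape.
    """
    # Flatten to syllables so the rule also applies across word boundaries.
    syllables = [s for pinyin in word_pinyins for s in pinyin.split()]
    changed = list(syllables)

    # Decide each change from the original tones (naive for longer runs).
    for i in range(len(syllables) - 1):
        if syllables[i].endswith("3") and syllables[i + 1].endswith("3"):
            changed[i] = syllables[i][:-1] + "2"

    # Regroup the syllables into their original words.
    result, pos = [], 0
    for pinyin in word_pinyins:
        n = len(pinyin.split())
        result.append(" ".join(changed[pos:pos + n]))
        pos += n
    return result


# Pinyin column from STEP 3 of the example above.
step3 = ["wo3", "xie3", "le5", "ji3", "hang2", "dai4 ma3", "。"]
print(apply_third_tone_sandhi(step3))
```

On the example sentence only 我 changes (wo3 becomes wo2 before xie3), which matches the descriptive pinyin shown in STEP 4.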

Usage

>>> from g2pc import G2pC
>>> g2p = G2pC()
>>> # The call returns a list of tuples, each of which consists of the word, POS, pinyin,
>>> # (tone-changed) descriptive pinyin, English meaning, and equivalent traditional characters.
>>> g2p("一心一意")
[('一心一意',
  'i',
  'yi1 xin1 yi1 yi4',
  'yi4 xin1 yi2 yi4',
  "/concentrating one's thoughts and efforts/single-minded/bent on/intently/",
  '一心一意')]
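
One possible way to work with the returned tuples is to give the fields names. The sketch below uses a literal copy of the output shown above rather than calling the library, so it runs without g2pC installed; the `Token` class and its field names are illustrative assumptions, not part of the g2pC API.

```python
from typing import NamedTuple


class Token(NamedTuple):
    """Hypothetical named view of one g2pC output tuple."""
    word: str
    pos: str
    pinyin: str
    descriptive_pinyin: str
    meaning: str
    traditional: str


# Literal copy of the result of g2p("一心一意") shown above.
result = [("一心一意",
           "i",
           "yi1 xin1 yi1 yi4",
           "yi4 xin1 yi2 yi4",
           "/concentrating one's thoughts and efforts/single-minded/bent on/intently/",
           "一心一意")]

tokens = [Token(*t) for t in result]

# Join the tone-changed pinyin of all words, e.g. as input to a TTS front end.
spoken = " ".join(t.descriptive_pinyin for t in tokens)

# CC-CEDICT separates glosses with slashes; drop the empty edge pieces.
glosses = [g for g in tokens[0].meaning.split("/") if g]

print(spoken)
print(glosses)
```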

Respectful comparison with other libraries

>>> text1 = "我写了几行代码。" # pay attention to the 行, which should be read as 'hang2', not 'xing2'
>>> text2 = "来不了" # pay attention to the 了, which should be read as 'liao3', not 'le'
# python-pinyin (install from a shell first: pip install pypinyin)
>>> from pypinyin import pinyin
>>> pinyin(text1)
[['wǒ'], ['xiě'], ['le'], ['jǐ'], ['xíng'], ['dài'], ['mǎ'], ['。']]
>>> pinyin(text2)
[['lái'], ['bù'], ['le']]
# xpinyin (install from a shell first: pip install xpinyin)
>>> from xpinyin import Pinyin
>>> p = Pinyin()
>>> p.get_pinyin(text1, tone_marks="numbers")  
'wo3-xie3-le5-ji1-xing2-dai4-ma3-。'
>>> p.get_pinyin(text2, tone_marks="numbers")   
'lai2-bu4-le5'
  • Accuracy on internal test set (13,191 syllables)

| Model | # Correct | # Incorrect | Acc. (%) |
|---|---|---|---|
| g2pC (0.9.9.3) | 13,033 | 158 | 98.80 |
| pypinyin (0.35.3) | 12,975 | 216 | 98.36 |
| xpinyin (0.5.6) | 12,838 | 353 | 97.32 |
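
The accuracy figures follow directly from the counts: accuracy is the number of correct syllables divided by the 13,191 syllables in the test set. A short check, using only the numbers reported above (the variable and function names are illustrative):

```python
TOTAL_SYLLABLES = 13191

# Number of correctly converted syllables per library, as reported above.
correct = {
    "g2pC (0.9.9.3)": 13033,
    "pypinyin (0.35.3)": 12975,
    "xpinyin (0.5.6)": 12838,
}


def summarize(n_correct, total=TOTAL_SYLLABLES):
    """Return (number incorrect, accuracy in percent rounded to 2 decimals)."""
    return total - n_correct, round(100 * n_correct / total, 2)


summary = {name: summarize(n) for name, n in correct.items()}
for name, (n_incorrect, acc) in summary.items():
    print(f"{name}: {n_incorrect} incorrect, {acc:.2f}% accurate")
```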

Changelog

0.9.9.3 July 10, 2019

  • Refined the tone change rules.

0.9.9.2 July 10, 2019

  • Refined the cedict.pkl.

0.9.9.1 July 9, 2019

  • Fixed a bug of failing to find Chinese characters for names. (See this)

0.9.6. July 7, 2019

  • Fixed a bug of failing to convert words not found in the dictionary.
  • Rearranged the cedict.pkl.
  • Refined the CRF model.
  • Added tone change rules. (See this)

0.9.4. July 4, 2019

  • Initial launch

References

If you use our software for research, please cite:

@misc{gp2C2019,
  author = {Park, Kyubyong},
  title = {g2pC},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/Kyubyong/g2pC}}
}
