  • Stars: 252
  • Rank 156,103 (Top 4%)
  • Language: Python
  • Created over 6 years ago
  • Updated about 4 years ago

Repository Details

Somiao Pinyin (搜喵拼音输入法, "Somiao Pinyin Input Method"): Train your own Chinese Input Method with Seq2seq Model

Blog (in Chinese)

Personalized Chinese Pinyin Input Method with Seq2seq model

The original code, written for research purposes, is at https://github.com/Kyubyong/neural_chinese_transliterator.

This repository intends to experiment with different training data and interactive user inputs, and possibly develop towards a real data-personalized and model-localized Pinyin Input product.

Requirements

  • Python (>=3.5)

  • TensorFlow (>=r1.2)

  • xpinyin (for Chinese pinyin annotation)

  • distance (for calculating the similarity score between two strings)

  • tqdm
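
As a rough illustration of the kind of string similarity score mentioned above, the sketch below computes a Levenshtein (edit) distance and a normalized similarity using only the standard library. It assumes edit distance is the metric of interest, which this README does not state; `edit_distance` and `similarity` are hypothetical names and are not part of this repository or of the distance package.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest
```

A score like this works character by character, so a predicted sentence that differs from the reference in a single Chinese character still scores close to 1.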

Usage

Training:

  • STEP 1. Download Leipzig Chinese Corpus

    Extract it and copy zho_news_2007-2009_1M-sentences.txt to data/ folder.

    Or use your own Chinese Corpus with the same format.

  • STEP 2. Build a Pinyin-Chinese parallel corpus.

    python3 build_corpus.py

  • STEP 3. Run prepro.py to make vocabulary and training data.

    python3 prepro.py

  • STEP 4. Adjust hyperparameters in hyperparams.py if necessary.

  • STEP 5. Train the model.

    python3 train.py
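
To make STEP 2 and STEP 3 more concrete, here is a minimal sketch of what building a parallel corpus line and a character vocabulary can look like. Everything in it is an assumption for illustration: `TOY_PINYIN` is a tiny hand-written table standing in for a real annotator such as xpinyin, and the tab-separated line format, the `<PAD>`/`<UNK>` symbols, and the function names are hypothetical. The formats actually produced by build_corpus.py and prepro.py may differ.

```python
from collections import Counter

# Toy stand-in for a real pinyin annotator (hypothetical table).
TOY_PINYIN = {"你": "ni", "好": "hao", "成": "cheng", "功": "gong", "了": "le"}


def to_parallel_line(sentence: str) -> str:
    """Return 'pinyin<TAB>hanzi' for one sentence; unknown characters pass through."""
    pinyin = "".join(TOY_PINYIN.get(ch, ch) for ch in sentence)
    return f"{pinyin}\t{sentence}"


def build_vocab(texts, min_count=1):
    """Map each symbol to an integer id; 0 is padding and 1 is unknown."""
    counts = Counter(ch for text in texts for ch in text)
    vocab = {"<PAD>": 0, "<UNK>": 1}
    for symbol, n in sorted(counts.items()):
        if n >= min_count:
            vocab[symbol] = len(vocab)
    return vocab


def encode(text, vocab):
    """Turn a string into a list of ids, using <UNK> for unseen symbols."""
    return [vocab.get(ch, vocab["<UNK>"]) for ch in text]


if __name__ == "__main__":
    lines = [to_parallel_line(s) for s in ["你好", "成功了"]]
    pinyin_vocab = build_vocab(line.split("\t")[0] for line in lines)
    print(lines)
    print(encode("nihao", pinyin_vocab))
```

A seq2seq model would be trained on the two sides of each line, with one vocabulary for pinyin letters and another for Chinese characters.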

Inference with command line input:

For command line input testing, run:

python3 eval.py

To evaluate on the original test data instead, change which function is called as the main function in eval.py.

Testing with pre-trained models:

Download the pre-trained model from the blog and unzip it to generate the /log and /data directories.

Remember to overwrite the pickle files in /data with the pre-trained model data.

Then run the following for command line input testing:

python3 eval.py

Sample Results

The model is trained on Chinese news from 2007-2009, so many sayings that are common today were not learned. In the transcript below, the prompt 请输入测试拼音 means "Please enter test pinyin".

请输入测试拼音:nihao
你好

请输入测试拼音:chenggongle
成功了

请输入测试拼音:wolegequ
我了个曲

请输入测试拼音:taibangla
太棒啦

请输入测试拼音:dacolehuizenmeyang
打破了会怎么样

请输入测试拼音:pujinghehujintaotongdianhua
普京和胡锦涛通电话

请输入测试拼音:xiangbuqilaishinianqianfashengleshenme
想不起来十年前发生了什么

请输入测试拼音:meiguohongzhawomenzainansilafudedashiguan
美国轰炸我们在南斯拉夫的大事馆

请输入测试拼音:liudehuanageshihouhaonianqing
刘德华那个时候好年轻

请输入测试拼音:shishihouxunlianyixiabilibilideyuliaole
是时候训练一下比例比例的预料了

TODO List

  • Pretrained models on different contexts

  • Model selection, so that different models are used for different kinds of input (chatting, writing scientific papers, etc.)

  • A function to record LOCALLY what the user has typed, for use as a personalized corpus

  • User Interface

  • ...
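
The local-recording item above could be approached along the following lines. This is only a sketch of one possible design, not code from this repository: the function names, the tab-separated file format, and the idea of storing confirmed pinyin/character pairs in a plain text file are all assumptions.

```python
import os


def record_input(pinyin: str, hanzi: str, path: str) -> None:
    """Append one 'pinyin<TAB>hanzi' pair to a local UTF-8 text file."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"{pinyin}\t{hanzi}\n")


def load_recorded(path: str):
    """Read back all recorded pairs as (pinyin, hanzi) tuples."""
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf-8") as f:
        return [tuple(line.rstrip("\n").split("\t", 1))
                for line in f if line.strip()]
```

Because the file never leaves the machine, the recorded pairs could later be appended to the training corpus to personalize a model without sending user input anywhere.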
