  • Stars
    1,446
  • Rank 32,535 (Top 0.7 %)
  • Language
    Python
  • License
    Apache License 2.0
  • Created over 7 years ago
  • Updated over 1 year ago

Repository Details

Turn Chinese natural language into structured data (Chinese natural language understanding)

Rasa NLU for Chinese, a fork from RasaHQ/rasa_nlu.

Please refer to the latest instructions in the official Rasa NLU documentation.

Chinese blog

Files you should have:

  • data/total_word_feature_extractor_zh.dat

Trained on a Chinese corpus with the MITIE wordrep tool (training takes 2-3 days).

To train it yourself, build the MITIE wordrep tool. Note that the Chinese corpus must be tokenized before it is fed into the tool. A closed-domain corpus that closely matches your use case works best.

A model trained on the Chinese Wikipedia dump and Baidu Baike can be downloaded from the Chinese blog.
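
The tokenization step mentioned above could look like the following minimal sketch. It assumes Jieba is the tokenizer and that the corpus is plain UTF-8 text with one sentence or paragraph per line. The names `jieba_tokenize`, `tokenize_lines`, and `tokenize_file` are hypothetical, and it is only an assumption here that the wordrep tool is happy with space-separated tokens.

```python
from typing import Callable, Iterable, Iterator, List


def jieba_tokenize(text: str) -> List[str]:
    """Segment Chinese text with Jieba, dropping whitespace-only tokens."""
    import jieba  # third-party (pip install jieba); imported lazily
    return [tok for tok in jieba.cut(text) if tok.strip()]


def tokenize_lines(lines: Iterable[str],
                   tokenize: Callable[[str], List[str]] = jieba_tokenize) -> Iterator[str]:
    """Yield each non-empty line as a string of space-separated tokens."""
    for line in lines:
        line = line.strip()
        if line:
            yield " ".join(tokenize(line))


def tokenize_file(src_path: str, dst_path: str) -> None:
    """Write a tokenized copy of src_path to dst_path (both paths are placeholders)."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for tokenized in tokenize_lines(src):
            dst.write(tokenized + "\n")
```

Passing the tokenizer as a parameter keeps the sketch usable with a different segmenter if Jieba does not suit your domain.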

  • data/examples/rasa/demo-rasa_zh.json

Add as many examples as possible.
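
For orientation, the sketch below builds one training example in the Rasa NLU JSON layout (`rasa_nlu_data` / `common_examples`) and checks that the entity offsets match the text. The sentence and labels are borrowed from the sample server response shown in the Usage section; the helper name `check_example` is hypothetical, and the actual contents of demo-rasa_zh.json may differ.

```python
import json


def check_example(example: dict) -> None:
    """Raise ValueError if an entity's start/end offsets do not slice out its value."""
    text = example["text"]
    for ent in example.get("entities", []):
        if text[ent["start"]:ent["end"]] != ent["value"]:
            raise ValueError(f"bad offsets for entity {ent['value']!r} in {text!r}")


example = {
    "text": "我发烧了该吃什么药?",
    "intent": "medical",
    "entities": [
        # Offsets are character positions; "end" is exclusive.
        {"start": 1, "end": 3, "value": "发烧", "entity": "disease"}
    ],
}
check_example(example)

training_data = {"rasa_nlu_data": {"common_examples": [example]}}
print(json.dumps(training_data, ensure_ascii=False, indent=2))
```

Running a check like this over every example before training catches off-by-one entity offsets, which are easy to introduce when annotating Chinese text by hand.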

Usage:

  1. Clone this project and run:
python setup.py install
  1. Modify configuration.

    Currently for Chinese we have two pipelines:

    Use MITIE+Jieba (sample_configs/config_jieba_mitie.yml):

language: "zh"

pipeline:
- name: "nlp_mitie"
  model: "data/total_word_feature_extractor_zh.dat"
- name: "tokenizer_jieba"
- name: "ner_mitie"
- name: "ner_synonyms"
- name: "intent_entity_featurizer_regex"
- name: "intent_classifier_mitie"

RECOMMENDED: Use MITIE+Jieba+sklearn (sample_configs/config_jieba_mitie_sklearn.yml):

language: "zh"

pipeline:
- name: "nlp_mitie"
  model: "data/total_word_feature_extractor_zh.dat"
- name: "tokenizer_jieba"
- name: "ner_mitie"
- name: "ner_synonyms"
- name: "intent_entity_featurizer_regex"
- name: "intent_featurizer_mitie"
- name: "intent_classifier_sklearn"
  1. (Optional) Use a Jieba user-defined dictionary or switch the Jieba default dictionary:

    The "user_dicts" value can be either a file path or a directory path. (sample_configs/config_jieba_mitie_sklearn_plus_dict_path.yml)

language: "zh"

pipeline:
- name: "nlp_mitie"
  model: "data/total_word_feature_extractor_zh.dat"
- name: "tokenizer_jieba"
  default_dict: "./default_dict.big"
  user_dicts: "./jieba_userdict"
#  user_dicts: "./jieba_userdict/jieba_userdict.txt"
- name: "ner_mitie"
- name: "ner_synonyms"
- name: "intent_entity_featurizer_regex"
- name: "intent_featurizer_mitie"
- name: "intent_classifier_sklearn"
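
As a rough illustration of the "file path or directory path" behaviour, the sketch below expands such a value into a list of dictionary files and loads each one with Jieba. `jieba.set_dictionary` and `jieba.load_userdict` are Jieba's own functions; the helper names `resolve_dict_files` and `configure_jieba` are hypothetical, and this is not necessarily how the tokenizer component implements it.

```python
import os
from typing import List, Optional


def resolve_dict_files(path: str) -> List[str]:
    """Return [path] for a file, or the sorted regular files inside a directory."""
    if os.path.isdir(path):
        entries = (os.path.join(path, name) for name in sorted(os.listdir(path)))
        return [p for p in entries if os.path.isfile(p)]
    if os.path.isfile(path):
        return [path]
    raise FileNotFoundError(path)


def configure_jieba(user_dicts: str, default_dict: Optional[str] = None) -> None:
    """Apply the dictionary settings to Jieba (imported lazily; pip install jieba)."""
    import jieba
    if default_dict:
        jieba.set_dictionary(default_dict)  # replace Jieba's main dictionary
    for dict_file in resolve_dict_files(user_dicts):
        jieba.load_userdict(dict_file)  # add user-defined words


# Example call, using the paths from the config above:
# configure_jieba("./jieba_userdict", default_dict="./default_dict.big")
```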
  1. Train model by running:

    If you specify a project name in the configuration file, the model is saved at /models/your_project_name.

    Otherwise, the model is saved at /models/default.

python -m rasa_nlu.train -c sample_configs/config_jieba_mitie_sklearn.yml --data data/examples/rasa/demo-rasa_zh.json --path models
  1. Run the rasa_nlu server:
python -m rasa_nlu.server -c sample_configs/config_jieba_mitie_sklearn.yml --path models
  1. Open a new terminal and query the server with curl, for example:
$ curl -XPOST localhost:5000/parse -d '{"q":"我发烧了该吃什么药?", "project": "rasa_nlu_test", "model": "model_20170921-170911"}' | python -mjson.tool
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   652    0   552  100   100    157     28  0:00:03  0:00:03 --:--:--   157
{
    "entities": [
        {
            "end": 3,
            "entity": "disease",
            "extractor": "ner_mitie",
            "start": 1,
            "value": "发烧"
        }
    ],
    "intent": {
        "confidence": 0.5397186422631861,
        "name": "medical"
    },
    "intent_ranking": [
        {
            "confidence": 0.5397186422631861,
            "name": "medical"
        },
        {
            "confidence": 0.16206323981749196,
            "name": "restaurant_search"
        },
        {
            "confidence": 0.1212448457737397,
            "name": "affirm"
        },
        {
            "confidence": 0.10333600028547868,
            "name": "goodbye"
        },
        {
            "confidence": 0.07363727186010374,
            "name": "greet"
        }
    ],
    "text": "我发烧了该吃什么药?"
}
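
The same request can be made from Python. The sketch below uses only the standard library and assumes the server is listening on localhost:5000 as in the curl example. The helper names `parse_text`, `top_intent`, and `entity_values` are hypothetical, and the project and model names in the commented call are the ones from the example above, so they will differ for your own trained model.

```python
import json
import urllib.request
from typing import List, Optional, Tuple


def parse_text(text: str, project: Optional[str] = None, model: Optional[str] = None,
               url: str = "http://localhost:5000/parse") -> dict:
    """POST a query to the rasa_nlu server and return the decoded JSON response."""
    payload = {"q": text}
    if project:
        payload["project"] = project
    if model:
        payload["model"] = model
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),  # supplying data makes this a POST
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def top_intent(result: dict) -> Tuple[str, float]:
    """Return the (name, confidence) of the highest-ranked intent."""
    return result["intent"]["name"], result["intent"]["confidence"]


def entity_values(result: dict) -> List[Tuple[str, str]]:
    """Return (entity type, value) pairs for every extracted entity."""
    return [(ent["entity"], ent["value"]) for ent in result.get("entities", [])]


# Example call (requires the server from the previous step to be running):
# result = parse_text("我发烧了该吃什么药?", project="rasa_nlu_test",
#                     model="model_20170921-170911")
# print(top_intent(result), entity_values(result))
```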
