adversarial-multi-criteria-learning-for-CWS
Implementation of the ACL 2017 paper "Adversarial Multi-Criteria Learning for Chinese Word Segmentation": https://arxiv.org/abs/1704.07556
Dependencies
TensorFlow == 1.0.0
Pandas >= 0.18.1
NumPy >= 1.12.1
File Tree
|-- AdvMulti_model.py
|-- AdvMulti_train.py
|-- Baseline_model.py
|-- Baseline_train.py
|-- config.py
|-- data_as
| |-- dev
| |-- test
| |-- test_gold
| |-- train
| |-- words
| `-- words_for_training
|-- data_cityu
| |-- dev
| |-- test
| |-- test_gold
| |-- train
| |-- words
| `-- words_for_training
|-- data_ckip
| |-- dev
| |-- test
| |-- test_gold
| |-- train
| |-- words
| `-- words_for_training
|-- data_ctb
| |-- dev
| |-- test
| |-- test_gold
| |-- train
| |-- words
| `-- words_for_training
|-- data_helpers.py
|-- data_msr
| |-- dev
| |-- test
| |-- test_gold
| |-- train
| |-- words
| `-- words_for_training
|-- data_ncc
| |-- dev
| |-- test
| |-- test_gold
| |-- train
| |-- words
| `-- words_for_training
|-- data_pku
| |-- dev
| |-- test
| |-- test_gold
| |-- train
| |-- words
| `-- words_for_training
|-- data_sxu
| |-- dev
| |-- test
| |-- test_gold
| |-- train
| |-- words
| `-- words_for_training
|-- data_weibo
| |-- dev
| |-- test
| |-- test_gold
| |-- train
| |-- words
| `-- words_for_training
|-- models
| |-- cws_msr
| | `-- checkpoints
| |-- cws_ncc
| | `-- checkpoints
| |-- cws_sxu
| | `-- checkpoints
| |-- multi_model9
| | `-- checkpoints
| |-- train_words
| `-- vec100.txt
|-- prepare_data_index.py
|-- prepare_train_words.py
`-- voc.py
Data Format
The dev, train, and test files in each data_* directory use the following format:
1995#<NUM>#B_NT
The first field is the original character (1995), the second is the processed character (<NUM>), and the last is the segmentation tag plus the POS tag (B_NT). The POS information is not used in the paper; it is kept only for convenience in extended use.
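As an illustration of the format above, a single entry could be split into its fields as sketched below. The names Token and parse_token_line are hypothetical and not part of this repository. The sketch assumes that neither character field contains '#' and that the segmentation tag and the POS tag are joined by the first '_', which is inferred only from the B_NT example.

```python
from typing import NamedTuple


class Token(NamedTuple):
    raw: str       # original character(s), e.g. "1995"
    norm: str      # processed character, e.g. "<NUM>"
    seg_tag: str   # segmentation tag, e.g. "B"
    pos: str       # POS tag, e.g. "NT" (not needed for the paper)


def parse_token_line(line: str) -> Token:
    """Parse one 'raw#norm#TAG_POS' entry (hypothetical helper)."""
    parts = line.rstrip("\n").split("#")
    if len(parts) != 3:
        # Assumption: fields never contain '#'; anything else is rejected.
        raise ValueError("expected 3 '#'-separated fields: %r" % line)
    raw, norm, tag = parts
    # Assumption: the first '_' separates the segmentation tag from the POS tag.
    seg_tag, _, pos = tag.partition("_")
    return Token(raw, norm, seg_tag, pos)


if __name__ == "__main__":
    print(parse_token_line("1995#<NUM>#B_NT"))
```

Raising on malformed entries, rather than skipping them, makes it easier to notice if a dataset deviates from the format described here.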
The words file in each data_* directory is a dictionary of words:
平定 费尔南多·安特萨纳 北京索有文化传播有限公司
The words_for_training file in each data_* directory has the following format:
LC 过后 28
LC is the POS tag, ‘过后’ is the extracted bigram, and 28 is its frequency in that dataset. The POS information is not used in the paper; it is kept only for convenience in extended use.
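A minimal sketch of reading such entries and keeping only the words at or above a frequency threshold follows. The function names and the min_freq parameter are hypothetical, and the sketch is not the logic of prepare_train_words.py. It assumes the three fields are separated by whitespace and that the threshold is inclusive.

```python
def parse_word_entry(line):
    """Split an entry such as 'LC 过后 28' into (pos, word, freq)."""
    # Assumption: exactly three whitespace-separated fields per entry.
    pos, word, freq = line.split()
    return pos, word, int(freq)


def frequent_words(lines, min_freq):
    """Return the words whose frequency is at least min_freq (assumed inclusive)."""
    kept = []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        _pos, word, freq = parse_word_entry(line)
        if freq >= min_freq:
            kept.append(word)
    return kept


if __name__ == "__main__":
    sample = ["LC 过后 28", "NN 平定 3"]  # second entry is made up for illustration
    print(frequent_words(sample, min_freq=5))
```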
vec100.txt is the embedding file generated by the word2vec toolkit.
Here is the link: https://pan.baidu.com/s/1jHHdzmA
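If vec100.txt follows the usual word2vec text layout (an optional "vocab_size dim" header line, then one token and its vector per line), it could be read roughly as sketched below. That layout and the dimension of 100 suggested by the file name are assumptions, and load_word2vec_text is a hypothetical helper rather than code from this repository.

```python
import io


def load_word2vec_text(handle, dim=100):
    """Read 'token v1 ... v_dim' lines into a dict of float lists."""
    vectors = {}
    for line in handle:
        parts = line.split()
        if len(parts) != dim + 1:
            # Skips the optional two-field header and any malformed line.
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


if __name__ == "__main__":
    # Tiny in-memory stand-in for the real file, with dim=3 for brevity.
    fake = io.StringIO("2 3\n我 0.1 0.2 0.3\n你 1 2 3\n")
    print(sorted(load_word2vec_text(fake, dim=3)))
```

For the real file, the handle would be opened with an explicit encoding, for example open("models/vec100.txt", encoding="utf-8"); the encoding is also an assumption.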
Code Usage
prepare_data_index.py produces the .csv files that are used as direct input.
prepare_train_words.py generates the words (to be trained) whose frequency is above a specific threshold, for multi-task learning.
AdvMulti_model.py and AdvMulti_train.py are the paired model and training files for the adversarial multi-task model.
Baseline_model.py and Baseline_train.py are the paired model and training files for the baseline.
Run
The hyperparameters are defined in config.py and tf.FLAGS.
When you have all necessary files:
To train the baseline:
CUDA_VISIBLE_DEVICES=0 python Baseline_train.py
To train the adversarial multi-task model:
CUDA_VISIBLE_DEVICES=0 python AdvMulti_train.py
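Before launching either command, a quick check along these lines can confirm that the layout shown in the file tree is present. It is only a sketch: missing_files is a hypothetical helper, the dataset and file names are copied from the tree above, and whether each training script actually needs every one of these files is an assumption.

```python
from pathlib import Path

# Names copied from the file tree in this README.
DATASETS = ["as", "cityu", "ckip", "ctb", "msr", "ncc", "pku", "sxu", "weibo"]
EXPECTED = ["dev", "test", "test_gold", "train", "words", "words_for_training"]


def missing_files(root="."):
    """Return the relative paths from the file tree that do not exist under root."""
    root = Path(root)
    missing = []
    for name in DATASETS:
        for fname in EXPECTED:
            rel = Path("data_" + name) / fname
            if not (root / rel).is_file():
                missing.append(str(rel))
    # The embedding file sits under models/ in the tree.
    emb = Path("models") / "vec100.txt"
    if not (root / emb).is_file():
        missing.append(str(emb))
    return missing


if __name__ == "__main__":
    for path in missing_files("."):
        print("missing:", path)
```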