ChineseNMT: Translate English to Chinese with PyTorch Implementation of Transformer

ChineseNMT

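An English-to-Chinese translation model based on the Transformer 🤗

For a description of the project, see the Zhihu article: 教你用PyTorch玩转Transformer英译中翻译模型! ("Master the Transformer English-to-Chinese translation model with PyTorch!")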

Data

The dataset is from the WMT 2018 Chinese-English track (news domain only).

Data Process

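Tokenization

  • Tool: sentencepiece
  • Preprocessing: ./data/get_corpus.py extracts the bilingual text from the train, dev, and test sets and saves it to corpus.en and corpus.ch, one sentence per line.
  • Training the tokenizer: ./tokenizer/tokenize.py calls sentencepiece.SentencePieceTrainer.Train() to train tokenization models on the text in corpus.en and corpus.ch. When training finishes, chn.model, chn.vocab, eng.model, and eng.vocab are generated in the ./tokenizer folder; the .model and .vocab files are the model file and its corresponding vocabulary, respectively.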
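
The two preprocessing steps above can be sketched as follows. This is a minimal illustration, not the repository's code: the helper names `write_parallel_corpus`, `build_spm_args`, and `train_tokenizer`, the vocabulary size of 32000, and the `bpe` model type are all assumptions. Only the output file names and the call to `sentencepiece.SentencePieceTrainer.Train()` come from the description above.

```python
import os


def write_parallel_corpus(pairs, out_dir):
    """Write (english, chinese) pairs to corpus.en / corpus.ch, one sentence per line."""
    en_path = os.path.join(out_dir, "corpus.en")
    ch_path = os.path.join(out_dir, "corpus.ch")
    # An explicit utf-8 encoding keeps the Chinese text intact on every platform.
    with open(en_path, "w", encoding="utf-8") as fen, \
         open(ch_path, "w", encoding="utf-8") as fch:
        for en, ch in pairs:
            fen.write(en.strip() + "\n")
            fch.write(ch.strip() + "\n")
    return en_path, ch_path


def build_spm_args(corpus_path, model_prefix, vocab_size=32000, model_type="bpe"):
    """Build the flag string passed to SentencePieceTrainer.Train() (values are assumptions)."""
    return (
        f"--input={corpus_path} --model_prefix={model_prefix} "
        f"--vocab_size={vocab_size} --model_type={model_type}"
    )


def train_tokenizer(corpus_path, model_prefix, **kwargs):
    """Train a tokenizer; writes <model_prefix>.model and <model_prefix>.vocab."""
    import sentencepiece as spm  # imported lazily so the helpers above work without it
    spm.SentencePieceTrainer.Train(build_spm_args(corpus_path, model_prefix, **kwargs))
```

Under these assumptions, calling `train_tokenizer("corpus.en", "eng")` and `train_tokenizer("corpus.ch", "chn")` would write the four tokenizer files into the current working directory.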

Model

The model is Harvard's open-source transformer-pytorch implementation; a Chinese-language explanation is also available (link).

Requirements

This repo was tested on Python 3.6+ and PyTorch 1.5.1. The main requirements are:

  • tqdm
  • pytorch >= 1.5.1
  • sacrebleu >= 1.4.14
  • sentencepiece >= 0.1.94

To set up the environment quickly, run:

pip install -r requirements.txt

Usage

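Model parameters are set in config.py.

  • Because of the Transformer's GPU memory requirements, multi-GPU training is supported. To use it, set the device_id list in config.py and os.environ['CUDA_VISIBLE_DEVICES'] in main.py.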
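
One detail that is easy to get wrong with this kind of setup: once `CUDA_VISIBLE_DEVICES` is set, CUDA renumbers the visible GPUs from 0 inside the process, so any device ids used afterwards are logical indices rather than physical ones. A minimal sketch, assuming a hypothetical helper `select_gpus` (not part of the repository) that is called before any CUDA library initializes:

```python
import os


def select_gpus(physical_ids):
    """Expose only the given physical GPUs and return their in-process (logical) ids.

    Hypothetical helper for illustration; it must run before CUDA is initialized.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in physical_ids)
    # The visible GPUs are renumbered 0..n-1 in the order listed.
    return list(range(len(physical_ids)))


# Example: use physical GPUs 2 and 3; inside the process they appear as 0 and 1.
device_id = select_gpus([2, 3])
```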

To run the model, enter the following on the command line:

python main.py

Experiment results are saved in ./experiment/train.log, and the test-set translations in ./experiment/output.txt.

On two GeForce GTX 1080 Ti GPUs, each epoch takes about one hour.

Results

Model   NoamOpt   LabelSmoothing   Best Dev BLEU   Test BLEU
1       No        No               24.07           24.03
2       Yes       No               26.08           25.94
3       No        Yes              23.92           23.84
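
For readers unfamiliar with the NoamOpt column: in the Harvard Annotated Transformer code, NoamOpt wraps the optimizer with a warmup-then-decay learning-rate schedule. The sketch below shows only the rate formula. The values `d_model=512`, `factor=1.0`, and `warmup=4000` are illustrative assumptions, not the settings behind the results above.

```python
def noam_rate(step, d_model=512, factor=1.0, warmup=4000):
    """Learning rate at a given optimizer step (step >= 1)."""
    if step < 1:
        raise ValueError("step must be >= 1")
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)


# The rate rises linearly during warmup, peaks at step == warmup,
# then decays in proportion to 1 / sqrt(step).
peak = noam_rate(4000)
```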

Pretrained Model

The trained Model 2 (currently the best model) can be downloaded directly from the link below 😊

Link: https://pan.baidu.com/s/1RKC-HV_UmXHq-sy1-yZd2Q Password: g9wl

Beam Search

Test results of the current best model (Model 2) with beam search:

Beam size    2       3       4       5
Test BLEU    26.59   26.80   26.84   26.86
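
As a reminder of what the beam size controls: beam search keeps the `beam_size` highest-scoring partial translations at each decoding step instead of only the single best one. The toy sketch below is not the repository's decoder. `next_log_probs` is a hypothetical stand-in for the model that maps a token prefix to log-probabilities for the next token, and `toy_model` is an invented example.

```python
import math


def beam_search(next_log_probs, beam_size, max_len, bos="<s>", eos="</s>"):
    """Return the best (tokens, log-prob) hypothesis found by beam search."""
    beams = [([bos], 0.0)]   # live hypotheses: (tokens, cumulative log-prob)
    finished = []            # hypotheses that have produced eos
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, lp in next_log_probs(tokens).items():
                candidates.append((tokens + [tok], score + lp))
        # Keep only the beam_size best expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            (finished if tokens[-1] == eos else beams).append((tokens, score))
        if not beams:
            break
    # Fall back to unfinished hypotheses if nothing reached eos within max_len.
    return max(finished or beams, key=lambda c: c[1])


def toy_model(prefix):
    """Invented model: "a" looks best first, but "b" leads to a better full sequence."""
    table = {
        ("<s>",): {"a": math.log(0.6), "b": math.log(0.4)},
        ("<s>", "a"): {"x": math.log(0.5), "y": math.log(0.5)},
        ("<s>", "b"): {"z": math.log(0.9), "w": math.log(0.1)},
    }
    return table.get(tuple(prefix), {"</s>": 0.0})
```

In this toy case, `beam_search(toy_model, beam_size=1, max_len=5)` commits to "a" at the first step, while a beam of 2 keeps "b" alive and ends with a higher-probability sequence.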

One Sentence Translation

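Name the model you trained, or the pretrained model above, model.pth and save it under ./experiment. Then run translate_example in main.py to translate a single sentence.

For example, given the English input sentence: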

The near-term policy remedies are clear: raise the minimum wage to a level that will keep a fully employed worker and his or her family out of poverty, and extend the earned-income tax credit to childless workers.

the ground truth is:

近期的政策对策很明确:把最低工资提升到足以一个全职工人及其家庭免于贫困的水平,扩大对无子女劳动者的工资所得税减免。

and the translation with beam size = 3 is:

短期政策方案很清楚:把最低工资提高到充分就业的水平,并扩大向无薪工人发放所得的税收信用。

Mention

The code released in this repository has only been tested successfully on Linux. If you want to try it on Windows, the steps below, mentioned in issue 2, may be useful:

  1. Add an explicit utf-8 encoding argument to open():

    in lines 16 and 19 of get_corpus.py:

    with open(ch_path, "w", encoding="utf-8") as fch:
    with open(en_path, "w", encoding="utf-8") as fen:
    

    in line 165 of train.py:

    with open(config.output_path, "w", encoding="utf-8") as fp:
    
  2. Use conda to install sacrebleu if Anaconda is used to build your virtual environment:

    conda install -c conda-forge sacrebleu
    
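
The reason step 1 matters: in the Python versions this repo targets, `open()` without an explicit `encoding` uses the platform's locale encoding, which on Windows is often a legacy code page rather than UTF-8, so Chinese text can fail to encode or be written in an encoding other tools do not expect. A minimal check, using a temporary file and a sample sentence chosen for illustration:

```python
import os
import tempfile

sentence = "近期的政策对策很明确"  # sample text for illustration

# Write and read back with an explicit encoding so the result
# does not depend on the operating system's locale.
path = os.path.join(tempfile.mkdtemp(), "output.txt")
with open(path, "w", encoding="utf-8") as fp:
    fp.write(sentence + "\n")
with open(path, "r", encoding="utf-8") as fp:
    round_trip = fp.read().rstrip("\n")
```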

If you run into any other problems in your own project, feel free to open an issue or email me 😊~
