  • Stars: 102
  • Rank: 335,584 (Top 7%)
  • License: MIT License
  • Created: almost 6 years ago
  • Updated: almost 5 years ago


Repository Details

A collection of Transformer guides, implementations and variants.

A collection of Transformer guides, implementations and related materials, for those who want to use the Transformer as a baseline in their research or simply reproduce the performance reported in the paper.

Please feel free to open pull requests or report issues.

Why this project

The Transformer is a powerful model for sequence-to-sequence learning. However, when we used the Transformer as our baseline in NMT research, we could not find a good, reliable guide for reproducing results close to those reported in the original paper (not even for the official tensor2tensor implementation), which would have made our research unconvincing. We collected several implementations, worked out the corresponding approaches for reproducing the reported performance, and gathered other materials, which eventually formed this project.

Papers

NMT Basic

Transformer original paper

Implementations & how to reproduce the paper's results

There are indeed many Transformer implementations on the Internet; to keep the learning curve manageable, we only include the most valuable projects here.

[Note]: The original Transformer paper reports two results, WMT14 English-German and WMT14 English-French. Here we regard an implementation as performance-reproducible if there is an approach that reproduces the WMT14 English-German BLEU score, and for each such implementation we also provide the corresponding approach to reproduce the WMT14 English-German result.

Minimal, paper-equivalent but not necessarily performance-reproducible implementations (both are PyTorch implementations)

  1. attention-is-all-you-need-pytorch

  2. Harvard NLP Group's annotation

Complex, performance-reproducible implementations

The Transformer's original results were produced on 8 GPUs: each GPU loads one batch, and after the forward pass the losses of the 8 batches are summed before the backward pass. If you only have 1 GPU, you can imitate this by accumulating the loss over every 8 batches before running the backward pass. More generally, choose gpu_count, tokens_on_each_gpu and gradient_accumulation_count so that gpu_count * tokens_on_each_gpu * gradient_accumulation_count = 4096 * 8. See each implementation's guide for details.
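
For example, a quick sanity check of a 4-GPU configuration (a sketch only; the variable names are just the quantities above, not flags of any toolkit):

# 4 GPUs, 4096 tokens per GPU, gradients accumulated over 2 batches
GPU_COUNT=4
TOKENS_ON_EACH_GPU=4096
GRADIENT_ACCUMULATION_COUNT=2
# should equal the paper's 8-GPU effective batch of 4096 * 8 = 32768 tokens per update
echo $((GPU_COUNT * TOKENS_ON_EACH_GPU * GRADIENT_ACCUMULATION_COUNT))   # prints 32768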

Although the original paper used multi-bleu.perl to evaluate BLEU, we recommend sacrebleu, which should be equivalent to mteval-v13a.pl but is more convenient. Use it to calculate BLEU and report the full signature, e.g. BLEU+case.mixed+lang.de-en+test.wmt17 = 32.97 66.1/40.2/26.6/18.1 (BP = 0.980 ratio = 0.980 hyp_len = 63134 ref_len = 64399), so that others can reproduce the score easily.

# calculate lowercase bleu on all tokenized text
cat model_prediction | sacrebleu -tok none -lc ground_truth
# calculate lowercase bleu on all tokenized text if you have 3 ground truth
cat model_prediction | sacrebleu -tok none -lc ground_truth_1 ground_truth_2 ground_truth_3 
# calculate lowercase bleu on all untokenized romance-language text using v13a tokenization
cat model_prediction | sacrebleu -tok 13a -lc ground_truth
# calculate lowercase bleu on all untokenized romance-language text using v14 tokenization
cat model_prediction | sacrebleu -tok intl -lc ground_truth

The Transformer paper's original model settings can be found in tensor2tensor's transformer.py. For example, the base model config is defined in the transformer_base function.

OpenNMT-tf also provides reproduction instructions, but if we have to use TensorFlow we prefer tensor2tensor as the baseline for reproducing the paper's results, since it is the official implementation.

The paper's original implementation: tensor2tensor (using TensorFlow)

Code
Code annotation
Steps to reproduce WMT14 English-German result:

(updated on v1.10.0)

# 1. Install tensor2tensor toolkit
pip install tensor2tensor

# 2. Basic config
# For BPE model use this problem
PROBLEM=translate_ende_wmt_bpe32k
MODEL=transformer
HPARAMS=transformer_base
# or use transformer_big to reproduce the big model
# HPARAMS=transformer_big
DATA_DIR=$HOME/t2t_data
TMP_DIR=/tmp/t2t_datagen
TRAIN_DIR=$HOME/t2t_train/$PROBLEM/$MODEL-$HPARAMS

mkdir -p $DATA_DIR $TMP_DIR $TRAIN_DIR

# 3. Download and preprocess corpus
# Note that tensor2tensor has an inner tokenizer
t2t-datagen \
  --data_dir=$DATA_DIR \
  --tmp_dir=$TMP_DIR \
  --problem=$PROBLEM

# 4. Train on 8 GPUs. You'll get nearly expected performance after ~250k steps and certainly expected performance after ~500k steps.
t2t-trainer \
  --data_dir=$DATA_DIR \
  --problem=$PROBLEM \
  --model=$MODEL \
  --hparams_set=$HPARAMS \
  --output_dir=$TRAIN_DIR \
  --train_steps=600000

# 5. Translate
DECODE_FILE=$TMP_DIR/newstest2014.tok.bpe.32000.en
BEAM_SIZE=4
ALPHA=0.6

t2t-decoder \
  --data_dir=$DATA_DIR \
  --problem=$PROBLEM \
  --model=$MODEL \
  --hparams_set=$HPARAMS \
  --output_dir=$TRAIN_DIR \
  --decode_hparams="beam_size=$BEAM_SIZE,alpha=$ALPHA" \
  --decode_from_file=$DECODE_FILE \
  --decode_to_file=$TMP_DIR/newstest2014.en.tok.32kbpe.transformer_base.beam4.alpha0.6.decode

# 6. Debpe
cat $TMP_DIR/newstest2014.en.tok.32kbpe.transformer_base.beam4.alpha0.6.decode | sed 's/@@ //g' > $TMP_DIR/newstest2014.en.tok.32kbpe.transformer_base.beam4.alpha0.6.decode.debpe
# Do compound splitting on the translation
perl -ple 's{(\S)-(\S)}{$1 ##AT##-##AT## $2}g' < $TMP_DIR/newstest2014.en.tok.32kbpe.transformer_base.beam4.alpha0.6.decode.debpe > $TMP_DIR/newstest2014.en.tok.32kbpe.transformer_base.beam4.alpha0.6.decode.debpe.atat
# Do the same compound splitting on the ground truth and then score BLEU
# ...

Note that step 6 is a postprocessing step. For historical reasons, Google split compound words before computing the final BLEU score, which brings a moderate increase. See get_ende_bleu.sh for more details.
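
For reference, a minimal sketch of the remaining commands, assuming a tokenized German reference newstest2014.de.tok and mosesdecoder's multi-bleu.perl are available locally (both paths are placeholders):

# apply the same ##AT## compound splitting to the tokenized reference
perl -ple 's{(\S)-(\S)}{$1 ##AT##-##AT## $2}g' < newstest2014.de.tok > newstest2014.de.tok.atat
# score BLEU on the compound-split texts, as get_ende_bleu.sh does
perl multi-bleu.perl newstest2014.de.tok.atat < $TMP_DIR/newstest2014.en.tok.32kbpe.transformer_base.beam4.alpha0.6.decode.debpe.atat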

If you have only 1 GPU, you can use the transformer_base_multistep8 hparams set to imitate 8 GPUs.

transformer_base_multistep8

You can also modify the transformer_base_multistep8 function to accumulate gradients as many times as you need. Here is an example that runs the Transformer big model on 4 GPUs; note that hparams.optimizer_multistep_accumulate_steps = 2 because with 4 GPUs we only need to accumulate gradients twice.

@registry.register_hparams
def transformer_base_multistep8():
  """HParams for the big model, simulating the 8-GPU setup on 4 GPUs with MultistepAdam."""
  hparams = transformer_big()
  hparams.optimizer = "MultistepAdam"
  hparams.optimizer_multistep_accumulate_steps = 2
  return hparams
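
As a reference, a training command using this modified hparams set might look like the following (a sketch only; the --worker_gpu flag is assumed to set the number of GPUs in your tensor2tensor version):

t2t-trainer \
  --data_dir=$DATA_DIR \
  --problem=$PROBLEM \
  --model=$MODEL \
  --hparams_set=transformer_base_multistep8 \
  --worker_gpu=4 \
  --output_dir=$TRAIN_DIR \
  --train_steps=600000
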
Resources

Harvard NLP Group's implementation: OpenNMT-py (using PyTorch)

Code
Steps to reproduce WMT14 English-German result:

(updated on v0.5.0)

For the meaning of the command arguments, see the OpenNMT-py doc or OpenNMT-py's opts.py.

  1. Download the corpus preprocessed by OpenNMT and the sentencepiece model provided by OpenNMT. Note that the preprocessing includes tokenization and a BPE/word-piece step (here using Google's sentencepiece, which implements the word-piece algorithm); see the OpenNMT-tf script for more details.

  2. Preprocess. Because English and German are similar languages, we use -share_vocab here to share the vocabulary between the source and target languages; you don't need this flag for distant language pairs such as Chinese-English. We also use a maximum sequence length of 100, which covers almost all sentences given the sentence-length distribution of the corpus. For example:

    python preprocess.py \
        -train_src ../wmt-en-de/train.en.shuf \
        -train_tgt ../wmt-en-de/train.de.shuf \
        -valid_src ../wmt-en-de/valid.en \
        -valid_tgt ../wmt-en-de/valid.de \
        -save_data ../wmt-en-de/processed \
        -src_seq_length 100 \
        -tgt_seq_length 100 \
        -max_shard_size 200000000 \
        -share_vocab
  3. Train. For example, if you only have 4 GPUs:

    python  train.py -data /tmp/de2/data -save_model /tmp/extra \
        -layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8  \
        -encoder_type transformer -decoder_type transformer -position_encoding \
        -train_steps 200000  -max_generator_batches 2 -dropout 0.1 \
        -batch_size 4096 -batch_type tokens -normalization tokens  -accum_count 2 \
        -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 \
        -max_grad_norm 0 -param_init 0  -param_init_glorot \
        -label_smoothing 0.1 -valid_steps 10000 -save_checkpoint_steps 10000 \
        -world_size 4 -gpu_ranks 0 1 2 3 

    Note that -accum_count means the loss is accumulated over N batches before each backward pass, so it is 2 for 4 GPUs (4 GPUs × 4096 tokens × 2 matches the paper's 8 × 4096-token effective batch), 8 for 1 GPU, and so on.

  4. Translate. You can set -batch_size (default 30) larger to speed up translation. For example:

    python translate.py -gpu 0 -replace_unk -alpha 0.6 -beta 0.0 -beam_size 5 -length_penalty wu -coverage_penalty wu \
         -share_vocab vocab_file -max_length 200 -model model_file -src newstest2014.en.32kspe -output model.pred -verbose

    Note that the test set in the corpus preprocessed by OpenNMT is newstest2017, while the original paper uses newstest2014, which may be a mistake. To obtain the newstest2014 test set as in the paper, we can use sentencepiece to encode newstest2014.en manually. You can find <model_file> in the archive downloaded in step 1.

    spm_encode --model=<model_file> --output_format=piece < newstest2014.en > newstest2014.en.32kspe
  5. Detokenization. Since the training data is processed by sentencepiece, the translation from step 4 is in sentencepiece-encoded form, so we need a decoding step to obtain a detokenized plain-text prediction. For example:

    spm_decode --model=<model_file> --input_format=piece < input > output
  6. Postprocess
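
    As a minimal sketch (the file names below are placeholders for your own outputs and reference), the detokenized prediction from step 5 can be scored directly with sacrebleu on untokenized text, or you can apply the compound-splitting recipe from the tensor2tensor section above before scoring:

    # score the detokenized prediction against the plain-text reference
    cat model.pred.detok | sacrebleu -tok 13a -lc newstest2014.de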

There is also a BPE version of the WMT'16 EN-DE corpus preprocessed by Google. See subword-nmt for BPE encoding and decoding.

Resources

FAIR's implementation: fairseq-py (using PyTorch)

Code
Steps to reproduce WMT14 English-German result:

(updated on commit 7e60d45)

For the meaning of the arguments, see the doc. Note that --update-freq makes training accumulate the loss over N batches before each backward pass, so it is 8 for 1 GPU, 2 for 4 GPUs, and so on.

  1. Download the preprocessed WMT'16 EN-DE data provided by Google and extract it.

    TEXT=wmt16_en_de_bpe32k
    mkdir $TEXT
    tar -xzvf wmt16_en_de.tar.gz -C $TEXT
    
  2. Preprocess the dataset with a joined dictionary

    python preprocess.py --source-lang en --target-lang de \
            --trainpref $TEXT/train.tok.clean.bpe.32000 \
            --validpref $TEXT/newstest2013.tok.bpe.32000 \
            --testpref $TEXT/newstest2014.tok.bpe.32000 \
            --destdir data-bin/wmt16_en_de_bpe32k \
            --joined-dictionary
    
  3. Train. For the base model:

    # train about 180k steps
    python train.py data-bin/wmt16_en_de_bpe32k \
        --arch transformer_wmt_en_de --share-all-embeddings \
        --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
        --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
        --lr 0.0007 --min-lr 1e-09 \
        --weight-decay 0.0 --criterion label_smoothed_cross_entropy \
        --label-smoothing 0.1 --max-tokens 4096 --update-freq 2 \
        --no-progress-bar --log-format json --log-interval 10 --save-interval-updates 1000 \
        --keep-interval-updates 5
    # average last 5 checkpoints
    modelfile=checkpoints
    python scripts/average_checkpoints.py --inputs $modelfile --num-update-checkpoints 5 \
        --output $modelfile/average-model.pt
    

    For the big model:

    # train about 270k steps
    python train.py data-bin/wmt16_en_de_bpe32k \
        --arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
        --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
        --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
        --lr 0.0005 --min-lr 1e-09 \
        --weight-decay 0.0 --criterion label_smoothed_cross_entropy \
        --label-smoothing 0.1 --max-tokens 4096 --update-freq 2 \
        --no-progress-bar --log-format json --log-interval 10 --save-interval-updates 1000 \
        --keep-interval-updates 20
    # average last 20 checkpoints
    modelfile=checkpoints
    python scripts/average_checkpoints.py --inputs $modelfile --num-update-checkpoints 20 \
        --output $modelfile/average-model.pt
    
  4. Inference

    model=average-model.pt
    subset=test
    python generate.py data-bin/wmt16_en_de_bpe32k --path $modelfile/$model \
        --gen-subset $subset --beam 4 --batch-size 128 --remove-bpe --lenpen 0.6 > pred.de
    # because fairseq's output is unordered, we need to recover its order
    # (write the sorted output to a new file; redirecting back into pred.de would truncate it before grep reads it)
    grep ^H pred.de | cut -f1,3- | cut -c3- | sort -k1n | cut -f2- > pred.sorted.de
    
  5. Postprocess
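
    A minimal sketch of this step, mirroring the tensor2tensor postprocessing above (the tokenized reference file name is a placeholder; adjust it to wherever your reference lives):

    # apply the ##AT## compound splitting to both the prediction and the reference, then score
    perl -ple 's{(\S)-(\S)}{$1 ##AT##-##AT## $2}g' < pred.sorted.de > pred.sorted.de.atat
    perl -ple 's{(\S)-(\S)}{$1 ##AT##-##AT## $2}g' < newstest2014.tok.de > newstest2014.tok.de.atat
    perl multi-bleu.perl newstest2014.tok.de.atat < pred.sorted.de.atat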

Resources

Complex, not necessarily performance-reproducible implementations

  • Marian (a pure C++ implementation without any deep learning framework)

Training tips

Further

Contributors

This project is developed and maintained by the Natural Language Processing Group, ICT/CAS.

More Repositories

1

LLaMA-Omni

LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve speech capabilities at the GPT-4o level.
Python
1,889
star
2

StreamSpeech

StreamSpeech is an “All in One” seamless model for offline and simultaneous speech recognition, speech translation and speech synthesis.
Python
882
star
3

BayLing

BayLing ("百聆") is an English/Chinese LLM based on LLaMA and equipped with advanced language alignment, showing superior capability in English/Chinese generation, instruction following and multi-turn interaction, and achieving about 90% of ChatGPT's performance on multilingual and general tasks.
Python
294
star
4

TruthX

Code for ACL 2024 paper "TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space"
Python
105
star
5

DialoFlow

Code for ACL 2021 main conference paper "Conversations Are Not Flat: Modeling the Dynamic Information Flow across Dialogue Utterances".
Python
93
star
6

NAST-S2x

A fast speech-to-any translation model that supports simultaneous decoding and offers 28× speedup.
Python
60
star
7

DASpeech

Code for NeurIPS 2023 paper "DASpeech: Directed Acyclic Transformer for Fast and High-quality Speech-to-Speech Translation".
Python
56
star
8

DSTC8-AVSD

We ranked 1st in the DSTC8 Audio-Visual Scene-Aware Dialog competition. This is the source code for our IEEE/ACM TASLP (AAAI2020-DSTC8-AVSD) paper "Bridging Text and Video: A Universal Multimodal Transformer for Video-Audio Scene-Aware Dialog".
Python
55
star
9

OR-NMT

Source Code for ACL2019 paper <Bridging the Gap between Training and Inference for Neural Machine Translation>
Python
42
star
10

STEMM

Code for ACL 2022 main conference paper "STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation".
Python
35
star
11

DiSeg

Source code for ACL 2023 paper "End-to-End Simultaneous Speech Translation with Differentiable Segmentation"
Python
32
star
12

BoN-NAT

Python
23
star
13

Seq-NAT

Source code for <Sequence-Level Training for Non-Autoregressive Neural Machine Translation>.
Python
23
star
14

PLUVR

Code for ACL 2022 main conference paper "Neural Machine Translation with Phrase-Level Universal Visual Representations".
Python
21
star
15

HMT

Source code for ICLR 2023 spotlight paper "Hidden Markov Transformer for Simultaneous Machine Translation"
Python
21
star
16

ComSpeech

Code for ACL 2024 main conference paper "Can We Achieve High-quality Direct Speech-to-Speech Translation Without Parallel Speech Data?".
Python
21
star
17

TLAT-NMT

Source code for the EMNLP 2020 long paper <Token-level Adaptive Training for Neural Machine Translation>.
Python
20
star
18

NMLA-NAT

Code for NeurIPS 2022 Spotlight paper " Non-Monotonic Latent Alignments for CTC-Based Non-Autoregressive Machine Translation"
Python
20
star
19

AIH

Code for Findings of ACL 2021 paper "Addressing Inquiries about History: An Efficient and Practical Framework for Evaluating Open-domain Chatbot Consistency".
Python
19
star
20

RSI-NAT

Source code for "Retrieving Sequential Information for Non-Autoregressive Neural Machine Translation"
Python
19
star
21

DiverseNMT

Source code for the AAAI 2020 long paper <Modeling Fluency and Faithfulness for Diverse Neural Machine Translation>.
Python
19
star
22

PTE-NMT

Source code for the NAACL 2021 paper: Pruning-then-Expanding Model for Domain Adaptation of Neural Machine Translation
Python
15
star
23

CMOT

Code for ACL 2023 main conference paper "CMOT: Cross-modal Mixup via Optimal Transport for Speech Translation"
Python
14
star
24

LNMT-CA

Code for EMNLP 2022 main conference paper "Low-resource Neural Machine Translation with Cross-modal Alignment".
Jupyter Notebook
14
star
25

CRESS

Code for ACL 2023 main conference paper "Understanding and Bridging the Modality Gap for Speech Translation".
Python
14
star
26

ITST

Code for EMNLP 2022 main conference paper "Information-Transport-based Policy for Simultaneous Translation"
Python
14
star
27

SiLLM

SiLLM is a Simultaneous Machine Translation (SiMT) Framework. It utilizes a Large Language model as the translation model and employs a traditional SiMT model for policy-decision to achieve SiMT through collaboration.
Python
14
star
28

TACS

Source code for Truth-Aware Context Selection: Mitigating the Hallucinations of Large Language Models Being Misled by Untruthful Contexts
Python
13
star
29

BT4ST

Code for ACL 2023 main conference paper "Back Translation for Speech-to-text Translation Without Transcripts".
Python
12
star
30

Dual-Path

Code for ACL 2022 main conference paper "Modeling Dual Read/Write Paths for Simultaneous Machine Translation"
Python
12
star
31

NA-MNMT

Source code for "Importance-based Neuron Allocation for Multilingual Neural Machine Translation"
Python
12
star
32

Convex-Learning

Code for NeurIPS 2023 paper "Beyond MLE: Convex Learning for Text Generation"
Python
12
star
33

GMA

Code for ACL 2022 findings paper "Gaussian Multi-head Attention for Simultaneous Machine Translation"
Python
11
star
34

PCFG-NAT

Code for NeurIPS 2023 paper "Non-autoregressive Machine Translation with Probabilistic Context-free Grammar".
Cuda
10
star
35

NAST

Official implementation for EMNLP 2023 paper "Non-autoregressive Streaming Transformer for Simultaneous Translation"
Python
10
star
36

COKD

Code for ACL 2022 main conference paper "Overcoming Catastrophic Forgetting beyond Continual Learning: Balanced Training for Neural Machine Translation".
JavaScript
9
star
37

MoE-Waitk

Code for EMNLP 2021 oral paper "Universal Simultaneous Machine Translation with Mixture-of-Experts Wait-k Policy"
Python
8
star
38

DDRS-NAT

Code for NAACL2022 main conference paper "One Reference Is Not Enough: Diverse Distillation with Reference Selection for Non-Autoregressive Translation"
Python
8
star
39

SU4MT

Code for EMNLP 2023 paper "Enhancing Neural Machine Translation with Semantic Units"
Python
8
star
40

SeerForcingNMT

Source code for "Guiding Teacher Forcing with Seer Forcing for Neural Machine Translation"
7
star
41

Zero-MNMT

Python
7
star
42

Multiscale-Contextualization

ACL2024 Integrating Multi-scale Contextualized Information for Byte-based Neural Machine Translation
Python
7
star
43

Wait-info

Source code for our EMNLP 2022 paper "Wait-info Policy: Balancing Source and Target at Information Level for Simultaneous Machine Translation"
Python
7
star
44

BS-SiMT

Source code for our ACL 2023 paper "Learning Optimal Policy for Simultaneous Machine Translation via Binary Search"
Python
7
star
45

CTC-S2UT

Code for ACL 2024 findings paper "CTC-based Non-autoregressive Textless Speech-to-Speech Translation"
7
star
46

GS4NMT

source code for "Greedy Search with Probabilistic N-gram Matching for Neural Machine Translation"
Python
6
star
47

DST

DST is a Decoder-only simultaneous machine translation model, which can conduct policy decision and translation concurrently
Python
6
star
48

LFR-NMT

Source code for the EMNLP 2022 paper "Continual Learning of Neural Machine Translation within Low Forgetting Risk Regions"
Python
5
star
49

CPDecoder

optimize the decoder of the neural machine translation model by the cube pruning algorithm
Python
5
star
50

nar-tutorial

Slides for NAR tutorial
4
star
51

CAPT

Code for EMNLP 2022 main conference paper "Counterfactual Data Augmentation via Perspective Transition for Open-Domain Dialogues".
4
star
52

PED-SiMT

Code for "Turning Fixed to Adaptive: Integrating Post-Evaluation into Simultaneous Machine Translation"
Python
4
star
53

Glance-SiMT

Python
3
star
54

SAMMT

Code for EMNLP 2023 paper "Bridging the Gap between Synthetic and Authentic Images for Multimodal Machine Translation"
Python
3
star
55

RN4NMT

source code for "Refining Source Representations with Relation Networks for Neural Machine Translation".
PLSQL
2
star
56

Seg2Seg

2
star
57

Tailored-Ref

Code for EMNLP 2023 paper "Simultaneous Machine Translation with Tailored Reference"
2
star
58

SemLing-MNMT

Code for ACL 2024 paper "Improving Multilingual Neural Machine Translation by Utilizing Semantic and Linguistic Features".
Python
2
star
59

ComSpeech-Site

JavaScript
1
star
60

corpus_NKD

data repo for "Knowledge Diffusion for Neural Dialogue Generation"
1
star
61

TruthX-site

HTML
1
star
62

StreamSpeech-site

JavaScript
1
star
63

Rephraser-NAT

Code for AAAI 2023 paper "Rephrasing the Reference for Non-Autoregressive Machine Translation"
1
star
64

Auto-RAG

Python
1
star