  • Stars: 703
  • Rank: 64,412 (Top 2%)
  • Language: Python
  • License: BSD 3-Clause "New...
  • Created over 7 years ago
  • Updated over 2 years ago

Repository Details

An open-source neural machine translation toolkit developed by Tsinghua Natural Language Processing Group

THUMT: An Open Source Toolkit for Neural Machine Translation

Introduction

Machine translation is a natural language processing task that aims to translate between natural languages automatically using computers. Recent years have witnessed the rapid development of end-to-end neural machine translation, which has become the new mainstream method in practical MT systems.

THUMT is an open-source toolkit for neural machine translation developed by the Natural Language Processing Group at Tsinghua University. The website of THUMT is: http://thumt.thunlp.org/.

Online Demo

The online demo of THUMT is available at http://translate.thumt.cn/. The languages involved include Ancient Chinese, Arabic, Chinese, English, French, German, Indonesian, Japanese, Portuguese, Russian, and Spanish.

Implementations

THUMT currently has three main implementations: THUMT-PyTorch, THUMT-TensorFlow, and THUMT-Theano.

The following table summarizes the features of the three implementations:

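| Implementation | Model                           | Criterion     | Optimizer           | LRP                    |
|----------------|---------------------------------|---------------|---------------------|------------------------|
| Theano         | RNNsearch                       | MLE, MRT, SST | SGD, AdaDelta, Adam | RNNsearch              |
| TensorFlow     | Seq2Seq, RNNsearch, Transformer | MLE           | Adam                | RNNsearch, Transformer |
| PyTorch        | Transformer                     | MLE           | SGD, Adadelta, Adam | N.A.                   |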

We recommend using THUMT-PyTorch or THUMT-TensorFlow, which deliver better translation performance than THUMT-Theano. We will keep adding new features to THUMT-PyTorch and THUMT-TensorFlow.
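
For readers unfamiliar with the MLE criterion listed in the table, the sketch below shows the general idea: the loss is the negative log-likelihood of the reference translation under the model. It is an illustration in plain Python, not code from THUMT. The function name `mle_loss`, the list-of-lists probability format, and the per-token averaging are all assumptions made for this example.

```python
import math
from typing import Sequence


def mle_loss(token_probs: Sequence[Sequence[float]], reference: Sequence[int]) -> float:
    """Negative log-likelihood of the reference tokens, averaged over positions.

    token_probs[t][v] is the (hypothetical) model probability of vocabulary
    item v at target position t; reference[t] is the correct token id.
    """
    if len(token_probs) != len(reference):
        raise ValueError("one probability distribution is needed per reference token")
    if not reference:
        raise ValueError("reference must contain at least one token")

    nll = 0.0
    for distribution, token in zip(token_probs, reference):
        # Penalize the model by -log p(correct token) at each position.
        nll -= math.log(distribution[token])
    # Averaging per token is an assumption; summing is an equally common choice.
    return nll / len(reference)
```

Minimizing this quantity over a parallel corpus is what maximum likelihood training amounts to; a loss of zero means the model assigned probability 1 to every reference token.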

Notable Features

  • Transformer (Vaswani et al., 2017)
  • Multi-GPU training & decoding
  • Multi-worker distributed training
  • Mixed precision training & decoding
  • Model ensemble & averaging
  • Gradient aggregation
  • TensorBoard for visualization
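
Two of the features above, model averaging and gradient aggregation, reduce to simple element-wise means. The sketch below illustrates both with plain Python lists standing in for tensors. It is not THUMT's implementation: the names `average_checkpoints` and `aggregate_gradients` are hypothetical, and whether THUMT averages or sums aggregated gradients is not stated here, so averaging is an assumption.

```python
from typing import Dict, List, Sequence

# Hypothetical stand-in for a checkpoint: parameter name -> flat list of values.
Params = Dict[str, List[float]]


def average_checkpoints(checkpoints: Sequence[Params]) -> Params:
    """Return the element-wise mean of checkpoints that share names and shapes."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    averaged: Params = {}
    for name in checkpoints[0]:
        # Group the i-th value of this parameter across all checkpoints.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(values) / len(checkpoints) for values in columns]
    return averaged


def aggregate_gradients(micro_batch_grads: Sequence[List[float]]) -> List[float]:
    """Average gradients from several micro-batches before one optimizer step.

    This mimics a larger effective batch size when memory limits the size of
    each individual batch. Averaging (rather than summing) is an assumption.
    """
    steps = len(micro_batch_grads)
    if steps == 0:
        raise ValueError("need at least one micro-batch gradient")
    return [sum(values) / steps for values in zip(*micro_batch_grads)]
```

For example, averaging `{"w": [1.0, 2.0]}` and `{"w": [3.0, 4.0]}` gives `{"w": [2.0, 3.0]}`. In a real toolkit the same operations would run on framework tensors rather than Python lists.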

Documentation

The documentation for the PyTorch implementation is available here.

License

The source code is dual licensed. Open source licensing is under the BSD-3-Clause, which allows free use for research purposes. For commercial licensing, please email [email protected].

Citation

Please cite the following papers:

Zhixing Tan, Jiacheng Zhang, Xuancheng Huang, Gang Chen, Shuo Wang, Maosong Sun, Huanbo Luan, Yang Liu. 2020. THUMT: An Open Source Toolkit for Neural Machine Translation. AMTA 2020.

Jiacheng Zhang, Yanzhuo Ding, Shiqi Shen, Yong Cheng, Maosong Sun, Huanbo Luan, Yang Liu. 2017. THUMT: An Open Source Toolkit for Neural Machine Translation. arXiv:1706.06415.

Development Team

Project leaders: Maosong Sun, Yang Liu, Huanbo Luan

Project members:

Theano: Jiacheng Zhang, Yanzhuo Ding, Shiqi Shen, Yong Cheng

TensorFlow: Zhixing Tan, Jiacheng Zhang, Xuancheng Huang, Gang Chen, Shuo Wang, Zonghan Yang

PyTorch: Zhixing Tan, Gang Chen

Contact

If you have questions, suggestions, or bug reports, please email [email protected].

Derivative Repositories

  • UCE4BT (Improving Back-Translation with Uncertainty-based Confidence Estimation)
  • L2Copy4APE (Learning to Copy for Automatic Post-Editing)
  • Document-Transformer (Improving the Transformer Translation Model with Document-Level Context)
  • PR4NMT (Prior Knowledge Integration for Neural Machine Translation using Posterior Regularization)
