  • Stars: 313
  • Rank 133,714 (Top 3 %)
  • Language: Jupyter Notebook
  • License: BSD 3-Clause "New...
  • Created over 7 years ago
  • Updated about 7 years ago

Repository Details

Transformer of "Attention Is All You Need" (Vaswani et al. 2017) by Chainer.

Transformer - Attention Is All You Need

Chainer-based Python implementation of Transformer, an attention-based seq2seq model without convolution and recurrence.
If you want to see the architecture, please see net.py.

See "Attention Is All You Need", Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, arXiv, 2017.

This repository is partly derived from my convolutional seq2seq repo, which is also derived from Chainer's official seq2seq example.

Requirements

  • Python 3.6.0+
  • Chainer 2.0.0+
  • numpy 1.12.1+
  • cupy 1.0.0+ (if using a GPU)
  • nltk
  • progressbar
  • and their dependencies

All of these can be installed through pip.

Prepare Dataset

You can use any parallel corpus.
For example, run

sh download_wmt.sh

which downloads and decompresses the training and development datasets from WMT/Europarl into your current directory. These files and their paths are set as the defaults in the training script train.py.
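
A parallel corpus here means two text files in which line i of the source file corresponds to line i of the target file. The following is a minimal sketch of a sanity check for that property. The helper name `read_parallel` and the whitespace tokenization are assumptions for illustration and are not taken from train.py.

```python
def read_parallel(source_path, target_path):
    """Read two line-aligned files and return (source_tokens, target_tokens) pairs.

    Hypothetical helper for illustration; it is not part of this repository.
    """
    with open(source_path, encoding="utf-8") as f:
        source_lines = [line.strip() for line in f]
    with open(target_path, encoding="utf-8") as f:
        target_lines = [line.strip() for line in f]

    # A parallel corpus must have exactly one target line per source line.
    if len(source_lines) != len(target_lines):
        raise ValueError(
            "line counts differ: %d source vs %d target"
            % (len(source_lines), len(target_lines)))

    # Whitespace tokenization (an assumption); skip pairs with an empty side.
    pairs = []
    for src, tgt in zip(source_lines, target_lines):
        if src and tgt:
            pairs.append((src.split(), tgt.split()))
    return pairs
```

Running such a check before training can catch a misaligned pair of files early, since a single missing line shifts every later sentence pair.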

How to Run

PYTHONIOENCODING=utf-8 python -u train.py -g=0 -i DATA_DIR -o SAVE_DIR

During training, logs of loss, perplexity, word accuracy and time are printed at a certain interval, and validation tests (perplexity, and BLEU for generation) run every half epoch. A generation test is also performed and printed so that you can check training progress.
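
For readers unfamiliar with the logged metrics, here is a rough sketch of how perplexity and word accuracy are commonly computed. It assumes the loss is a cross-entropy summed over tokens and measured in nats, and that padding positions carry a special id (`ignore_id`, a hypothetical name). How train.py computes these values internally is not shown here.

```python
import math


def perplexity(total_loss, n_tokens):
    """Perplexity = exp(average per-token negative log-likelihood in nats)."""
    return math.exp(total_loss / n_tokens)


def word_accuracy(predicted_ids, reference_ids, ignore_id=-1):
    """Fraction of reference tokens predicted correctly, skipping padding.

    ignore_id marks padding positions; the value -1 is an assumption.
    """
    correct = 0
    total = 0
    for pred, ref in zip(predicted_ids, reference_ids):
        if ref == ignore_id:
            continue
        total += 1
        correct += int(pred == ref)
    return correct / total if total else 0.0
```

A model that assigns uniform probability over four candidate words at every position has a perplexity of 4, which makes the metric easier to read than the raw loss.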

Arguments

Some of them are as follows:

  • -g: your GPU id. To run on CPU, set -1.
  • -i DATA_DIR, -s SOURCE, -t TARGET, -svalid SVALID, -tvalid TVALID:
    The DATA_DIR directory needs to include a pair of training datasets, SOURCE and TARGET, and a pair of validation datasets, SVALID and TVALID. Each pair should be a parallel corpus with line-by-line sentence alignment.
  • -o SAVE_DIR: a JSON log report file and a model snapshot will be saved in the SAVE_DIR directory (if it does not exist, it will be created automatically).
  • -e: max number of epochs over the training corpus.
  • -b: minibatch size.
  • -u: size of units and word embeddings.
  • -l: number of layers in both the encoder and the decoder.
  • --source-vocab: max size of the vocabulary set of the source language.
  • --target-vocab: max size of the vocabulary set of the target language.

For the other arguments, run python train.py -h.
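
To make the flags above concrete, the sketch below declares them with argparse. Only the flag spellings come from the list above. The `dest` names, types and default values are assumptions for illustration and may differ from what train.py actually uses.

```python
import argparse


def build_parser():
    """Declare the flags listed above; dests and defaults are hypothetical."""
    p = argparse.ArgumentParser(description="Transformer training (sketch)")
    p.add_argument("-g", dest="gpu", type=int, default=-1,
                   help="GPU id; -1 runs on CPU")
    p.add_argument("-i", dest="input", default="./",
                   help="directory containing the dataset files")
    p.add_argument("-s", dest="source", help="training source file")
    p.add_argument("-t", dest="target", help="training target file")
    p.add_argument("-svalid", dest="source_valid", help="validation source file")
    p.add_argument("-tvalid", dest="target_valid", help="validation target file")
    p.add_argument("-o", dest="out", default="result",
                   help="directory for the JSON log and model snapshot")
    p.add_argument("-e", dest="epoch", type=int, default=1,
                   help="max number of epochs")
    p.add_argument("-b", dest="batchsize", type=int, default=32,
                   help="minibatch size")
    p.add_argument("-u", dest="unit", type=int, default=512,
                   help="size of units and word embeddings")
    p.add_argument("-l", dest="layer", type=int, default=6,
                   help="number of encoder and decoder layers")
    p.add_argument("--source-vocab", type=int, default=40000,
                   help="max source vocabulary size")
    p.add_argument("--target-vocab", type=int, default=40000,
                   help="max target vocabulary size")
    return p


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

With this sketch, `-g=0 -i DATA_DIR -o SAVE_DIR` parses to `gpu=0`, `input="DATA_DIR"` and `out="SAVE_DIR"`, matching the form of the command shown in How to Run.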

Note

This repository does not aim to fully reproduce the results in the paper, so I have not rigorously verified its performance. However, I expect my implementation to be almost compatible with the model described in the paper. Some differences that I am aware of are as follows:

  • Optimization/training strategy. Detailed information about batch size, parameter initialization, etc. is unclear in the paper. Additionally, the learning rate proposed in the paper may work only with a large batch size (e.g. 4000) for deep networks. I changed warmup_step from 4000 to 32000, though there is room for improvement. I also changed ReLU to leaky ReLU in the feed-forward layers for easier gradient propagation.
  • Vocabulary set, dataset, preprocessing and evaluation. This repo uses common word-based tokenization, whereas the paper uses byte-pair encoding. The size of the token set also differs. Evaluation (validation) is a little unfair and incompatible with that in the paper; e.g., even the validation set has unknown words replaced with a single "unk" token.
  • Beam search is not used in BLEU calculation.
  • Model size. The model setting in this repo is that of the "base model" in the paper, although you can modify some lines to use the "big model".
  • This code follows some settings used in the tensor2tensor repository, which includes a Transformer model. For example, the positional encoding used in that repository seems to differ from the one in the paper. This code follows the former.
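
To illustrate the effect of the warmup change mentioned in the first item, the sketch below implements the learning-rate schedule given in the paper, lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5). The function name `transformer_lr` is hypothetical, and `d_model=512` is the paper's base-model size. Whether this repository applies any additional scaling is not covered here.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate from "Attention Is All You Need" (section 5.3).

    Rises linearly for warmup_steps updates, then decays as step^-0.5.
    """
    step = max(step, 1)  # avoid 0 ** -0.5 on the first update
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


if __name__ == "__main__":
    # Compare the paper's warmup (4000) with the longer one used here (32000).
    for warmup in (4000, 32000):
        peak = transformer_lr(warmup, warmup_steps=warmup)
        early = transformer_lr(1000, warmup_steps=warmup)
        print("warmup=%d  lr@1000=%.2e  peak=%.2e" % (warmup, early, peak))
```

The peak is reached at step = warmup_steps and equals (d_model * warmup_steps)^-0.5, so a longer warmup both delays and lowers the peak learning rate, which is a gentler regime for small batches.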
