• Stars
    star
    319
  • Rank 131,491 (Top 3 %)
  • Language
    Python
  • License
    BSD 3-Clause "New...
  • Created about 4 years ago
  • Updated over 2 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

A neural word aligner based on multilingual BERT

AWESOME: Aligning Word Embedding Spaces of Multilingual Encoders

awesome-align is a tool that can extract word alignments from multilingual BERT (mBERT) [Demo] and allows you to fine-tune mBERT on parallel corpora for better alignment quality (see our paper for more details).

Dependencies

First, you need to install the dependencies:

pip install -r requirements.txt
python setup.py install

Input format

Inputs should be tokenized and each line is a source language sentence and its target language translation, separated by (|||). You can see some examples in the examples folder.

Extracting alignments

Here is an example of extracting word alignments from multilingual BERT:

DATA_FILE=/path/to/data/file
MODEL_NAME_OR_PATH=bert-base-multilingual-cased
OUTPUT_FILE=/path/to/output/file

CUDA_VISIBLE_DEVICES=0 awesome-align \
    --output_file=$OUTPUT_FILE \
    --model_name_or_path=$MODEL_NAME_OR_PATH \
    --data_file=$DATA_FILE \
    --extraction 'softmax' \
    --batch_size 32

This produces outputs in the i-j Pharaoh format. A pair i-j indicates that the ith word (zero-indexed) of the source sentence is aligned to the jth word of the target sentence.

You can set --output_prob_file if you want to obtain the alignment probability and set --output_word_file if you want to obtain the aligned word pairs (in the src_word<sep>tgt_word format). You can also set --cache_dir to specify where you want to cache multilingual BERT.

You can also set MODEL_NAME_OR_PATH to the path of your fine-tuned model as shown below.

Fine-tuning on parallel data

If there is parallel data available, you can fine-tune embedding models on that data.

Here is an example of fine-tuning mBERT that balances well between efficiency and effectiveness:

TRAIN_FILE=/path/to/train/file
EVAL_FILE=/path/to/eval/file
OUTPUT_DIR=/path/to/output/directory

CUDA_VISIBLE_DEVICES=0 awesome-train \
    --output_dir=$OUTPUT_DIR \
    --model_name_or_path=bert-base-multilingual-cased \
    --extraction 'softmax' \
    --do_train \
    --train_tlm \
    --train_so \
    --train_data_file=$TRAIN_FILE \
    --per_gpu_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --num_train_epochs 1 \
    --learning_rate 2e-5 \
    --save_steps 4000 \
    --max_steps 20000 \
    --do_eval \
    --eval_data_file=$EVAL_FILE

You can also fine-tune the model a bit longer with more training objectives for better quality:

TRAIN_FILE=/path/to/train/file
EVAL_FILE=/path/to/eval/file
OUTPUT_DIR=/path/to/output/directory

CUDA_VISIBLE_DEVICES=0 awesome-train \
    --output_dir=$OUTPUT_DIR \
    --model_name_or_path=bert-base-multilingual-cased \
    --extraction 'softmax' \
    --do_train \
    --train_mlm \
    --train_tlm \
    --train_tlm_full \
    --train_so \
    --train_psi \
    --train_data_file=$TRAIN_FILE \
    --per_gpu_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --num_train_epochs 1 \
    --learning_rate 2e-5 \
    --save_steps 10000 \
    --max_steps 40000 \
    --do_eval \
    --eval_data_file=$EVAL_FILE

If you want high alignment recalls, you can turn on the --train_co option, but note that the alignment precisions may drop. You can set --cache_dir to specify where you want to cache multilingual BERT.

Supervised settings

In supervised settings where gold word alignments are available for your training data, you can incorporate the supervised signals into our self-training objective (--train_so) and here is an example command:

TRAIN_FILE=/path/to/train/file
TRAIN_GOLD_FILE=/path/to/train/gold/file
OUTPUT_DIR=/path/to/output/directory

CUDA_VISIBLE_DEVICES=0 awesome-train \
    --output_dir=$OUTPUT_DIR \
    --model_name_or_path=bert-base-multilingual-cased \
    --extraction 'softmax' \
    --do_train \
    --train_so \
    --train_data_file=$TRAIN_FILE \
    --train_gold_file=$TRAIN_GOLD_FILE \
    --per_gpu_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --num_train_epochs 5 \
    --learning_rate 1e-4 \
    --save_steps 200

See examples/*.gold for the example format of the gold alignments. You need to turn on the --gold_one_index option if the gold alignments are 1-indexed and you can turn on the --ignore_possible_alignments option if you want to ignore possible alignments.

Model performance

The following table shows the alignment error rates (AERs) of our models and popular statistical word aligners on five language pairs. The De-En, Fr-En, Ro-En datasets can be obtained following this repo, the Ja-En data is from this link and the Zh-En data is available at this link. The best scores are in bold.

De-En Fr-En Ro-En Ja-En Zh-En
fast_align 27.0 10.5 32.1 51.1 38.1
eflomal 22.6 8.2 25.1 47.5 28.7
Mgiza 20.6 5.9 26.4 48.0 35.1
Ours (w/o fine-tuning, softmax) 17.4 5.6 27.9 45.6 18.1
Ours (multilingually fine-tuned
w/o --train_co, softmax) [Download]
15.2 4.1 22.6 37.4 13.4
Ours (multilingually fine-tuned
w/ --train_co, softmax) [Download]
15.1 4.5 20.7 38.4 14.5

Citation

If you use our tool, we'd appreciate if you cite the following paper:

@inproceedings{dou2021word,
  title={Word Alignment by Fine-tuning Embeddings on Parallel Corpora},
  author={Dou, Zi-Yi and Neubig, Graham},
  booktitle={Conference of the European Chapter of the Association for Computational Linguistics (EACL)},
  year={2021}
}

Acknowledgements

Some of the code is borrowed from HuggingFace Transformers licensed under Apache 2.0 and the entmax implementation is from this repo.

More Repositories

1

prompt2model

prompt2model - Generate Deployable Models from Natural Language Instructions
Python
1,953
star
2

Text-Summarization-Papers

An Exhaustive Paper List for Text Summarization
HTML
500
star
3

compare-mt

A tool for holistic analysis of language generations systems
Python
450
star
4

nn4nlp-concepts

A repository of concepts related to neural networks for NLP
Python
447
star
5

ExplainaBoard

Interpretable Evaluation for AI Systems
Python
361
star
6

BARTScore

BARTScore: Evaluating Generated Text as Text Generation
Python
317
star
7

knn-transformers

PyTorch + HuggingFace code for RetoMaton: "Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval" (ICML 2022), including an implementation of kNN-LM and kNN-MT
Python
249
star
8

InterpretEval

Interpretable Evaluation for (Almost) All NLP Tasks
HTML
193
star
9

ReviewAdvisor

Heavy Workload on Reviewing Papers? ReviewAdvisor Helps out
Python
192
star
10

xnmt

eXtensible Neural Machine Translation
Python
185
star
11

gemini-benchmark

Jupyter Notebook
150
star
12

RIPPLe

Code for the paper "Weight Poisoning Attacks on Pre-trained Models" (ACL 2020)
Jupyter Notebook
138
star
13

SpanNER

SpanNER: Named EntityRe-/Recognition as Span Prediction
Python
124
star
14

word-embeddings-for-nmt

Supplementary material for "When and Why Are Pre-trained Word Embeddings Useful for Neural Machine Translation?" at NAACL 2018
Python
119
star
15

guided_summarization

GSum: A General Framework for Guided Neural Abstractive Summarization
Python
112
star
16

code-bert-score

CodeBERTScore: an automatic metric for code generation, based on BERTScore
Jupyter Notebook
111
star
17

external-knowledge-codegen

Code and data for ACL20 paper "Incorporating External Knowledge through Pre-training for Natural Language to Code Generation"
Python
96
star
18

cmu-multinlp

Generalizing Natural Language Analysis through Span-relation Representations
Python
88
star
19

REALSumm

REALSumm: Re-evaluating Evaluation in Text Summarization
Python
71
star
20

langrank

A program to choose transfer languages for cross-lingual learning
Python
66
star
21

retomaton

PyTorch code for the RetoMaton paper: "Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval" (ICML 2022)
Python
60
star
22

dynet-benchmark

Benchmarks for DyNet
Python
56
star
23

newlang-tech

A guide to building language technology in new languages.
56
star
24

ragged

Retrieval Augmented Generation Generalized Evaluation Dataset
Jupyter Notebook
51
star
25

contextual-mt

A repository with the code related to experiments around context-aware machine translation
Python
48
star
26

extreme-adaptation-for-personalized-translation

Code for the paper "Extreme Adaptation for Personalized Neural Machine Translation"
Python
43
star
27

lrlm

Code for the paper "Latent Relation Language Models" at AAAI-20.
Python
41
star
28

incremental_tree_edit

Code for "Learning Structural Edits via Incremental Tree Transformations" (ICLR'21)
Python
40
star
29

wikiasp

Code for WikiAsp: Multi-document aspect-based summarization.
Shell
39
star
30

tranX-plugin

A plugin for code generation in PyCharm/IntelliJ using tranX
Java
35
star
31

neural-lpcfg

The Return of Lexical Dependencies: Neural Lexicalized PCFGs (TACL)
Python
33
star
32

covid19-datashare

A repo for sharing language resources related to the outbreak (in machine readable format)
GLSL
27
star
33

ToM-Language-Acquisition

Code used to run experiments for the ICLR 2023 paper "Computational Language Acquisition with Theory of Mind".
Python
14
star
34

cmulab

CMU Linguistic Annotation Backend
Python
14
star
35

AfricanVoices

Hosts text-to-speech corpus and speech synthesizers for African languages.
Shell
12
star
36

cmu-ner

NER System Developed at CMU
Python
12
star
37

lti-llm-deployment

Python
12
star
38

explainaboard_web

Mustache
8
star
39

KGxBoard

Explainable and Interactive Leaderboard for Evaluation of Knowledge Graph Completion Models
6
star
40

DGT

WNGT 2019, DGT Task.
Python
6
star
41

tranx-study

HTML
5
star
42

Reliable-NLPPP

Jupyter Notebook
5
star
43

cord19

cord19 related stuff
Python
5
star
44

globalbench

GlobalBench: A Benchmark for Global Progress in Language Technology
Python
5
star
45

jsalt2019-informal

A repository for random things from the JSALT informal translation group
Python
5
star
46

cmu-edl

Python
3
star
47

code-mining

Stuff for code mining
OpenEdge ABL
2
star
48

ocr-web-interface

OCR web interface using CMULAB backend
JavaScript
1
star
49

explainaboard_client

Python
1
star