SimAlign: Similarity Based Word Aligner

Obtain word alignments using pretrained language models (e.g., mBERT).

[Figure: alignment example]

SimAlign is a high-quality word alignment tool that uses static and contextualized embeddings and does not require parallel training data.

The following table shows how it compares to popular statistical alignment models:

Method        ENG-CES  ENG-DEU  ENG-FAS  ENG-FRA  ENG-HIN  ENG-RON
fast-align    .78      .71      .46      .84      .38      .68
eflomal       .85      .77      .63      .93      .52      .72
mBERT-Argmax  .87      .81      .67      .94      .55      .65

Shown is F1, taking the maximum across subword and word level. For more details, see the paper.
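For reference, these scores follow the standard word-alignment evaluation over gold sure edges S and possible edges P (with S a subset of P). The exact formulation below is an assumption based on common practice (Och and Ney style), not something this README spells out:

Prec(A) = |A ∩ P| / |A|
Rec(A)  = |A ∩ S| / |S|
F1      = 2 · Prec · Rec / (Prec + Rec)

where A is the set of predicted alignment edges.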

Installation and Usage

Tested with Python 3.7, Transformers 3.1.0, and Torch 1.5.0. NetworkX 2.4 is optional (required only for the Match algorithm). For the full list of dependencies, see setup.py. For installing Transformers, see their repo.

Download the repo for use, or alternatively install it from PyPI:

pip install simalign

or directly with pip from GitHub:

pip install --upgrade git+https://github.com/cisnlp/simalign.git#egg=simalign

An example of using our code:

from simalign import SentenceAligner

# Make an instance of our model.
# You can specify the embedding model and all alignment settings in the constructor.
myaligner = SentenceAligner(model="bert", token_type="bpe", matching_methods="mai")

# The source and target sentences should be tokenized to words.
src_sentence = ["This", "is", "a", "test", "."]
trg_sentence = ["Das", "ist", "ein", "Test", "."]

# The output is a dictionary with different matching methods.
# Each method maps to a list of index pairs for the aligned words (zero-indexed).
alignments = myaligner.get_word_aligns(src_sentence, trg_sentence)

for matching_method in alignments:
    print(matching_method, ":", alignments[matching_method])

# Expected output:
# mwmf (Match): [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
# inter (ArgMax): [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
# itermax (IterMax): [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]

For more examples of how to use our code see scripts/align_example.py.
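If you only need a single matcher, the matching_methods string can be restricted accordingly. A minimal sketch, assuming the constructor accepts any subset of the "mai" letters used above ("m" = Match/mwmf, "a" = ArgMax/inter, "i" = IterMax/itermax); the second sentence pair is hypothetical, added for illustration:

from simalign import SentenceAligner

# Request only the ArgMax matcher ("a"); its output key is "inter",
# as in the expected output shown above.
argmax_aligner = SentenceAligner(model="bert", token_type="bpe",
                                 matching_methods="a")

pairs = [
    (["This", "is", "a", "test", "."], ["Das", "ist", "ein", "Test", "."]),
    (["I", "drink", "coffee", "."], ["Ich", "trinke", "Kaffee", "."]),
]

for src, trg in pairs:
    aligns = argmax_aligner.get_word_aligns(src, trg)
    print(aligns["inter"])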

Demo

An online demo is available here.

Gold Standards

Links to the gold standards used in the paper are listed below:

Language Pair  Citation                       Type            Link
ENG-CES        Marecek et al. 2008            Gold Alignment  http://ufal.mff.cuni.cz/czech-english-manual-word-alignment
ENG-DEU        EuroParl-based                 Gold Alignment  www-i6.informatik.rwth-aachen.de/goldAlignment/
ENG-FAS        Tavakoli et al. 2014           Gold Alignment  http://eceold.ut.ac.ir/en/node/940
ENG-FRA        WPT2003, Och et al. 2000       Gold Alignment  http://web.eecs.umich.edu/~mihalcea/wpt/
ENG-HIN        WPT2005                        Gold Alignment  http://web.eecs.umich.edu/~mihalcea/wpt05/
ENG-RON        WPT2005, Mihalcea et al. 2003  Gold Alignment  http://web.eecs.umich.edu/~mihalcea/wpt05/

Evaluation Script

For evaluating the output alignments use scripts/calc_align_score.py.

The gold alignment file should have the same format as SimAlign outputs. Sure alignment edges in the gold standard have a '-' between the source and target indices, and possible edges have a 'p' between the indices. For sample parallel sentences and their gold alignments from ENG-DEU, see samples.
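For illustration, a gold line for the five-word example above might read "0-0 1-1 2-2 3-3 4p4" (the last edge marked as possible). A minimal sketch for writing SimAlign output in this edge-list format; pairs_to_line is a hypothetical helper, and the exact I/O conventions of calc_align_score.py should be checked against the script itself:

def pairs_to_line(pairs):
    # Hypothetical helper: format zero-indexed (src, trg) pairs in the
    # sure-edge notation described above, e.g. "0-0 1-1 2-2".
    # Possible edges would use 'p' instead of '-'.
    return " ".join("{}-{}".format(s, t) for s, t in pairs)

alignments = myaligner.get_word_aligns(src_sentence, trg_sentence)
print(pairs_to_line(alignments["inter"]))  # -> 0-0 1-1 2-2 3-3 4-4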

Publication

If you use the code, please cite

@inproceedings{jalili-sabet-etal-2020-simalign,
    title = "{S}im{A}lign: High Quality Word Alignments without Parallel Training Data using Static and Contextualized Embeddings",
    author = {Jalili Sabet, Masoud  and
      Dufter, Philipp  and
      Yvon, Fran{\c{c}}ois  and
      Sch{\"u}tze, Hinrich},
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.findings-emnlp.147",
    pages = "1627--1643",
}

Feedback

Feedback and contributions are more than welcome! Just reach out to @masoudjs or @pdufter.

FAQ

Do I need parallel data to train the system?

No, no parallel training data is required.

Which languages can be aligned?

This depends on the underlying pretrained multilingual language model used. For example, if mBERT is used, it covers 104 languages as listed here.

Do I need GPUs for running this?

Each alignment requires only a single forward pass through the pretrained language model. While this is certainly faster on GPU, it runs fine on CPU. On one GPU (GeForce GTX 1080 Ti), aligning 500 parallel sentences takes around 15-20 seconds.
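A minimal sketch for running on GPU, assuming the constructor exposes a device argument (present in recent simalign releases; if your version lacks it, the model simply stays on CPU):

from simalign import SentenceAligner

# Assumed keyword: device="cuda" moves the underlying language model to GPU.
gpu_aligner = SentenceAligner(model="bert", token_type="bpe",
                              matching_methods="a", device="cuda")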

License

Copyright (C) 2020, Masoud Jalili Sabet, Philipp Dufter

A full copy of the license can be found in LICENSE.
