A tool for holistic analysis of language generation systems

compare-mt

by NeuLab @ CMU LTI, and other contributors

compare-mt (for "compare my text") is a program to compare the output of multiple systems for language generation, including machine translation, summarization, dialog response generation, etc. To use it you need to have, in text format, a "correct" reference, and the output of two different systems. Based on this, compare-mt will run a number of analyses that attempt to pick out salient differences between the systems, which will make it easier for you to figure out what things one system is doing better than another.

Basic Usage

First, you need to install the package:

# Requirements
pip install -r requirements.txt
# Install the package
python setup.py install

Then, as an example, you can run this over two included system outputs.

compare-mt --output_directory output/ example/ted.ref.eng example/ted.sys1.eng example/ted.sys2.eng

This will print some statistics to the command line, and also write a formatted HTML report to output/. Here, system 1 and system 2 are the baseline phrase-based and neural Slovak-English systems from our EMNLP 2018 paper. The statistics include:

  • Aggregate Scores: A report on overall BLEU scores and length ratios
  • Word Accuracy Analysis: A report on the F-measure of words by frequency bucket
  • Sentence Bucket Analysis: Bucket sentences by various statistics (e.g. sentence BLEU, length difference with the reference, overall length), and calculate statistics by bucket (e.g. number of sentences, BLEU score per bucket)
  • N-gram Difference Analysis: Calculate which n-grams one system is consistently translating better
  • Sentence Examples: Find sentences where one system is doing better than the other according to sentence BLEU
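
To make the sentence bucket idea concrete, here is a minimal sketch that buckets sentences by the length difference between output and reference. It is not compare-mt's own code: the function name bucket_by_length_diff and whitespace tokenization are assumptions made for illustration.

```python
from collections import Counter


def bucket_by_length_diff(ref_sents, out_sents):
    """Count sentences per (output length - reference length) bucket.

    Hypothetical helper; tokens are assumed to be whitespace-separated.
    """
    counts = Counter()
    for ref, out in zip(ref_sents, out_sents):
        diff = len(out.split()) - len(ref.split())
        counts[diff] += 1
    # Sort by length difference so the buckets read from "too short" to "too long".
    return dict(sorted(counts.items()))


refs = ["the cat sat on the mat", "hello world", "a b c"]
outs = ["the cat sat on mat", "hello world", "a b c d e"]
print(bucket_by_length_diff(refs, outs))
```

A per-bucket statistic (such as a score computed only over the sentences in that bucket) could then be reported next to each count.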

You can see an example of running this analysis (as well as the more advanced analysis below) either through a generated HTML report here, or in the following narrated video:

To summarize the results that immediately stick out from the basic analysis:

  • From the aggregate scores we can see that the BLEU of neural MT is higher, but its sentences are slightly shorter.
  • From the word accuracy analysis we can see that phrase-based MT is better at low-frequency words.
  • From the sentence bucket analysis we can see that neural seems to be better at translating shorter sentences.
  • From the n-gram difference analysis we can see that there are a few words that neural MT is not good at but phrase-based MT gets right (e.g. "phantom"), while there are a few long phrases that neural MT does better with (e.g. "going to show you").

If you run on your own data, you might be able to find more interesting things about your own systems. Try comparing your modified system with your baseline and seeing what you find!

Other Options

There are many options that can be used to do different types of analysis. If you want to find all the different types of analysis supported, the most comprehensive way to do so is by taking a look at compare-mt, which is documented relatively well and should give examples. We do highlight a few particularly useful and common types of analysis below:

Significance Tests

The script allows you to perform statistical significance tests for scores based on bootstrap resampling. You can set the number of samples manually. Here is an example using the example data:

compare-mt example/ted.ref.eng example/ted.sys1.eng example/ted.sys2.eng --compare_scores score_type=bleu,bootstrap=1000,prob_thresh=0.05

One important thing to note is that bootstrap resampling as implemented in compare-mt only tests for variance due to data sampling, approximately answering the question "if I ran the same system on a different, similarly sampled dataset, would I be likely to get the same result?". It does not say anything about whether a system will perform better on another dataset in a different domain, and it does not control for training-time factors such as selection of the random seed, so it cannot say if another training run of the same model would yield the same result.
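
As a rough illustration of the idea (not compare-mt's actual implementation), paired bootstrap resampling can be sketched as follows. The function name paired_bootstrap and the averaging of per-sentence scores are assumptions made for brevity; a corpus-level metric such as BLEU would instead be recomputed on each resampled set.

```python
import random


def paired_bootstrap(scores1, scores2, num_samples=1000, seed=0):
    """Return the fraction of resampled test sets won by system 1, won by system 2, and tied.

    scores1 and scores2 are per-sentence scores for the same sentences, in the same order.
    Hypothetical helper for illustration only.
    """
    assert len(scores1) == len(scores2)
    rng = random.Random(seed)
    n = len(scores1)
    wins1 = wins2 = ties = 0
    for _ in range(num_samples):
        # Resample sentence indices with replacement; both systems share the same sample.
        idx = [rng.randrange(n) for _ in range(n)]
        mean1 = sum(scores1[i] for i in idx) / n
        mean2 = sum(scores2[i] for i in idx) / n
        if mean1 > mean2:
            wins1 += 1
        elif mean2 > mean1:
            wins2 += 1
        else:
            ties += 1
    return wins1 / num_samples, wins2 / num_samples, ties / num_samples


sys1_scores = [0.1, 0.2, 0.3]
sys2_scores = [0.5, 0.6, 0.7]
print(paired_bootstrap(sys1_scores, sys2_scores, num_samples=100))
```

Because only the test sentences are resampled, the result reflects data-sampling variance and nothing else, which is exactly the limitation described above.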

Using Training Set Frequency

One useful piece of analysis is the "word accuracy by frequency" analysis. By default this frequency is the frequency in the test set, but arguably it is more informative to know accuracy by frequency in the training set as this demonstrates the models' robustness to words they haven't seen much, or at all, in the training data. To change the corpus used to calculate word frequency and use the training set (or some other set), you can set the freq_corpus_file option to the appropriate corpus.

compare-mt example/ted.ref.eng example/ted.sys1.eng example/ted.sys2.eng \
        --compare_word_accuracies bucket_type=freq,freq_corpus_file=example/ted.train.eng

In addition, because training sets may be very big, you can also calculate the counts on the file beforehand,

python scripts/count.py < example/ted.train.eng > example/ted.train.counts

and then use these counts directly to improve efficiency.

compare-mt example/ted.ref.eng example/ted.sys1.eng example/ted.sys2.eng \
        --compare_word_accuracies bucket_type=freq,freq_count_file=example/ted.train.counts
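
To show what "word accuracy by training-set frequency" means, here is a minimal sketch that computes a word-level F-measure per frequency bucket. It is an illustration under stated assumptions, not compare-mt's code: the helper names, the bucket cutoffs, and whitespace tokenization are all hypothetical.

```python
from collections import Counter


def count_words(lines):
    """Count whitespace-separated tokens over a corpus (e.g. the training set)."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def bucket_of(word, freq, cutoffs):
    """Index of the first cutoff the word's frequency is below, else len(cutoffs)."""
    f = freq.get(word, 0)
    for i, cutoff in enumerate(cutoffs):
        if f < cutoff:
            return i
    return len(cutoffs)


def fmeasure_by_bucket(refs, outs, freq, cutoffs=(1, 2, 5, 10, 100)):
    """Word F-measure per frequency bucket, matching words within each sentence pair."""
    n = len(cutoffs) + 1
    match, ref_tot, out_tot = [0] * n, [0] * n, [0] * n
    for ref, out in zip(refs, outs):
        ref_c, out_c = Counter(ref.split()), Counter(out.split())
        for w, c in ref_c.items():
            b = bucket_of(w, freq, cutoffs)
            ref_tot[b] += c
            match[b] += min(c, out_c[w])  # clipped match count
        for w, c in out_c.items():
            out_tot[bucket_of(w, freq, cutoffs)] += c
    result = []
    for m, r, o in zip(match, ref_tot, out_tot):
        prec = m / o if o else 0.0
        rec = m / r if r else 0.0
        result.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return result


train = ["the cat", "the dog"]
refs = ["the cat sat"]
outs = ["the dog sat"]
# Buckets with cutoffs (1, 2): unseen in training, seen once, seen twice or more.
print(fmeasure_by_bucket(refs, outs, count_words(train), cutoffs=(1, 2)))
```

Counting on the training set rather than the test set is what lets the first bucket isolate words the model never saw during training.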

Incorporating Word/Sentence Labels

If you're interested in performing aggregate analysis over labels for each word/sentence instead of the words/sentences themselves, it is possible to do so. As an example, we've included POS tags for each of the example outputs. You can use these in aggregate analysis, or n-gram-based analysis. The following gives an example:

compare-mt example/ted.ref.eng example/ted.sys1.eng example/ted.sys2.eng \
    --compare_word_accuracies bucket_type=label,ref_labels=example/ted.ref.eng.tag,out_labels="example/ted.sys1.eng.tag;example/ted.sys2.eng.tag",label_set=CC+DT+IN+JJ+NN+NNP+NNS+PRP+RB+TO+VB+VBP+VBZ \
    --compare_ngrams compare_type=match,ref_labels=example/ted.ref.eng.tag,out_labels="example/ted.sys1.eng.tag;example/ted.sys2.eng.tag"

This will calculate word accuracies and n-gram matches by POS bucket, and allows you to see things like the fact that the phrase-based MT system is better at translating content words such as nouns and verbs, while neural MT is doing better at translating function words.

We also give an example of aggregate analysis when multiple labels per word/sentence are allowed, where each group of labels is a string separated by '+'s:

compare-mt example/multited.ref.jpn example/multited.sys1.jpn example/multited.sys2.jpn \
    --compare_word_accuracies bucket_type=multilabel,ref_labels=example/multited.ref.jpn.tag,out_labels="example/multited.sys1.jpn.tag;example/multited.sys2.jpn.tag",label_set=lexical+formality+pronouns+ellipsis
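
As a small illustration of '+'-separated label groups, the sketch below counts how often each label occurs in one line of tags. It assumes one label group per word, separated by whitespace, which may not match the exact layout of the example .tag files; the function name count_multilabels is hypothetical.

```python
from collections import Counter


def count_multilabels(tag_line):
    """Count labels in a line where each word's labels are joined by '+'.

    Assumed layout: one whitespace-separated label group per word.
    """
    counts = Counter()
    for group in tag_line.split():
        # A word with several labels contributes to every one of its labels.
        counts.update(group.split("+"))
    return counts


print(count_multilabels("lexical+formality pronouns lexical"))
```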

It is also possible to create labels that represent numerical values. For example, scripts/relativepositiontag.py calculates the relative position of words in the sentence, where 0 is the first word in the sentence, 0.5 is the word in the middle, and 1.0 is the last word. These numerical values can then be bucketed. Here is an example:

compare-mt example/ted.ref.eng example/ted.sys1.eng example/ted.sys2.eng \
    --compare_word_accuracies bucket_type=numlabel,ref_labels=example/ted.ref.eng.rptag,out_labels="example/ted.sys1.eng.rptag;example/ted.sys2.eng.rptag"

From this particular analysis we can discover that NMT does worse than PBMT at the end of the sentence, and of course other varieties of numerical labels could be used to measure different properties of words.
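
For intuition, relative position labels of this kind could be computed roughly as follows. This is a sketch based only on the description above, not the contents of scripts/relativepositiontag.py; the function name and the treatment of one-word sentences are assumptions.

```python
def relative_positions(sentence):
    """Relative position of each word: 0.0 for the first word, 1.0 for the last.

    Hypothetical helper; a one-word sentence is assigned position 0.0 (an assumption).
    """
    words = sentence.split()
    if len(words) <= 1:
        return [0.0] * len(words)
    return [i / (len(words) - 1) for i in range(len(words))]


print(relative_positions("a b c"))
```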

You can also perform analysis over labels for sentences. Here is an example:

compare-mt example/ted.ref.eng example/ted.sys1.eng example/ted.sys2.eng \
    --compare_sentence_buckets 'bucket_type=label,out_labels=example/ted.sys1.eng.senttag;example/ted.sys2.eng.senttag,label_set=0+10+20+30+40+50+60+70+80+90+100,statistic_type=score,score_measure=bleu'

Analyzing Source Words

If you have a source corpus that is aligned to the target, you can also analyze accuracies according to features of the source language words, which would allow you to examine whether, for example, infrequent words on the source side are hard to output properly. Here is an example using the example data:

compare-mt example/ted.ref.eng example/ted.sys1.eng example/ted.sys2.eng --src_file example/ted.orig.slk --compare_src_word_accuracies ref_align_file=example/ted.ref.align

Analyzing Word Likelihoods

If you wish to analyze the word log likelihoods assigned by two systems on the target corpus, you can use the following:

compare-ll --ref example/ll_test.txt --ll-files example/ll_test.sys1.likelihood example/ll_test.sys2.likelihood --compare-word-likelihoods bucket_type=freq,freq_corpus_file=example/ll_test.txt

You can analyze the word log likelihoods over labels for each word instead of the words themselves:

compare-ll --ref example/ll_test.txt --ll-files example/ll_test.sys1.likelihood example/ll_test.sys2.likelihood --compare-word-likelihoods bucket_type=label,label_corpus=example/ll_test.tag,label_set=CC+DT+IN+JJ+NN+NNP+NNS+PRP+RB+TO+VB+VBP+VBZ

NOTE: You can also use the above to analyze the word likelihoods produced by two language models.

Analyzing Other Language Generation Systems

You can also analyze other language generation systems using the script. Here is an example of comparing two text summarization systems.

compare-mt example/sum.ref.eng example/sum.sys1.eng example/sum.sys2.eng --compare_scores 'score_type=rouge1' 'score_type=rouge2' 'score_type=rougeL'

Evaluating on COMET

It is possible to use COMET as a metric. To do so, you need to install it first by running

pip install unbabel-comet

Then, to run it, pass the source file and select the appropriate score type. Here is an example:

compare-mt example/ted.ref.eng example/ted.sys1.eng example/ted.sys2.eng --src_file example/ted.orig.slk \
  --compare_scores score_type=comet \
  --compare_sentence_buckets bucket_type=score,score_measure=sentcomet

Note that COMET runs on top of XLM-R, so it's highly recommended you use a GPU with it.

Citation/References

If you use compare-mt, we'd appreciate it if you cite the paper about it!

@article{DBLP:journals/corr/abs-1903-07926,
  author    = {Graham Neubig and Zi{-}Yi Dou and Junjie Hu and Paul Michel and Danish Pruthi and Xinyi Wang and John Wieting},
  title     = {compare-mt: {A} Tool for Holistic Comparison of Language Generation Systems},
  journal   = {CoRR},
  volume    = {abs/1903.07926},
  year      = {2019},
  url       = {http://arxiv.org/abs/1903.07926},
}

There is an extensive literature review included in the paper above, but some key papers that it borrows ideas from are below:

There is also other good software for automatic comparison or error analysis of MT systems:

  • MT-ComparEval: Very nice for visualization of individual examples, but not as focused on aggregate analysis as compare-mt. Also has more software dependencies and requires using a web browser, while compare-mt can be used as a command-line tool.
