  • Stars: 499
  • Rank: 88,341 (Top 2 %)
  • Language: Python
  • License: Apache License 2.0
  • Created over 4 years ago
  • Updated 4 months ago

Repository Details

A Neural Framework for MT Evaluation




Quick Installation

COMET requires Python 3.8 or above!

Simple installation from PyPI

pip install --upgrade pip  # ensures that pip is current 
pip install unbabel-comet

To develop locally, run the following commands:

git clone https://github.com/Unbabel/COMET
cd COMET
pip install poetry
poetry install

For development, you can run the CLI tools directly, e.g.,

PYTHONPATH=. ./comet/cli/score.py

Scoring MT outputs:

CLI Usage:

Test examples:

echo -e "Dem Feuer konnte Einhalt geboten werden\nSchulen und Kindergärten wurden eröffnet." >> src.de
echo -e "The fire could be stopped\nSchools and kindergartens were open" >> hyp1.en
echo -e "The fire could have been stopped\nSchools and pre-school were open" >> hyp2.en
echo -e "They were able to control the fire.\nSchools and kindergartens opened" >> ref.en

Basic scoring command:

comet-score -s src.de -t hyp1.en -r ref.en

You can set the number of GPUs using --gpus (use 0 to test on CPU).

Scoring multiple systems:

comet-score -s src.de -t hyp1.en hyp2.en -r ref.en

WMT test sets via SacreBLEU:

comet-score -d wmt22:en-de -t PATH/TO/TRANSLATIONS

If you are only interested in a system-level score use the following command:

comet-score -s src.de -t hyp1.en -r ref.en --quiet --only_system

Reference-free evaluation:

comet-score -s src.de -t hyp1.en --model Unbabel/wmt20-comet-qe-da

Note: We are currently working on licensing and releasing Unbabel/wmt22-cometkiwi-da; in the meantime, that model is not available.

Comparing multiple systems:

When comparing multiple MT systems, we encourage you to run the comet-compare command to get statistical significance with a paired t-test and bootstrap resampling (Koehn, 2004).

comet-compare -s src.de -t hyp1.en hyp2.en hyp3.en -r ref.en

Minimum Bayes Risk Decoding:

The MBR command allows you to rank translations and select the best one according to COMET metrics. For more details you can read our paper on Quality-Aware Decoding for Neural Machine Translation.

comet-mbr -s [SOURCE].txt -t [MT_SAMPLES].txt --num_sample [X] -o [OUTPUT_FILE].txt

If working with a very large candidate list, you can use the --rerank_top_k flag to prune it to the top-k most promising candidates according to a reference-free metric.

Example for a candidate list of 1000 samples:

comet-mbr -s [SOURCE].txt -t [MT_SAMPLES].txt -o [OUTPUT_FILE].txt --num_sample 1000 --rerank_top_k 100 --gpus 4 --qe_model Unbabel/wmt20-comet-qe-da
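
The selection step behind MBR decoding can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the comet-mbr implementation: `overlap_f1` is a toy token-overlap utility standing in for a COMET model, and the names `mbr_select` and `prune_top_k` are hypothetical. Each candidate is scored against all other candidates, which act as pseudo-references, and the candidate with the highest average utility is selected. The pruning helper mirrors the idea of --rerank_top_k by keeping only the candidates that a reference-free score ranks highest.

```python
from collections import Counter
from typing import Callable, List, Sequence


def overlap_f1(hyp: str, ref: str) -> float:
    """Toy utility: token-overlap F1 (an assumption standing in for a COMET model)."""
    hyp_tokens, ref_tokens = hyp.lower().split(), ref.lower().split()
    if not hyp_tokens or not ref_tokens:
        return 0.0
    common = sum((Counter(hyp_tokens) & Counter(ref_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(hyp_tokens)
    recall = common / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def prune_top_k(candidates: Sequence[str], qe_scores: Sequence[float], k: int) -> List[str]:
    """Keep the k candidates with the highest reference-free scores, in original order."""
    ranked = sorted(range(len(candidates)), key=lambda i: (-qe_scores[i], i))
    return [candidates[i] for i in sorted(ranked[:k])]


def mbr_select(candidates: Sequence[str],
               utility: Callable[[str, str], float] = overlap_f1) -> str:
    """Return the candidate with the highest mean utility against all other candidates."""
    if not candidates:
        raise ValueError("candidates must not be empty")
    if len(candidates) == 1:
        return candidates[0]
    best_idx, best_score = 0, float("-inf")
    for i, hyp in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        score = sum(utility(hyp, ref) for ref in others) / len(others)
        if score > best_score:  # ties keep the earliest candidate
            best_idx, best_score = i, score
    return candidates[best_idx]
```

The pairwise loop costs a number of utility calls that grows quadratically with the number of candidates, which is why pruning a list of 1000 samples down to 100 before the pairwise step can save a large share of the work.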

COMET Models:

To evaluate your translations, we suggest using one of two models:

  • Default model: Unbabel/wmt22-comet-da - This model uses a reference-based regression approach and is built on top of XLM-R. It has been trained on direct assessments from WMT17 to WMT20 and provides scores ranging from 0 to 1, where 1 represents a perfect translation.
  • Upcoming model: Unbabel/wmt22-cometkiwi-da - This reference-free model uses a regression approach and is built on top of InfoXLM. It has been trained on direct assessments from WMT17 to WMT20, as well as direct assessments from the MLQE-PE corpus. Like the default model, it also provides scores ranging from 0 to 1.

For versions prior to 2.0, you can still use Unbabel/wmt20-comet-da, which is the primary metric, and Unbabel/wmt20-comet-qe-da for the respective reference-free version. You can find a list of all other models developed in previous versions on our MODELS page. For more information, please refer to the model licenses.

Interpreting Scores:

When using COMET to evaluate machine translation, it's important to understand how to interpret the scores it produces.

In general, COMET models are trained to predict quality scores for translations. These scores are typically normalized using a z-score transformation to account for individual differences among annotators. While the raw score itself does not have a direct interpretation, it is useful for ranking translations and systems according to their quality.
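
The z-score transformation mentioned above can be illustrated with a short sketch. It assumes the ratings arrive as (annotator, raw score) pairs and uses the population standard deviation; the function name `z_normalize` and both of those choices are assumptions for illustration, not a description of how the WMT data was actually prepared.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Hashable, List, Sequence, Tuple


def z_normalize(ratings: Sequence[Tuple[Hashable, float]]) -> List[float]:
    """Standardize each raw score with the mean and std of its own annotator."""
    by_annotator = defaultdict(list)
    for annotator, score in ratings:
        by_annotator[annotator].append(score)

    # Per-annotator mean and population standard deviation.
    stats = {a: (mean(s), pstdev(s)) for a, s in by_annotator.items()}

    normalized = []
    for annotator, score in ratings:
        mu, sigma = stats[annotator]
        # An annotator who always gives the same score carries no ranking signal.
        normalized.append(0.0 if sigma == 0 else (score - mu) / sigma)
    return normalized
```

A strict annotator who rates between 20 and 40 and a lenient one who rates between 70 and 90 end up on the same scale after this step, which is why the resulting values are useful for ranking but have no direct interpretation on their own.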

However, for the latest COMET models like Unbabel/wmt22-comet-da, we have introduced a new training approach that scales the scores between 0 and 1. This makes it easier to interpret the scores: a score close to 1 indicates a high-quality translation, while a score close to 0 indicates a translation that is no better than random chance.

It's worth noting that when using COMET to compare the performance of two different translation systems, it's important to run the comet-compare command to obtain statistical significance measures. This command compares the output of two systems using a statistical hypothesis test, providing an estimate of the probability that the observed difference in scores between the systems is due to chance. This is an important step to ensure that any differences in scores between systems are statistically significant.
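
To make the idea of bootstrap resampling concrete, here is a minimal sketch of a paired bootstrap over segment-level scores. It is not the comet-compare implementation: the function name `paired_bootstrap`, the default of 1000 resamples, and the choice to report the fraction of resamples won by system A are all assumptions made for illustration.

```python
import random
from typing import Sequence


def paired_bootstrap(scores_a: Sequence[float],
                     scores_b: Sequence[float],
                     num_resamples: int = 1000,
                     seed: int = 0) -> float:
    """Return the fraction of resampled test sets on which system A beats system B."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("both systems need the same non-zero number of segment scores")
    rng = random.Random(seed)
    n = len(scores_a)
    wins_a = 0
    for _ in range(num_resamples):
        # Draw the same segment indices for both systems (with replacement).
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_a > mean_b:
            wins_a += 1
    return wins_a / num_resamples
```

A win fraction close to 1 suggests that A's advantage holds up across resampled test sets, while a value near 0.5 suggests that the observed difference could easily be due to chance.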

Overall, the added interpretability of scores in the latest COMET models, combined with the ability to assess statistical significance between systems using comet-compare, makes COMET a valuable tool for evaluating machine translation.

Languages Covered:

All the above-mentioned models are built on top of XLM-R, which covers the following languages:

Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Basque, Belarusian, Bengali, Bengali Romanized, Bosnian, Breton, Bulgarian, Burmese, Catalan, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Hausa, Hebrew, Hindi, Hindi Romanized, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish (Kurmanji), Kyrgyz, Lao, Latin, Latvian, Lithuanian, Macedonian, Malagasy, Malay, Malayalam, Marathi, Mongolian, Nepali, Norwegian, Oriya, Oromo, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Sanskrit, Scottish Gaelic, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Sundanese, Swahili, Swedish, Tamil, Tamil Romanized, Telugu, Telugu Romanized, Thai, Turkish, Ukrainian, Urdu, Urdu Romanized, Uyghur, Uzbek, Vietnamese, Welsh, Western Frisian, Xhosa, Yiddish.

Thus, results for language pairs containing uncovered languages are unreliable!

Scoring within Python:

from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)
data = [
    {
        "src": "Dem Feuer konnte Einhalt geboten werden",
        "mt": "The fire could be stopped",
        "ref": "They were able to control the fire."
    },
    {
        "src": "Schulen und Kindergärten wurden eröffnet.",
        "mt": "Schools and kindergartens were open",
        "ref": "Schools and kindergartens opened"
    }
]
model_output = model.predict(data, batch_size=8, gpus=1)
print(model_output)
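
When the segments live in line-aligned text files such as src.de, hyp1.en, and ref.en from the CLI examples, the `data` list can be built programmatically. The helper below is a small sketch: the names `read_lines` and `build_samples` are hypothetical and not part of the COMET package. It only produces dictionaries with the "src", "mt", and "ref" keys used in the example above.

```python
from pathlib import Path
from typing import Dict, List, Optional


def read_lines(path: str) -> List[str]:
    """Read a UTF-8 text file into a list of lines without trailing newlines."""
    return Path(path).read_text(encoding="utf-8").splitlines()


def build_samples(src_path: str, mt_path: str,
                  ref_path: Optional[str] = None) -> List[Dict[str, str]]:
    """Pair up line-aligned files into dicts with 'src', 'mt' and optional 'ref' keys."""
    sources = read_lines(src_path)
    translations = read_lines(mt_path)
    if len(sources) != len(translations):
        raise ValueError("source and translation files differ in length")
    samples = [{"src": s, "mt": m} for s, m in zip(sources, translations)]
    if ref_path is not None:
        references = read_lines(ref_path)
        if len(references) != len(sources):
            raise ValueError("reference file differs in length from the source file")
        for sample, ref in zip(samples, references):
            sample["ref"] = ref
    return samples
```

For example, `data = build_samples("src.de", "hyp1.en", "ref.en")` yields a list that can be passed to `model.predict` as shown above. Leaving out `ref_path` produces samples without a "ref" key, which is the shape one would expect a reference-free model to need.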

Train your own Metric:

Instead of using pretrained models, you can train your own model with the following command:

comet-train --cfg configs/models/{your_model_config}.yaml

You can then use your own metric to score:

comet-score -s src.de -t hyp1.en -r ref.en --model PATH/TO/CHECKPOINT

You can also upload your model to the Hugging Face Hub, using Unbabel/wmt22-comet-da as an example. You can then use your model directly from the Hub.

unittest:

In order to run the toolkit tests you must run the following command:

poetry run coverage run --source=comet -m unittest discover
poetry run coverage report -m # Expected coverage 80%

Note: Testing on CPU takes a long time.

Publications

If you use COMET please cite our work and don't forget to say which model you used!
