OPUS-MT

Open neural machine translation models and web services

Tools and resources for open translation services

This repository includes two setups:

  • a Tornado-based web application that provides translations via marian-server (see "Installation of the Tornado-based Web-App" below)
  • a WebSocket-based translation service that runs as a Linux system service (see "Installation of a websocket service on Ubuntu" below)

There are also scripts for training models, but those are currently only useful in the computing environment used by the University of Helsinki and CSC as the IT service provider.

Please cite the following paper if you use OPUS-MT software and models:

@InProceedings{TiedemannThottingal:EAMT2020,
  author = {J{\"o}rg Tiedemann and Santhosh Thottingal},
  title = {{OPUS-MT} -- {B}uilding open translation services for the {W}orld},
  booktitle = {Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (EAMT)},
  year = {2020},
  address = {Lisbon, Portugal}
}

Installation of the Tornado-based Web-App

Download the latest version from GitHub:

git clone https://github.com/Helsinki-NLP/Opus-MT.git

Option 1: Manual setup

Install Marian MT by following the documentation at https://marian-nmt.github.io/docs/. After the installation, marian-server is expected to be on the PATH; if it is not, place it in /usr/local/bin.

Install the Python prerequisites. Using a virtual environment is recommended.

pip install -r requirements.txt

Download the translation models from https://github.com/Helsinki-NLP/Opus-MT-train/tree/master/models and place them in the models directory.

Then edit services.json to point to those models.

Start the web server:

python server.py

By default, the server uses port 8888. Open http://localhost:8888 in your browser to get the web interface. The language pairs configured in services.json will be available.
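Beyond the browser interface, the running web app can also be queried over HTTP. The endpoint path and parameter names below (an Apertium-style /api/translate route with q and langpair) are assumptions, not confirmed from this text; verify them against server.py before relying on them. This sketch only builds the request, so it works without a running server:

```python
# Build (but do not send) a translation request for the web app.
# NOTE: the /api/translate path and the q/langpair parameters are
# assumptions; check server.py for the actual route.
from urllib.parse import urlencode
from urllib.request import Request

def build_translate_request(text, src, tgt, host="localhost", port=8888):
    """Compose an HTTP request for translating text from src to tgt."""
    query = urlencode({"q": text, "langpair": f"{src}|{tgt}"})
    return Request(f"http://{host}:{port}/api/translate?{query}")

req = build_translate_request("Hello world", "en", "fi")
print(req.full_url)
# -> http://localhost:8888/api/translate?q=Hello+world&langpair=en%7Cfi
```

Once the server is up, the request can be sent with urllib.request.urlopen(req).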

Option 2: Using Docker

docker-compose up

or

docker build . -t opus-mt
docker run -p 8888:8888 opus-mt:latest

And launch your browser to localhost:8888

Option 2.1: Using Docker with CUDA GPU

docker build -f Dockerfile.gpu . -t opus-mt-gpu
nvidia-docker run -p 8888:8888 opus-mt-gpu:latest

And launch your browser to localhost:8888

Configuration

The server.py program accepts a configuration file in JSON format. By default it tries to use services.json in the current directory, but you can supply a custom one with the -c flag.

An example configuration file looks like this:

{
    "en": {
        "es": {
            "configuration": "./models/en-es/decoder.yml",
            "host": "localhost",
            "port": "10001"
        },
        "fi": {
            "configuration": "./models/en-fi/decoder.yml",
            "host": "localhost",
            "port": "10002"
        }
    }
}

This example configuration can provide MT service for en->es and en->fi language pairs.
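As a sanity check, a short Python snippet can load such a configuration and list the translation directions it defines. The language_pairs helper below is illustrative, not part of server.py:

```python
# Enumerate the translation directions in a services.json-style config
# (the nested source -> target layout shown above).
import json

SERVICES = json.loads("""
{
    "en": {
        "es": {"configuration": "./models/en-es/decoder.yml",
               "host": "localhost", "port": "10001"},
        "fi": {"configuration": "./models/en-fi/decoder.yml",
               "host": "localhost", "port": "10002"}
    }
}
""")

def language_pairs(services):
    """Yield (source, target, (host, port)) for each configured pair."""
    for src, targets in services.items():
        for tgt, opts in targets.items():
            yield src, tgt, (opts["host"], int(opts["port"]))

for src, tgt, (host, port) in language_pairs(SERVICES):
    print(f"{src}->{tgt} served at {host}:{port}")
# -> en->es served at localhost:10001
# -> en->fi served at localhost:10002
```

Loading the file through json.loads also catches syntax errors such as trailing commas before the server starts.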

  • configuration: points to a YAML file containing the decoder configuration usable by marian-server. If this value is not provided, Opus-MT assumes that the service is already running on the host and port given in the other options. If it is provided, a new marian-server subprocess is created.
  • host: the host where the marian-server is running.
  • port: the port marian-server listens on.
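For local entries (those with a configuration key), launching marian-server can be sketched as below. The -c and -p flags correspond to marian-server's config and port options, but treat the exact invocation as an assumption and check the Marian documentation:

```python
# Sketch: compose the command line for a local marian-server instance.
# The -c/-p flags are assumptions based on marian-server's usual options.
import shlex

def marian_command(configuration, port):
    """Build the argv list for launching marian-server on a given port."""
    return ["marian-server", "-c", configuration, "-p", str(port)]

cmd = marian_command("./models/en-fi/decoder.yml", 10002)
print(shlex.join(cmd))
# In server.py terms, cmd would then be handed to subprocess.Popen(cmd).
```

Keeping the command as a list (rather than a shell string) avoids quoting issues when paths contain spaces.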

Installation of a websocket service on Ubuntu

Translation services can also be set up using WebSockets and Linux system services. Detailed information is available in doc/WebSocketServer.md.

Public MT models

We store public models (CC-BY 4.0 License) at https://github.com/Helsinki-NLP/Opus-MT-train/tree/master/models. They should all be compatible with the OPUS-MT services, and you can install them by specifying the language pair. The installation script takes the latest model in that directory. For additional customisation you need to adjust the installation procedures (in the Makefile or elsewhere).

There are also development versions of models, which are often more experimental and of lower quality, but they cover additional language pairs. They can be downloaded from https://github.com/Helsinki-NLP/Opus-MT-train/tree/master/work-spm/models.

Train MT models

There is a Makefile for training new models from OPUS data in the Opus-MT-train repository, but it is heavily customized for the work environment at CSC and the University of Helsinki projects. We hope to make it more generic in the future so that it can also run in other environments and setups.

Known issues

  • Most automatic evaluations are made on simple and short sentences from the Tatoeba data collection; those scores will be too optimistic when running the models on other, more realistic data sets.
  • Some (older) test results are not reliable because they use software localisation data (namely GNOME system messages) with a large overlap with other localisation data (i.e. Ubuntu system messages) that are included in the training data.
  • All current models are trained without filtering, data augmentation (such as back-translation), domain adaptation or other optimisation procedures; there is no quality control besides the automatic evaluation on automatically selected test sets. For some language pairs there are at least also benchmark scores from official WMT test sets.
  • Most models are trained for a maximum of 72 hours on 1 or 4 GPUs; not all of them converged before this time limit.
  • Validation and early stopping are based on automatically selected validation data, often from Tatoeba; this validation data is not representative of many applications.

To-Do and wish list

  • more languages and language pairs
  • better and more multilingual models
  • optimize translation performance
  • add backtranslation data
  • domain-specific models
  • GPU enabled container
  • dockerized fine-tuning
  • document-level models
  • load-balancing and other service optimisations
  • public MT service network
  • feedback loop and personalisation

Links and related work

Acknowledgements

The work is supported by the European Language Grid as pilot project 2866, by the FoTran project, funded by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 771113), and the MeMAD project, funded by the European Union’s Horizon 2020 Research and Innovation Programme under grant agreement No 780069. We are also grateful for the generous computational resources and IT infrastructure provided by CSC -- IT Center for Science, Finland.
