  • Stars: 450
  • Rank 97,143 (Top 2 %)
  • Language: Python
  • License: MIT License
  • Created over 3 years ago
  • Updated 2 months ago

Repository Details

Library for translating between 200 languages. Built on 🤗 transformers.

DL Translate

A deep learning-based translation library built on Huggingface transformers

💻 GitHub Repository
📚 Documentation
🐍 PyPi project
🧪 Colab Demo / Kaggle Demo

Quickstart

Install the library with pip:

pip install dl-translate

To translate some text:

import dl_translate as dlt

mt = dlt.TranslationModel()  # Slow when you load it for the first time

text_hi = "संयुक्त राष्ट्र के प्रमुख का कहना है कि सीरिया में कोई सैन्य समाधान नहीं है"
mt.translate(text_hi, source=dlt.lang.HINDI, target=dlt.lang.ENGLISH)

Above, you can see that dlt.lang contains variables representing each of the available languages, with auto-complete support. Alternatively, you can specify the language name (e.g. "Arabic") or the language code (e.g. "fr" for French):

text_ar = "الأمين العام للأمم المتحدة يقول إنه لا يوجد حل عسكري في سوريا."
mt.translate(text_ar, source="Arabic", target="fr")

If you want to verify whether a language is available, you can check it:

print(mt.available_languages())  # All languages that you can use
print(mt.available_codes())  # Code corresponding to each language accepted
print(mt.get_lang_code_map())  # Dictionary of lang -> code
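
If language names come from user input, it can help to normalize them before calling mt.translate. The following is a minimal sketch, assuming a dictionary shaped like the one returned by mt.get_lang_code_map() (language name -> code). The helper name resolve_language and the three-entry sample map are illustrative and not part of dl-translate.

```python
def resolve_language(query, lang_code_map):
    """Return the code for `query`, which may be a language name or a code.

    `lang_code_map` maps language names to codes, e.g. {"French": "fr"}.
    Matching is case-insensitive; raises ValueError if nothing matches.
    """
    needle = query.strip().lower()
    for name, code in lang_code_map.items():
        if needle in (name.lower(), code.lower()):
            return code
    raise ValueError(f"Unsupported language: {query!r}")


# Hypothetical sample map; in practice, pass mt.get_lang_code_map() instead.
sample_map = {"Arabic": "ar", "French": "fr", "Hindi": "hi"}
print(resolve_language("french", sample_map))  # fr
print(resolve_language("AR", sample_map))      # ar
```

Failing early with a ValueError keeps a typo in a language name from surfacing later as a harder-to-read error during translation.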

Usage

Selecting a device

When you load the model, you can specify the device:

mt = dlt.TranslationModel(device="auto")

By default, the value will be device="auto", which means it will use a GPU if possible. You can also explicitly set device="cpu" or device="gpu", or some other string accepted by torch.device(). In general, it is recommended to use a GPU if you want a reasonable processing time.
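
As a rough illustration of what "auto" means, the selection logic can be sketched as below. This is an assumption about the behavior described above, not the library's actual implementation: resolve_device and its cuda_available parameter are hypothetical names, the mapping of "gpu" to "cuda" is a guess, and torch is only imported if it is installed.

```python
def resolve_device(device="auto", cuda_available=None):
    """Map a user-facing device string to one that torch.device() accepts.

    cuda_available can be passed explicitly (useful for testing); when it is
    None, torch is queried if it is installed.
    """
    if cuda_available is None:
        try:
            import torch
            cuda_available = torch.cuda.is_available()
        except ImportError:
            cuda_available = False

    if device == "auto":
        # Prefer the GPU when one is usable, otherwise fall back to the CPU.
        return "cuda" if cuda_available else "cpu"
    if device == "gpu":
        # Assumed alias: torch itself spells this "cuda".
        return "cuda"
    # Anything else (e.g. "cpu", "cuda:1") is passed through unchanged.
    return device


print(resolve_device("auto", cuda_available=False))  # cpu
```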

Choosing a different model

By default, the m2m100 model will be used. However, there are a few options:

  • mBART-50 Large: Allows translations across 50 languages.
  • m2m100: Allows translations across 100 languages.
  • nllb-200 (New in v0.3): Allows translations across 200 languages and is faster than m2m100 (on an RTX A6000, we see a 3x speedup).

Here's an example:

# The default model
mt = dlt.TranslationModel("m2m100")  # Shorthand
mt = dlt.TranslationModel("facebook/m2m100_418M")  # Huggingface repo

# If you want to use mBART-50 Large
mt = dlt.TranslationModel("mbart50")
mt = dlt.TranslationModel("facebook/mbart-large-50-many-to-many-mmt")

# Or NLLB-200 (faster and has 200 languages)
mt = dlt.TranslationModel("nllb200")
mt = dlt.TranslationModel("facebook/nllb-200-distilled-600M")

Note that the language code will change depending on the model family. To find out the correct language codes, please read the doc page on available languages or run mt.available_codes().

By default, dlt.TranslationModel will download the model from the huggingface repo for mbart50, m2m100, or nllb200 and cache it. It's possible to load the model from a path or a model with a similar format, but you will need to specify the model_family:

mt = dlt.TranslationModel("/path/to/model/directory/", model_family="mbart50")
mt = dlt.TranslationModel("facebook/m2m100_1.2B", model_family="m2m100")
mt = dlt.TranslationModel("facebook/nllb-200-distilled-600M", model_family="nllb200")

Notes:

  • Make sure your tokenizer is also stored in the same directory if you load from a file.
  • The available languages will change if you select a different model, so you will not be able to leverage dlt.lang or dlt.utils.

Splitting into sentences

Translating extremely long texts in a single call is not recommended, as it takes more time to process. Instead, you can break them down into sentences with the help of nltk. First install the library with pip install nltk, then run:

import nltk

nltk.download("punkt")

text = "Mr. Smith went to his favorite cafe. There, he met his friend Dr. Doe."
sents = nltk.tokenize.sent_tokenize(text, "english")  # don't use dlt.lang.ENGLISH
" ".join(mt.translate(sents, source=dlt.lang.ENGLISH, target=dlt.lang.FRENCH))
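
Joining with a single space flattens paragraph breaks. If the input has several paragraphs, one option is to split and translate each paragraph separately. The sketch below assumes paragraphs are separated by a blank line; translate_paragraphs and the two stand-in functions are hypothetical names used so the example runs without nltk or a model download.

```python
def translate_paragraphs(text, split_sentences, translate_batch):
    """Translate `text` sentence by sentence while keeping paragraph breaks.

    split_sentences: callable, str -> list[str] (e.g. nltk's sent_tokenize)
    translate_batch: callable, list[str] -> list[str]
    """
    translated = []
    for paragraph in text.split("\n\n"):
        sentences = split_sentences(paragraph)
        if not sentences:
            # Keep empty paragraphs so the layout is preserved.
            translated.append("")
            continue
        translated.append(" ".join(translate_batch(sentences)))
    return "\n\n".join(translated)


# Stand-ins for the sentence splitter and the translation model.
def fake_split(paragraph):
    return [s.strip() + "." for s in paragraph.split(".") if s.strip()]


def fake_translate(sentences):
    return [s.upper() for s in sentences]


print(translate_paragraphs("One. Two.\n\nThree.", fake_split, fake_translate))
```

With the real objects, split_sentences could wrap nltk.tokenize.sent_tokenize(paragraph, "english") and translate_batch could wrap mt.translate(sentences, source=dlt.lang.ENGLISH, target=dlt.lang.FRENCH).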

Batch size during translation

It's possible to set a batch size (i.e. the number of elements processed at once) for mt.translate and whether you want to see the progress bar or not:

# ...
mt = dlt.TranslationModel()
mt.translate(text, source, target, batch_size=32, verbose=True)

If you set batch_size=None, it will process the entire input at once rather than splitting it into "chunks". We recommend lowering batch_size if you do not have a lot of RAM or VRAM and run into a CUDA out-of-memory error. Set a higher value if you are using a high-end GPU and the VRAM is not fully utilized.
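
To make the effect of batch_size concrete, here is a minimal sketch of the chunking idea in plain Python. It illustrates the concept only and is not the library's internal code; iter_batches is a hypothetical helper.

```python
def iter_batches(items, batch_size=None):
    """Yield consecutive slices of `items` with at most `batch_size` elements.

    batch_size=None yields everything in a single batch.
    """
    if batch_size is None:
        yield list(items)
        return
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer or None")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


sentences = [f"sentence {i}" for i in range(5)]
print([len(b) for b in iter_batches(sentences, batch_size=2)])     # [2, 2, 1]
print([len(b) for b in iter_batches(sentences, batch_size=None)])  # [5]
```

Smaller batches mean more passes through the model but less memory held at once, which is why lowering batch_size helps with memory errors.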

dlt.utils module

An alternative to mt.available_languages() is the dlt.utils module. You can use it to find out which languages and codes are available:

print(dlt.utils.available_languages('mbart50'))  # All languages that you can use
print(dlt.utils.available_codes('m2m100'))  # Code corresponding to each language accepted
print(dlt.utils.get_lang_code_map('nllb200'))  # Dictionary of lang -> code

Offline usage

Unlike the Google Translate or Microsoft Translator APIs, this library can be used fully offline. However, you will first need to download the packages and models, then move them to your offline environment to be installed and loaded inside a venv.

First, run in your terminal:

mkdir dlt
cd dlt
mkdir libraries
pip download -d libraries/ dl-translate

Once all the required packages are downloaded, you will need to use huggingface hub to download the files. Install it with pip install huggingface-hub. Then, run inside Python:

import shutil
import huggingface_hub as hub

dirname = hub.snapshot_download("facebook/m2m100_418M")
shutil.copytree(dirname, "cached_model_m2m100")  # Copy to a permanent folder

Now, move everything in the dlt directory to your offline environment. Create a virtual environment and run the following in terminal:

pip install --no-index --find-links libraries/ dl-translate

Now, run inside Python:

import dl_translate as dlt

mt = dlt.TranslationModel("cached_model_m2m100", model_family="m2m100")
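
Before loading in the offline environment, a quick sanity check on the copied folder can catch an incomplete transfer. The sketch below only looks for config.json, which Huggingface model directories contain. It does not verify weights or tokenizer files, and check_model_dir is a hypothetical helper rather than part of dl-translate.

```python
from pathlib import Path


def check_model_dir(path):
    """Return the directory as a Path if it looks like a saved model, else raise.

    Only the presence of config.json is checked; weights and tokenizer files
    are not validated.
    """
    model_dir = Path(path)
    if not model_dir.is_dir():
        raise FileNotFoundError(f"Model directory not found: {model_dir}")
    if not (model_dir / "config.json").is_file():
        raise FileNotFoundError(f"No config.json in {model_dir}")
    return model_dir
```

For example, calling check_model_dir("cached_model_m2m100") before constructing dlt.TranslationModel gives a clearer error message if the folder was not copied.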

Advanced

If you have knowledge of PyTorch and Huggingface Transformers, you can access advanced aspects of the library for more customization:

  • Saving and loading: If you wish to reduce the loading time of the translation model, you can use save_obj and reload it later with load_obj. This method is only recommended if you are familiar with huggingface and torch; please read the docs for more information.
  • Interacting with the underlying model and tokenizer: When initializing the model, you can pass in arguments for the underlying BART model and tokenizer with model_options and tokenizer_options respectively. You can also access the underlying transformers model with mt.get_transformers_model().
  • Keyword arguments for the generate() method: When running mt.translate, you can also pass generation_options, which are forwarded to the generate() method of the underlying transformer model.

For more information, please visit the advanced section of the user guide.
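
As a small illustration of the last point, the sketch below forwards a beam-search setting through generation_options. It assumes generation_options is given as a dictionary of keyword arguments; num_beams is a standard argument of generate() in Huggingface Transformers. translate_with_beams and EchoModel are hypothetical names, and EchoModel is only a stand-in so the example runs without downloading a model.

```python
def translate_with_beams(mt, text, source, target, num_beams=5):
    """Call mt.translate with beam search enabled via generation_options."""
    return mt.translate(
        text,
        source=source,
        target=target,
        generation_options={"num_beams": num_beams},
    )


class EchoModel:
    """Stand-in for dlt.TranslationModel so the sketch runs offline."""

    def translate(self, text, source, target, generation_options=None):
        # Record the options instead of running a real model.
        self.last_options = generation_options
        return text


stub = EchoModel()
translate_with_beams(stub, "hello", "English", "French", num_beams=4)
print(stub.last_options)  # {'num_beams': 4}
```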

Acknowledgement

dl-translate is built on top of Huggingface's implementation of three models created by Facebook AI Research.

  1. The multilingual BART finetuned on many-to-many translation of over 50 languages, which is documented here. The original paper was written by Tang et al. from Facebook AI Research; you can find it here and cite it using the following:

    @article{tang2020multilingual,
        title={Multilingual translation with extensible multilingual pretraining and finetuning},
        author={Tang, Yuqing and Tran, Chau and Li, Xian and Chen, Peng-Jen and Goyal, Naman and Chaudhary, Vishrav and Gu, Jiatao and Fan, Angela},
        journal={arXiv preprint arXiv:2008.00401},
        year={2020}
    }
    
  2. The transformer model published in Beyond English-Centric Multilingual Machine Translation by Fan et al., which supports over 100 languages. You can cite it here:

    @misc{fan2020englishcentric,
         title={Beyond English-Centric Multilingual Machine Translation}, 
         author={Angela Fan and Shruti Bhosale and Holger Schwenk and Zhiyi Ma and Ahmed El-Kishky and Siddharth Goyal and Mandeep Baines and Onur Celebi and Guillaume Wenzek and Vishrav Chaudhary and Naman Goyal and Tom Birch and Vitaliy Liptchinsky and Sergey Edunov and Edouard Grave and Michael Auli and Armand Joulin},
         year={2020},
         eprint={2010.11125},
         archivePrefix={arXiv},
         primaryClass={cs.CL}
     }
    
  3. The No Language Left Behind (NLLB) model, which extends NMT to 200+ languages. You can cite it here:

    @misc{nllbteam2022language,
        title={No Language Left Behind: Scaling Human-Centered Machine Translation}, 
        author={NLLB Team and Marta R. Costa-jussà and James Cross and Onur Çelebi and Maha Elbayad and Kenneth Heafield and Kevin Heffernan and Elahe Kalbassi and Janice Lam and Daniel Licht and Jean Maillard and Anna Sun and Skyler Wang and Guillaume Wenzek and Al Youngblood and Bapi Akula and Loic Barrault and Gabriel Mejia Gonzalez and Prangthip Hansanti and John Hoffman and Semarley Jarrett and Kaushik Ram Sadagopan and Dirk Rowe and Shannon Spruit and Chau Tran and Pierre Andrews and Necip Fazil Ayan and Shruti Bhosale and Sergey Edunov and Angela Fan and Cynthia Gao and Vedanuj Goswami and Francisco Guzmán and Philipp Koehn and Alexandre Mourachko and Christophe Ropers and Safiyyah Saleem and Holger Schwenk and Jeff Wang},
        year={2022},
        eprint={2207.04672},
        archivePrefix={arXiv},
        primaryClass={cs.CL}
    }
    

dlt is a wrapper with useful utils to save you time. For huggingface's transformers, the following snippet is shown as an example:

from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

article_hi = "संयुक्त राष्ट्र के प्रमुख का कहना है कि सीरिया में कोई सैन्य समाधान नहीं है"
article_ar = "الأمين العام للأمم المتحدة يقول إنه لا يوجد حل عسكري في سوريا."

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")

# translate Hindi to French
tokenizer.src_lang = "hi_IN"
encoded_hi = tokenizer(article_hi, return_tensors="pt")
generated_tokens = model.generate(**encoded_hi, forced_bos_token_id=tokenizer.lang_code_to_id["fr_XX"])
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# => "Le chef de l 'ONU affirme qu 'il n 'y a pas de solution militaire en Syria."

# translate Arabic to English
tokenizer.src_lang = "ar_AR"
encoded_ar = tokenizer(article_ar, return_tensors="pt")
generated_tokens = model.generate(**encoded_ar, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"])
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# => "The Secretary-General of the United Nations says there is no military solution in Syria."

With dlt, you can run:

import dl_translate as dlt

article_hi = "संयुक्त राष्ट्र के प्रमुख का कहना है कि सीरिया में कोई सैन्य समाधान नहीं है"
article_ar = "الأمين العام للأمم المتحدة يقول إنه لا يوجد حل عسكري في سوريا."

mt = dlt.TranslationModel()
translated_fr = mt.translate(article_hi, source=dlt.lang.HINDI, target=dlt.lang.FRENCH)
translated_en = mt.translate(article_ar, source=dlt.lang.ARABIC, target=dlt.lang.ENGLISH)

Notice you don't have to think about tokenizers, conditional generation, pretrained models, and regional codes; you can just tell the model what to translate!

If you are experienced with huggingface's ecosystem, then you should be familiar enough with the example above that you wouldn't need this library. However, if you've never heard of huggingface or mBART, then I hope using this library will give you enough motivation to learn more about them :)
