GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators

[Paper] [Data] [Model]

This work proposes a generative paradigm for translation tasks that leverages LLMs to generate higher-quality translations from the N-best hypotheses decoded from a foundation model (e.g., SeamlessM4T-Large-V2). We also release the HypoTranslate dataset to support LLM finetuning, which contains over 592K pairs of N-best hypotheses and ground-truth translations in 11 languages. Experiments show that GenTranslate significantly outperforms the state-of-the-art SeamlessM4T-Large-V2 on various speech and machine translation benchmarks.
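To illustrate the paradigm, the minimal sketch below packs N-best hypotheses into a single instruction prompt for an LLM. The prompt template and function name here are hypothetical for illustration only; the actual template used by GenTranslate lives in lit_gpt/gentrans.py and may differ.

```python
def build_nbest_prompt(hypotheses, src_lang="French", tgt_lang="English"):
    """Pack N-best hypotheses from a foundation model (e.g., SeamlessM4T)
    into one instruction prompt for the LLM to integrate.

    NOTE: hypothetical template for illustration, not the repo's actual one.
    """
    # Number each hypothesis so the LLM can reference and compare them
    numbered = "\n".join(f"{i + 1}. {hyp}" for i, hyp in enumerate(hypotheses))
    return (
        f"Below are the {len(hypotheses)}-best {src_lang}-to-{tgt_lang} "
        f"translation hypotheses from a foundation model:\n"
        f"{numbered}\n"
        f"Please integrate them into one higher-quality translation:"
    )

# Example: 3-best hypotheses for a single utterance
nbest = [
    "The cat sits on the mat.",
    "A cat is sitting on the mat.",
    "The cat sat on the mat.",
]
prompt = build_nbest_prompt(nbest)
print(prompt)
```

The finetuned LLM then generates the final translation conditioned on this prompt, rather than simply reranking the candidates.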

TIP: At this time (before publication), we provide the inference script, test data, and a subset of well-trained models for inference use only. The full resources of this paper, including the training script, the entire HypoTranslate dataset, and all models, will be open-sourced upon publication to benefit the community.

Conda Environment Configuration

Our code is built on lit-gpt; please refer to its official tutorial to set up the conda environment. Then install the required packages with the following command:

pip install -r requirements.txt

Code

  • Model code: lit_gpt/gentrans.py;
  • Inference script: infer.sh;

Models

  • For LLMs, please refer to the lit-gpt tutorial for configuration steps; it supports many mainstream LLMs such as LLaMA-2;
  • For well-trained adapter checkpoints, please refer to our HuggingFace repo.

Dataset

We have released our HypoTranslate dataset at HuggingFace.

Inference Usage

We provide two well-trained models and the corresponding test sets for inference use, i.e., the FLEURS Fr-En and En-Fr ST tasks. Before running inference, please follow the steps below for preparation:

  1. Go to infer.sh:
    • Specify your conda environment <your-conda-env>;
    • Specify the source-target language pair; we provide two example pairs, fr-en and en-fr;
    • Specify the LLM size: 7b for fr-en, 13b for en-fr;
  2. Download and convert the LLaMA-2 pre-trained checkpoint;
  3. Go to inference/gentrans.py:
    • Specify the experiment directory exp_dir: the root path of this README.md file;
    • Specify the data directory data_dir: the absolute path of test data (.pt file);
    • Specify the LLM directory llm_dir: the absolute path of your downloaded LLaMA-2 checkpoint;
    • Specify the adapter directory adapter_dir: the absolute path of our released adapter checkpoint;
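The settings the steps above ask for can be summarized as in the sketch below. The paths are placeholders you must replace with your own absolute paths; the variable names mirror those in infer.sh and inference/gentrans.py, and the dict grouping is ours for illustration.

```python
# Sketch of the settings requested by steps 1-3 (placeholder paths only;
# replace them before running `bash infer.sh`).

# Step 1: language pair and the LLM size it maps to in this release
LLM_SIZE = {"fr-en": "7b", "en-fr": "13b"}
lang_pair = "fr-en"
llm_size = LLM_SIZE[lang_pair]

# Step 3: directories expected by inference/gentrans.py
config = {
    "exp_dir": "/path/to/GenTranslate",         # root path of this README
    "data_dir": "/path/to/test_data.pt",        # test data (.pt file)
    "llm_dir": "/path/to/llama2_checkpoint",    # converted LLaMA-2 checkpoint
    "adapter_dir": "/path/to/adapter_ckpt",     # released adapter checkpoint
}

print(llm_size)  # fr-en uses the 7b model
```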

Now you can run inference on your specified language direction by:

bash infer.sh

You will see the BLEU results of GenTranslate on your specified test set.

References

@article{hu2024gentranslate,
  title={GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators},
  author={Hu, Yuchen and Chen, Chen and Yang, Chao-Han Huck and Li, Ruizhe and Zhang, Dong and Chen, Zhehuai and Chng, Eng Siong},
  journal={arXiv preprint arXiv:2402.06894},
  year={2024}
}
