
Repository Details

A Vietnamese-English Neural Machine Translation System (INTERSPEECH 2022)

Our pre-trained VinAI Translate models are state-of-the-art text-to-text translation models for Vietnamese-to-English and English-to-Vietnamese. The general architecture and experimental results of the pre-trained models can be found in our VinAI Translate system paper:

@inproceedings{vinaitranslate,
  title     = {{A Vietnamese-English Neural Machine Translation System}},
  author    = {Thien Hai Nguyen and
               Tuan-Duy H. Nguyen and
               Duy Phung and
               Duy Tran-Cong Nguyen and
               Hieu Minh Tran and
               Manh Luong and
               Tin Duy Vo and
               Hung Hai Bui and
               Dinh Phung and
               Dat Quoc Nguyen},
  booktitle = {Proceedings of the 23rd Annual Conference of the International Speech Communication Association: Show and Tell (INTERSPEECH)},
  year      = {2022}
}

Please CITE our paper whenever the pre-trained models or the system are used to help produce published results or incorporated into other software.

Pre-trained models

Model                           Max length  License       Note
vinai/vinai-translate-vi2en-v2  1024        GNU AGPL-3.0  Further fine-tuned from vinai/vinai-translate-vi2en on data combined with mTet-v2 (duplicates removed)
vinai/vinai-translate-en2vi-v2  1024        GNU AGPL-3.0  Further fine-tuned from vinai/vinai-translate-en2vi on data combined with mTet-v2 (duplicates removed)
vinai/vinai-translate-vi2en     1024        GNU AGPL-3.0  448M parameters
vinai/vinai-translate-en2vi     1024        GNU AGPL-3.0  448M parameters
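
The 1024-token maximum length applies to the tokenized input, so longer texts should be truncated or split before translation. A minimal sketch of truncating to that limit (standard transformers tokenizer options; the example sentence is a placeholder):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/vinai-translate-vi2en-v2", src_lang="vi_VN")
# Truncate to the 1024-token maximum length listed in the table above.
inputs = tokenizer("Một văn bản tiếng Việt rất dài ...", truncation=True, max_length=1024, return_tensors="pt")
print(inputs.input_ids.shape)  # the sequence dimension is at most 1024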

Evaluation

[Image: evaluation results of VinAI Translate and VietAI envit5-translation on the PhoMT test set]

*: Data leakage issue where ~10K training pairs in mTet-v2 appear in the PhoMT test set.

See HERE for evaluation scripts of VinAI Translate and VietAI envit5-translation.
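
A leakage issue like the one flagged above can be detected by checking for exact train/test overlap. Below is a generic sketch, not part of the released evaluation scripts; the file paths and the tab-separated "source<TAB>target" format are assumptions:

# Hypothetical overlap check between parallel-corpus splits.
def load_pairs(path):
    with open(path, encoding="utf-8") as f:
        return {tuple(line.rstrip("\n").split("\t")[:2]) for line in f if line.strip()}

train_pairs = load_pairs("train.tsv")  # assumed file name and format
test_pairs = load_pairs("test.tsv")    # assumed file name and format
overlap = train_pairs & test_pairs
print(f"{len(overlap)} of {len(test_pairs)} test pairs also appear in the training data")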

Using VinAI Translate in transformers

Vietnamese-to-English translation

GPU-based batch translation example

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer_vi2en = AutoTokenizer.from_pretrained("vinai/vinai-translate-vi2en-v2", src_lang="vi_VN")
model_vi2en = AutoModelForSeq2SeqLM.from_pretrained("vinai/vinai-translate-vi2en-v2")
device_vi2en = torch.device("cuda")
model_vi2en.to(device_vi2en)


def translate_vi2en(vi_texts: list[str]) -> list[str]:
    # Tokenize the batch with padding and move the tensors to the GPU.
    inputs = tokenizer_vi2en(vi_texts, padding=True, return_tensors="pt").to(device_vi2en)
    # Setting the decoder start token forces generation into English.
    output_ids = model_vi2en.generate(
        **inputs,
        decoder_start_token_id=tokenizer_vi2en.lang_code_to_id["en_XX"],
        num_return_sequences=1,
        num_beams=5,
        early_stopping=True
    )
    en_texts = tokenizer_vi2en.batch_decode(output_ids, skip_special_tokens=True)
    return en_texts

# The input may be a batch of multiple sentences; the feasible batch size (8, 16, 32, or more) depends on GPU memory.
vi_texts = ["CΓ΄ cho biαΊΏt: trΖ°α»›c giờ tΓ΄i khΓ΄ng Δ‘αΊΏn phΓ²ng tαΊ­p cΓ΄ng cα»™ng, mΓ  tαΊ­p cΓΉng giΓ‘o viΓͺn Yoga riΓͺng hoαΊ·c tα»± tαΊ­p ở nhΓ .",
            "Khi tαΊ­p thể dα»₯c trong khΓ΄ng gian riΓͺng tΖ°, tΓ΄i thoαΊ£i mΓ‘i dα»… chα»‹u hΖ‘n.",
            "cΓ΄ cho biαΊΏt trΖ°α»›c giờ tΓ΄i khΓ΄ng Δ‘αΊΏn phΓ²ng tαΊ­p cΓ΄ng cα»™ng mΓ  tαΊ­p cΓΉng giΓ‘o viΓͺn yoga riΓͺng hoαΊ·c tα»± tαΊ­p ở nhΓ  khi tαΊ­p thể dα»₯c trong khΓ΄ng gian riΓͺng tΖ° tΓ΄i thoαΊ£i mΓ‘i dα»… chα»‹u hΖ‘n"]
print(translate_vi2en(vi_texts))
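
To translate more sentences than fit in one batch, a simple option is to chunk the input list and reuse translate_vi2en above; the batch size of 16 is only an example value to tune against your GPU memory:

def translate_vi2en_batched(all_texts: list[str], batch_size: int = 16) -> list[str]:
    # Process the list in fixed-size chunks so GPU memory use stays bounded.
    results = []
    for i in range(0, len(all_texts), batch_size):
        results.extend(translate_vi2en(all_texts[i:i + batch_size]))
    return results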

CPU-based sequence translation example

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer_vi2en = AutoTokenizer.from_pretrained("vinai/vinai-translate-vi2en-v2", src_lang="vi_VN")
model_vi2en = AutoModelForSeq2SeqLM.from_pretrained("vinai/vinai-translate-vi2en-v2")

def translate_vi2en(vi_text: str) -> str:
    input_ids = tokenizer_vi2en(vi_text, return_tensors="pt").input_ids
    # Setting the decoder start token forces generation into English.
    output_ids = model_vi2en.generate(
        input_ids,
        decoder_start_token_id=tokenizer_vi2en.lang_code_to_id["en_XX"],
        num_return_sequences=1,
        num_beams=5,
        early_stopping=True
    )
    # batch_decode returns one string per generated sequence; join them into a single text.
    en_text = " ".join(tokenizer_vi2en.batch_decode(output_ids, skip_special_tokens=True))
    return en_text

vi_text = "CΓ΄ cho biαΊΏt: trΖ°α»›c giờ tΓ΄i khΓ΄ng Δ‘αΊΏn phΓ²ng tαΊ­p cΓ΄ng cα»™ng, mΓ  tαΊ­p cΓΉng giΓ‘o viΓͺn Yoga riΓͺng hoαΊ·c tα»± tαΊ­p ở nhΓ . Khi tαΊ­p thể dα»₯c trong khΓ΄ng gian riΓͺng tΖ°, tΓ΄i thoαΊ£i mΓ‘i dα»… chα»‹u hΖ‘n."
print(translate_vi2en(vi_text))

vi_text = "cΓ΄ cho biαΊΏt trΖ°α»›c giờ tΓ΄i khΓ΄ng Δ‘αΊΏn phΓ²ng tαΊ­p cΓ΄ng cα»™ng mΓ  tαΊ­p cΓΉng giΓ‘o viΓͺn yoga riΓͺng hoαΊ·c tα»± tαΊ­p ở nhΓ  khi tαΊ­p thể dα»₯c trong khΓ΄ng gian riΓͺng tΖ° tΓ΄i thoαΊ£i mΓ‘i dα»… chα»‹u hΖ‘n"
print(translate_vi2en(vi_text))
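
Since these examples only run inference, wrapping generation in torch.inference_mode() disables gradient tracking and can save time and memory on CPU. A small variation on the function above (using torch.inference_mode here is our suggestion, not part of the original example):

import torch

def translate_vi2en_no_grad(vi_text: str) -> str:
    input_ids = tokenizer_vi2en(vi_text, return_tensors="pt").input_ids
    with torch.inference_mode():  # no gradients are needed for generation
        output_ids = model_vi2en.generate(
            input_ids,
            decoder_start_token_id=tokenizer_vi2en.lang_code_to_id["en_XX"],
            num_return_sequences=1,
            num_beams=5,
            early_stopping=True
        )
    return " ".join(tokenizer_vi2en.batch_decode(output_ids, skip_special_tokens=True))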

English-to-Vietnamese translation

GPU-based batch translation example

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer_en2vi = AutoTokenizer.from_pretrained("vinai/vinai-translate-en2vi-v2", src_lang="en_XX")
model_en2vi = AutoModelForSeq2SeqLM.from_pretrained("vinai/vinai-translate-en2vi-v2")
device_en2vi = torch.device("cuda")
model_en2vi.to(device_en2vi)


def translate_en2vi(en_texts: list[str]) -> list[str]:
    # Tokenize the batch with padding and move the tensors to the GPU.
    inputs = tokenizer_en2vi(en_texts, padding=True, return_tensors="pt").to(device_en2vi)
    # Setting the decoder start token forces generation into Vietnamese.
    output_ids = model_en2vi.generate(
        **inputs,
        decoder_start_token_id=tokenizer_en2vi.lang_code_to_id["vi_VN"],
        num_return_sequences=1,
        num_beams=5,
        early_stopping=True
    )
    vi_texts = tokenizer_en2vi.batch_decode(output_ids, skip_special_tokens=True)
    return vi_texts

# The input may be a batch of multiple sentences; the feasible batch size (8, 16, 32, or more) depends on GPU memory.
en_texts = ["I haven't been to a public gym before.",
            "When I exercise in a private space, I feel more comfortable.",
            "i haven't been to a public gym before when i exercise in a private space i feel more comfortable"]
print(translate_en2vi(en_texts))

CPU-based sequence translation example

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer_en2vi = AutoTokenizer.from_pretrained("vinai/vinai-translate-en2vi-v2", src_lang="en_XX")
model_en2vi = AutoModelForSeq2SeqLM.from_pretrained("vinai/vinai-translate-en2vi-v2")

def translate_en2vi(en_text: str) -> str:
    input_ids = tokenizer_en2vi(en_text, return_tensors="pt").input_ids
    # Setting the decoder start token forces generation into Vietnamese.
    output_ids = model_en2vi.generate(
        input_ids,
        decoder_start_token_id=tokenizer_en2vi.lang_code_to_id["vi_VN"],
        num_return_sequences=1,
        num_beams=5,
        early_stopping=True
    )
    # batch_decode returns one string per generated sequence; join them into a single text.
    vi_text = " ".join(tokenizer_en2vi.batch_decode(output_ids, skip_special_tokens=True))
    return vi_text

en_text = "I haven't been to a public gym before. When I exercise in a private space, I feel more comfortable."
print(translate_en2vi(en_text))

en_text = "i haven't been to a public gym before when i exercise in a private space i feel more comfortable"
print(translate_en2vi(en_text))
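
A quick sanity check that exercises both directions is a round trip: translate English to Vietnamese and back, then compare. This assumes the two CPU-based functions above are defined in the same session; the round-trip output is expected to paraphrase rather than exactly reproduce the input:

en_text = "When I exercise in a private space, I feel more comfortable."
vi_text = translate_en2vi(en_text)          # English -> Vietnamese
back_translated = translate_vi2en(vi_text)  # Vietnamese -> English
print(vi_text)
print(back_translated)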

More Repositories

1. PhoGPT (Python, 720 stars)
   PhoGPT: Generative Pre-training for Vietnamese (2023)
2. PhoBERT (658 stars)
   PhoBERT: Pre-trained language models for Vietnamese (EMNLP-2020 Findings)
3. BERTweet (Python, 573 stars)
   BERTweet: A pre-trained language model for English Tweets (EMNLP-2020)
4. WaveDiff (Python, 372 stars)
   Official PyTorch implementation of the paper: Wavelet Diffusion Models are fast and scalable Image Generators (CVPR'23)
5. CPM (Python, 364 stars)
   💄 Lipstick ain't enough: Beyond Color-Matching for In-the-Wild Makeup Transfer (CVPR 2021)
6. XPhoneBERT (Python, 292 stars)
   XPhoneBERT: A Pre-trained Multilingual Model for Phoneme Representations for Text-to-Speech (INTERSPEECH 2023)
7. Anti-DreamBooth (Python, 205 stars)
   Anti-DreamBooth: Protecting users from personalized text-to-image synthesis (ICCV 2023)
8. LFM (Python, 184 stars)
   Official PyTorch implementation of the paper: Flow Matching in Latent Space
9. blur-kernel-space-exploring (Python, 137 stars)
   Exploring Image Deblurring via Blur Kernel Space (CVPR'21)
10. PhoNLP (Python, 137 stars)
    PhoNLP: A BERT-based multi-task learning model for part-of-speech tagging, named entity recognition and dependency parsing (NAACL 2021)
11. dict-guided (Python, 126 stars)
    Dictionary-guided Scene Text Recognition (CVPR-2021)
12. MagNet (Python, 114 stars)
    Progressive Semantic Segmentation (CVPR-2021)
13. Warping-based_Backdoor_Attack-release (Python, 111 stars)
    WaNet - Imperceptible Warping-based Backdoor Attack (ICLR 2021)
14. HyperInverter (Python, 111 stars)
    HyperInverter: Improving StyleGAN Inversion via Hypernetwork (CVPR 2022)
15. BARTpho (99 stars)
    BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese (INTERSPEECH 2022)
16. PhoWhisper (96 stars)
    PhoWhisper: Automatic Speech Recognition for Vietnamese (2024)
17. ISBNet (Python, 93 stars)
    ISBNet: a 3D Point Cloud Instance Segmentation Network with Instance-aware Sampling and Box-aware Dynamic Convolution (CVPR 2023)
18. Dataset-Diffusion (Jupyter Notebook, 87 stars)
    Dataset Diffusion: Diffusion-based Synthetic Data Generation for Pixel-Level Semantic Segmentation (NeurIPS 2023)
19. JointIDSF (Python, 84 stars)
    BERT-based joint intent detection and slot filling with intent-slot attention mechanism (INTERSPEECH 2021)
20. 3D-UCaps (Python, 65 stars)
    3D-UCaps: 3D Capsules Unet for Volumetric Image Segmentation (MICCAI 2021)
21. PhoNER_COVID19 (63 stars)
    COVID-19 Named Entity Recognition for Vietnamese (NAACL 2021)
22. PCC-pytorch (Python, 59 stars)
    A PyTorch implementation of the paper "Prediction, Consistency, Curvature: Representation Learning for Locally-Linear Control"
23. Counting-DETR (Python, 56 stars)
    Few-shot Object Counting and Detection (ECCV 2022)
24. PSENet-Image-Enhancement (Python, 54 stars)
    PSENet: Progressive Self-Enhancement Network for Unsupervised Extreme-Light Image Enhancement (WACV 2023)
25. LeMul (Python, 51 stars)
    Toward Realistic Single-View 3D Object Reconstruction with Unsupervised Learning from Multiple Images (ICCV 2021)
26. DSW (Python, 47 stars)
    Distributional Sliced-Wasserstein distance code
27. PhoMT (40 stars)
    PhoMT: A High-Quality and Large-Scale Benchmark Dataset for Vietnamese-English Machine Translation (EMNLP 2021)
28. single_image_hdr (Python, 38 stars)
    Single-Image HDR Reconstruction by Multi-Exposure Generation (WACV 2023)
29. SwiftBrush (Python, 37 stars)
    SwiftBrush: One-Step Text-to-Image Diffusion Model with Variational Score Distillation (CVPR 2024)
30. tise-toolbox (Python, 33 stars)
    TISE: Bag of Metrics for Text-to-Image Synthesis Evaluation (ECCV 2022)
31. Point-Unet (Python, 32 stars)
    Point-Unet: A Context-aware Point-based Neural Network for Volumetric Segmentation (MICCAI 2021)
32. COVID19Tweet (Python, 30 stars)
    WNUT-2020 Task 2: Identification of informative COVID-19 English Tweets
33. CREPS (Python, 30 stars)
    Efficient Scale-Invariant Generator with Column-Row Entangled Pixel Synthesis (CVPR 2023)
34. ViText2SQL (28 stars)
    ViText2SQL: A dataset for Vietnamese Text-to-SQL semantic parsing (EMNLP-2020 Findings)
35. input-aware-backdoor-attack-release (Python, 27 stars)
    Input-aware Dynamic Backdoor Attack (NeurIPS 2020)
36. QC-StyleGAN (Python, 26 stars)
    QC-StyleGAN - Quality Controllable Image Generation and Manipulation (NeurIPS 2022)
37. fsvc-ata (Python, 23 stars)
    Inductive and Transductive Few-Shot Video Classification via Appearance and Temporal Alignments (ECCV 2022)
38. GeoFormer (Python, 23 stars)
    Geodesic-Former: a Geodesic-Guided Few-shot 3D Point Cloud Instance Segmenter (ECCV 2022)
39. PhoST (19 stars)
    A High-Quality and Large-Scale Dataset for English-Vietnamese Speech Translation (INTERSPEECH 2022)
40. MISCA (Python, 18 stars)
    MISCA: A Joint Model for Multiple Intent Detection and Slot Filling with Intent-Slot Co-Attention (EMNLP 2023 Findings)
41. PC3-pytorch (Python, 16 stars)
    Predictive Coding for Locally-Linear Control (ICML-2020)
42. Open3DIS (Python, 16 stars)
    Open3DIS: Open-vocabulary 3D Instance Segmentation with 2D Mask Guidance (CVPR 2024)
43. EFHQ (Python, 15 stars)
    Code and data for the paper "EFHQ: Multi-purpose ExtremePose-Face-HQ dataset" (CVPR 2024)
44. TPC-tensorflow (Python, 14 stars)
    Temporal Predictive Coding For Model-Based Planning In Latent Space (ICML-2021)
45. iFS-RCNN (Python, 14 stars)
    iFS-RCNN: An Incremental Few-shot Instance Segmenter (CVPR 2022)
46. GaPro (Python, 13 stars)
    GaPro: Box-Supervised 3D Point Cloud Instance Segmentation Using Gaussian Processes as Pseudo Labelers (ICCV 2023)
47. HyperCUT (Python, 12 stars)
    HyperCUT: Video Sequence from a Single Blurry Image using Unsupervised Ordering (CVPR'23)
48. LP-OVOD (Python, 11 stars)
    LP-OVOD: Open-Vocabulary Object Detection by Linear Probing (WACV 2024)
49. selfsup_pcd (Python, 8 stars)
    Self-Supervised Learning with Multi-View Rendering for 3D Point Cloud Analysis (ACCV 2022)
50. PointSWD (Python, 7 stars)
    Point-set Distances for Learning Representations of 3D Point Clouds (ICCV 2021)
51. PhoATIS_Disfluency (7 stars)
    From Disfluency Detection to Intent Detection and Slot Filling (INTERSPEECH 2022)
52. JPIS (Python, 6 stars)
    JPIS: A Joint Model for Profile-Based Intent Detection and Slot Filling with Slot-to-Intent Attention (ICASSP 2024)
53. SA-DPM (Python, 5 stars)
    Official PyTorch implementation of "On Inference Stability for Diffusion Models" (AAAI'24)
54. PhoDisfluency (4 stars)
    Disfluency Detection for Vietnamese (WNUT 2022)
55. DiverseDream (Python, 3 stars)
    DiverseDream: A Technique to Generate Diverse 3D Objects from the Same Text Prompt (ECCV 2024)
56. robust-bayesian-recourse (Python, 2 stars)
    Robust Bayesian Recourse: a robust model-agnostic algorithmic recourse method (UAI'22)
57. RDUOT (Python, 1 star)
    Official code for the ECCV 2024 paper "A high-quality robust diffusion framework for corrupted dataset"
58. LAMPAT (Python, 1 star)
    LAMPAT: Low-rank Adaptation Multilingual Paraphrasing using Adversarial Training (AAAI'24)