TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation
Rongjie Huang*, Jinglin Liu*, Huadai Liu*, Yi Ren, Lichao Zhang, Jinzheng He, Zhou Zhao | Zhejiang University, ByteDance
PyTorch Implementation of TranSpeech (ICLR'23): a speech-to-speech translation model towards high-accuracy and non-autoregressive translation.
We provide our implementation and pretrained models in this repository.
Visit our demo page for audio samples.
🔥 News
TranSpeech is one of our continuous efforts to reduce communication barrier.
- July, 2022: TranSpeech released at Arxiv.
- March, 2023: TranSpeech (ICLR 2023) released at Github.
- March, 2023: Audio-Visual Speech-To-Text Translation may also interest you.
Dependencies
- PyTorch version >= 1.5.0
- Python version >= 3.6
- For training new models, you'll also need an NVIDIA GPU and NCCL
- To install fairseq version 1.0.0a0 and develop locally:
pip install --editable ./
- For faster training install NVIDIA's apex library:
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
--global-option="--deprecated_fused_adam" --global-option="--xentropy" \
--global-option="--fast_multihead_attn" ./
Train your own model
Data preparation
- Prepare two folders,
$SRC_AUDIO
and$TGT_AUDIO
, with${SPLIT}/${SAMPLE_ID}.wav
for source and target speech under each folder, separately. Note that for S2UT experiments, target audio sampling rate should be in 16,000 Hz, and for S2SPECT experiments, target audio sampling rate is recommended to be in 22,050 Hz. - To prepare target discrete units for S2UT model training, see Generative Spoken Language Modeling (speech2unit) for pre-trained k-means models, checkpoints, and instructions on how to decode units from speech. Set the output target unit files (
--out_quantized_file_path
) as${TGT_AUDIO}/${SPLIT}.txt
. In Lee et al. 2021, we use 100 units from the sixth layer (--layer 6
) of the HuBERT Base model.
Bilateral Perturbation
1. Prepare a pretrained Hubert and HifiGAN
Model | Pretraining Data | Model | Quantizer |
---|---|---|---|
mHuBERT Base | En, Es, Fr speech | download | L11 km1000 |
HIFIGAN | 16k Universal | download | |
dict.unit.txt | download |
2. Prepare Perturbated Dataset
Suppose we have original dataset at /path/to/TGT_AUDIO
- information enhancement: refer to
./hubertCTC/gen_IE.py
and generate Dataset S1
python research/TranSpeech/hubertCTC/gen_IE.py --ckpt /path/to/ckpt --wav /path/to/TGT_AUDIO --out /path/to/S2/dataset
- style normalization: refer to
./hubertCTC/gen_SN.py
and generate Dataset S2:
python research/TranSpeech/hubertCTC/gen_SN.py --wav /path/to/TGT_AUDIO --out /path/to/S1/dataset
3. Prepare Pseudo Text
- Get Manifest
python examples/wav2vec/wav2vec_manifest.py /path/to/S1/dataset --dest /manifest/to/S1/dataset --ext $ext --valid-percent $valid
python examples/wav2vec/wav2vec_manifest.py /path/to/S2/dataset --dest /manifest/to/S2/dataset --ext $ext --valid-percent $valid
$ext should be set to flac, wav, or whatever format your dataset happens to use that soundfile can read. $valid should be set to some reasonable percentage (like 0.01) of training data to use for validation.
- Quantize using the learned clusters
MANIFEST=/manifest/to/S2/dataset
OUT_QUANTIZED_FILE=/quantized/to/S2/dataset
For CKPT_PATH & KM_MODEL_PATH, refer to Section 1.
python examples/textless_nlp/gslm/speech2unit/clustering/quantize_with_kmeans.py \
--feature_type hubert \
--kmeans_model_path $KM_MODEL_PATH \
--acoustic_model_path $CKPT_PATH \
--layer 11 \
--manifest_path $MANIFEST \
--out_quantized_file_path $OUT_QUANTIZED_FILE \
--extension ".flac"
- Prepare {train,valid}.unit
python data/huberts/generate_tunehuberts.py --manifest /manifest/to/S2/dataset --txt /quantized/to/S2/dataset --unit /unit/to/S2/dataset
4. Fine-tune a HuBERT model with a CTC loss
Suppose we have a mHuBERT Base ckpt at /path/to/checkpoint
Suppose {train,valid}.tsv are saved at /manifest/to/S1/dataset
, and their corresponding character transcripts {train,valid}.unit and dict.unit.txt are saved at /unit/to/S2/dataset
.
- To fine-tune a pre-trained HuBERT model at /path/to/checkpoint, run
$ python fairseq_cli/hydra_train.py \
--config-dir /path/to/fairseq-py/examples/hubert/config/finetune \
--config-name base_10h_change \
task.data=/manifest/to/S1/dataset task.label_dir=/unit/to/S2/dataset \
model.w2v_path=/path/to/checkpoint optimization.max_update=70000
Prepare data for training S2UT model
1. Inference with Tuned Huberts
- Format the audio data.
AUDIO_EXT: audio extension, e.g. wav, flac, etc.
Assume all audio files are at ${AUDIO_DIR}/*.${AUDIO_EXT}
${GEN_SUBSET} should be train, test, or dev
python examples/speech_to_speech/preprocessing/prep_sn_data.py \
--audio-dir /path/to/TGT_AUDIO --ext ${AUIDO_EXT} \
--data-name ${GEN_SUBSET} --output-dir ${DATA_DIR} \
--for-inference
- Run the Tuned Huberts.
mkdir -p ${RESULTS_PATH}
python examples/speech_recognition/new/infer.py \
--config-dir examples/hubert/config/decode/ \
--config-name infer_viterbi \
task.data=${DATA_DIR} \
task.normalize=false \
common_eval.results_path=${RESULTS_PATH}/log \
common_eval.path=${DATA_DIR}/checkpoint_best.pt \
dataset.gen_subset=${GEN_SUBSET} \
'+task.labels=["unit"]' \
+decoding.results_path=${RESULTS_PATH} \
common_eval.post_process=none \
+dataset.batch_size=1 \
common_eval.quiet=True
- Post-process and generate output at
${RESULTS_PATH}/${GEN_SUBSET}.txt
python examples/speech_to_speech/preprocessing/prep_sn_output_data.py \
--in-unit ${RESULTS_PATH}/hypo.units \
--in-audio ${DATA_DIR}/${GEN_SUBSET}.tsv \
--output-root ${RESULTS_PATH}
2. Formatting Speech-to-Speech Translation data
# $SPLIT1, $SPLIT2, etc. are split names such as train, dev, test, etc.
python examples/speech_to_speech/preprocessing/prep_s2ut_data.py \
--source-dir $SRC_AUDIO --target-dir $TGT_AUDIO --data-split $SPLIT1 $SPLIT2 \
--output-root $DATA_ROOT --reduce-unit \
--vocoder-checkpoint $VOCODER_CKPT --vocoder-cfg $VOCODER_CFG
For knowledge distillation, we need another step to format the data from teacher.
Training S2UT model
Here's an example for training nar_s2ut_conformer S2UT models with 1000 discrete units as target:
fairseq-train $DATA_ROOT \
--config-yaml config.yaml \
--task speech_to_speech_fasttranslate --target-is-code --target-code-size 1000 --vocoder code_hifigan \
--criterion nar_speech_to_unit --label-smoothing 0.2 \
--arch nar_s2ut_conformer --share-decoder-input-output-embed \
--dropout 0.1 --attention-dropout 0.1 --relu-dropout 0.1 \
--train-subset train --valid-subset dev \
--save-dir ${MODEL_DIR} --tensorboard-logdir ${MODEL_DIR} \
--lr 0.0005 --lr-scheduler inverse_sqrt --warmup-init-lr 1e-7 --warmup-updates 10000 \
--optimizer adam --adam-betas "(0.9,0.98)" --clip-norm 10.0 \
--max-update 400000 --max-tokens 20000 --max-target-positions 3000 --update-freq 4 \
--seed 1 --fp16 --num-workers 8 \
--user-dir research/ --attn-type espnet --pos-enc-type rel_pos
- Adjust
--update-freq
accordingly for different #GPUs. In the above we set--update-freq 4
to simulate training with 4 GPUs.
Inference with NAR S2UT model
- Follow the same inference process as in fairseq-S2T to generate unit sequences (${RESULTS_PATH}/generate-${GEN_SUBSET}.txt).
fairseq-generate $DATA_ROOT \
--gen-subset test --task speech_to_speech_fasttranslate --path ${MODEL_DIR} \
--target-is-code --target-code-size 1000 --vocoder code_hifigan --results-path ${OUTPUT_DIR} \
--iter-decode-max-iter $N --iter-decode-eos-penalty 0 --beam 1 --iter-decode-with-beam 15
- Noisy decoding: inference with
--external-reranker --path ${checkpoint_path} = a:b
, wherea, b
denote the student and AR tracher.
- Convert unit sequences to waveform.
grep "^D\-" ${RESULTS_PATH}/generate-${GEN_SUBSET}.txt | \
sed 's/^D-//ig' | sort -nk1 | cut -f3 \
> ${RESULTS_PATH}/generate-${GEN_SUBSET}.unit
Unit-to-Speech HiFi-GAN vocoder
Unit config | Unit size | Vocoder language | Dataset | Model |
---|---|---|---|---|
mHuBERT, layer 11 | 1000 | En | LJSpeech | ckpt, config |
mHuBERT, layer 11 | 1000 | Es | CSS10 | ckpt, config |
mHuBERT, layer 11 | 1000 | Fr | CSS10 | ckpt, config |
python examples/speech_to_speech/generate_waveform_from_code.py \
--in-code-file ${RESULTS_PATH}/generate-${GEN_SUBSET}.unit \
--vocoder $VOCODER_CKPT --vocoder-cfg $VOCODER_CFG \
--results-path ${RESULTS_PATH} --dur-prediction
Evaluation
Refer to research/TranSpeech/asr_bleu/README.md
Acknowledgements
This implementation uses parts of the code from the following Github repos: Fairseq, as described in our code.
Citations
If you find this code useful in your research, please cite our work:
@article{huang2022transpeech,
title={TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation},
author={Huang, Rongjie and Zhao, Zhou and Liu, Jinglin and Liu, Huadai and Ren, Yi and Zhang, Lichao and He, Jinzheng},
journal={arXiv preprint arXiv:2205.12523},
year={2022}
}
Disclaimer
Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's speech without his/her consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.