Visual Speech Recognition for Multiple Languages

Authors

Pingchuan Ma, Alexandros Haliassos, Adriana Fernandez-Lopez, Honglie Chen, Stavros Petridis, Maja Pantic.

Update

2023-03-27: We have released our AutoAVSR models for LRS3, see here.

Introduction

This is the repository of Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels and Visual Speech Recognition for Multiple Languages, the successor of End-to-End Audio-Visual Speech Recognition with Conformers. With this repository, you can achieve WERs of 19.1%, 1.0%, and 0.9% for visual, audio, and audio-visual speech recognition (VSR, ASR, and AV-ASR) on LRS3.

Tutorial

We provide a tutorial (Open in Colab) showing how to use our Auto-AVSR models to perform speech recognition (ASR, VSR, and AV-ASR), crop mouth ROIs, or extract visual speech features.

Demo

English -> Mandarin -> Spanish -> French -> Portuguese -> Italian

Preparation

  1. Clone the repository and enter it locally:
git clone https://github.com/mpc001/Visual_Speech_Recognition_for_Multiple_Languages
cd Visual_Speech_Recognition_for_Multiple_Languages
  2. Set up the environment:
conda create -y -n autoavsr python=3.8
conda activate autoavsr
  3. Install PyTorch, torchvision, and torchaudio by following the instructions here, then install the remaining packages:
pip install -r requirements.txt
conda install -c conda-forge ffmpeg
  4. Download and extract a pre-trained model and/or language model from the model zoo to:
  • ./benchmarks/${dataset}/models

  • ./benchmarks/${dataset}/language_models

  5. [For VSR and AV-ASR] Install the RetinaFace or MediaPipe tracker.
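After step 4, the downloaded models must sit in the directories the scripts expect. A minimal stdlib sketch (the helper name is hypothetical, the paths are the ones from step 4) to verify the layout:

```python
import os

def check_model_dirs(root=".", dataset="lrs3"):
    """Check that the directories expected by step 4 exist.

    `dataset` is a placeholder for whichever benchmark you downloaded;
    the layout described above is ./benchmarks/${dataset}/models and
    ./benchmarks/${dataset}/language_models.
    """
    expected = [
        os.path.join(root, "benchmarks", dataset, "models"),
        os.path.join(root, "benchmarks", dataset, "language_models"),
    ]
    # Map each expected path to whether it currently exists.
    return {path: os.path.isdir(path) for path in expected}
```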

Benchmark evaluation

python eval.py config_filename=[config_filename] \
               labels_filename=[labels_filename] \
               data_dir=[data_dir] \
               landmarks_dir=[landmarks_dir]
  • [config_filename] is the model configuration path, located in ./configs.

  • [labels_filename] is the labels path, located in ${lipreading_root}/benchmarks/${dataset}/labels.

  • [data_dir] and [landmarks_dir] are the directories for the original dataset and the corresponding landmarks, respectively.

  • gpu_idx=-1 can be added to switch from cuda:0 to cpu.
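The WER numbers reported by the evaluation (and in the model zoo below) are word error rates: the Levenshtein edit distance between hypothesis and reference word sequences, divided by the reference length. An illustrative stdlib-only computation — not the repository's own scoring code:

```python
def wer(reference, hypothesis):
    """Word error rate: edit distance over reference length.

    `reference` and `hypothesis` are lists of tokens; pass lists of
    characters instead to get the CER reported for CMLR below.
    """
    # Standard dynamic-programming Levenshtein distance.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hypothesis, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1] / len(reference)
```

For example, `wer("the cat sat".split(), "the cat sit".split())` is one substitution over three reference words, i.e. 1/3.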

Speech prediction

python infer.py config_filename=[config_filename] data_filename=[data_filename]
  • data_filename is the path to the audio/video file.

  • detector=mediapipe can be added to switch from RetinaFace to MediaPipe tracker.
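Both eval.py and infer.py take options as `key=value` pairs on the command line. A stdlib sketch that mimics this argument format — only an illustration, not the repository's actual parser:

```python
def parse_overrides(argv):
    """Parse `key=value` command-line overrides into a dict,
    in the style accepted by eval.py and infer.py above."""
    options = {}
    for arg in argv:
        key, sep, value = arg.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got: {arg!r}")
        options[key] = value
    return options
```

For example, `parse_overrides(["data_filename=video.mp4", "detector=mediapipe"])` yields `{"data_filename": "video.mp4", "detector": "mediapipe"}`.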

Mouth ROIs cropping

python crop_mouth.py data_filename=[data_filename] dst_filename=[dst_filename]
  • dst_filename is the path where the cropped mouth will be saved.
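Conceptually, cropping a mouth ROI means taking a bounding box around tracked lip landmarks, padded by a margin and clipped to the frame. A minimal pure-Python sketch of that idea (the function name and margin are hypothetical; the repository's own pipeline works on tracked facial landmarks and video tensors):

```python
def crop_mouth_roi(frame, landmarks, margin=12):
    """Crop a mouth region from one video frame.

    `frame` is a 2-D grid (list of rows) of pixel values and
    `landmarks` a list of (x, y) mouth-landmark coordinates.
    """
    xs = [x for x, _ in landmarks]
    ys = [y for _, y in landmarks]
    height, width = len(frame), len(frame[0])
    # Expand the landmark bounding box by `margin` pixels, clipped to the frame.
    x0, x1 = max(min(xs) - margin, 0), min(max(xs) + margin, width)
    y0, y1 = max(min(ys) - margin, 0), min(max(ys) + margin, height)
    return [row[x0:x1] for row in frame[y0:y1]]
```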

Model zoo

Overview

We support a number of datasets for speech recognition:

AutoAVSR models

Lip Reading Sentences 3 (LRS3)

| Components      | WER  | url                                   | size (MB) |
|-----------------|------|---------------------------------------|-----------|
| Visual-only     | 19.1 | GoogleDrive or BaiduDrive (key: dqsy) | 891       |
| Audio-only      | 1.0  | GoogleDrive or BaiduDrive (key: dvf2) | 860       |
| Audio-visual    | 0.9  | GoogleDrive or BaiduDrive (key: sai5) | 1540      |
| Language models | -    | GoogleDrive or BaiduDrive (key: t9ep) | 191       |
| Landmarks       | -    | GoogleDrive or BaiduDrive (key: mi3c) | 18577     |

VSR for multiple languages models

Lip Reading Sentences 2 (LRS2)

| Components      | WER  | url                                   | size (MB) |
|-----------------|------|---------------------------------------|-----------|
| Visual-only     | 26.1 | GoogleDrive or BaiduDrive (key: 48l1) | 186       |
| Language models | -    | GoogleDrive or BaiduDrive (key: 59u2) | 180       |
| Landmarks       | -    | GoogleDrive or BaiduDrive (key: 53rc) | 9358      |

Lip Reading Sentences 3 (LRS3)

| Components      | WER  | url                                   | size (MB) |
|-----------------|------|---------------------------------------|-----------|
| Visual-only     | 32.3 | GoogleDrive or BaiduDrive (key: 1b1s) | 186       |
| Language models | -    | GoogleDrive or BaiduDrive (key: 59u2) | 180       |
| Landmarks       | -    | GoogleDrive or BaiduDrive (key: mi3c) | 18577     |

Chinese Mandarin Lip Reading (CMLR)

| Components      | CER  | url                                   | size (MB) |
|-----------------|------|---------------------------------------|-----------|
| Visual-only     | 8.0  | GoogleDrive or BaiduDrive (key: 7eq1) | 195       |
| Language models | -    | GoogleDrive or BaiduDrive (key: k8iv) | 187       |
| Landmarks       | -    | GoogleDrive or BaiduDrive (key: 1ret) | 3721      |

CMU Multimodal Opinion Sentiment, Emotions and Attributes (CMU-MOSEAS)

| Components                   | WER  | url                                   | size (MB) |
|------------------------------|------|---------------------------------------|-----------|
| Visual-only (Spanish)        | 44.5 | GoogleDrive or BaiduDrive (key: m35h) | 186       |
| Visual-only (Portuguese)     | 51.4 | GoogleDrive or BaiduDrive (key: wk2h) | 186       |
| Visual-only (French)         | 58.6 | GoogleDrive or BaiduDrive (key: t1hf) | 186       |
| Language models (Spanish)    | -    | GoogleDrive or BaiduDrive (key: 0mii) | 180       |
| Language models (Portuguese) | -    | GoogleDrive or BaiduDrive (key: l6ag) | 179       |
| Language models (French)     | -    | GoogleDrive or BaiduDrive (key: 6tan) | 179       |
| Landmarks                    | -    | GoogleDrive or BaiduDrive (key: vsic) | 3040      |

GRID

| Components               | WER  | url                                   | size (MB) |
|--------------------------|------|---------------------------------------|-----------|
| Visual-only (Overlapped) | 1.2  | GoogleDrive or BaiduDrive (key: d8d2) | 186       |
| Visual-only (Unseen)     | 4.8  | GoogleDrive or BaiduDrive (key: ttsh) | 186       |
| Landmarks                | -    | GoogleDrive or BaiduDrive (key: 16l9) | 1141      |

You can include data_ext=.mpg in your command line to match the video file extension in the GRID dataset.

Lombard GRID

| Components                        | WER  | url                                   | size (MB) |
|-----------------------------------|------|---------------------------------------|-----------|
| Visual-only (Unseen, Front Plain) | 4.9  | GoogleDrive or BaiduDrive (key: 38ds) | 186       |
| Visual-only (Unseen, Side Plain)  | 8.0  | GoogleDrive or BaiduDrive (key: k6m0) | 186       |
| Landmarks                         | -    | GoogleDrive or BaiduDrive (key: cusv) | 309       |

You can include data_ext=.mov in your command line to match the video file extension in the Lombard GRID dataset.

TCD-TIMIT

| Components               | WER  | url                                   | size (MB) |
|--------------------------|------|---------------------------------------|-----------|
| Visual-only (Overlapped) | 16.9 | GoogleDrive or BaiduDrive (key: jh65) | 186       |
| Visual-only (Unseen)     | 21.8 | GoogleDrive or BaiduDrive (key: n2gr) | 186       |
| Language models          | -    | GoogleDrive or BaiduDrive (key: 59u2) | 180       |
| Landmarks                | -    | GoogleDrive or BaiduDrive (key: bnm8) | 930       |

Citation

If you use the AutoAVSR models, please consider citing the following paper:

@inproceedings{ma2023auto,
  author={Ma, Pingchuan and Haliassos, Alexandros and Fernandez-Lopez, Adriana and Chen, Honglie and Petridis, Stavros and Pantic, Maja},
  booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels}, 
  year={2023},
}

If you use the VSR models for multiple languages, please consider citing the following paper:

@article{ma2022visual,
  title={{Visual Speech Recognition for Multiple Languages in the Wild}},
  author={Ma, Pingchuan and Petridis, Stavros and Pantic, Maja},
  journal={{Nature Machine Intelligence}},
  volume={4},
  pages={930--939},
  year={2022},
  url={https://doi.org/10.1038/s42256-022-00550-z},
  doi={10.1038/s42256-022-00550-z}
}

License

Note that the code may only be used for comparative or benchmarking purposes, and only for non-commercial purposes under the supplied License.

Contact

[Pingchuan Ma](pingchuan.ma16[at]imperial.ac.uk)