Face Landmark-based Speaker-Independent Audio-Visual Speech Enhancement in Multi-Talker Environments
Implementation of the audio-visual speech enhancement system described in the paper Face Landmark-based Speaker-Independent Audio-Visual Speech Enhancement in Multi-Talker Environments by University of Modena and Reggio Emilia and Istituto Italiano di Tecnologia.
If you are interested in this work check out the project page.
Getting Started
Install requirements
All code is written for Python 3. Create a virtual environment (optional) and install all the requirements running:
pip install -r requirements.txt
Usage
The main program is av_speech_enhancement.py
. You can get a list of subcommands typing av_speech_enhancement.py -h
. Try av_speech_enhancement.py <subcommand> -h
for more information about a subcommand.
The audio-visual dataset must have the following directory structure:
s1
/audio
/file1.waw
/file2.wav
...
/video
/file1.mpg
/file2.mpg
...
s2
/audio
/file1.wav
/file2.wav
...
/video
/file1.mpg
/file2.mpg
...
...
Mixed-speech generation
Generate mixed-speech for training, validation and test sets separately:
av_speech_enhancement.py mixed_speech_generator
--data_dir <data_dir>
--base_speaker_ids <spk1> <spk2> <...>
[--noisy_speaker_ids <spk1> <spk2> <...>]
--audio_dir <audio_dir>
--dest_dir <dest_dir>
--num_samples <num_samples>
--num_mix <num_mix>
--num_mix_speakers <num_mix_speakers> {1,2}
The generated files are organized as follow:
TRAINING_SET
/s1
/file1_with_s2_file2.wav
/file2_with_s10_file4.wav
...
/s2
/file1_with_s12_file5.wav
/file2_with_s1_file1.wav
...
...
VALIDATION_SET
...
TEST_SET
...
Audio pre-processing
Compute power-law compressed spectrograms of mixed-speech audio samples. Repeat this operation for training, validation and test sets. Files are saved in NPY format.
av_speech_enhancement.py audio_preprocessing
--data_dir <data_dir>
--speaker_ids <spk1> <spk2> <...>
--audio_dir <audio_dir>
--dest_dir <dest_dir>
--sample_rate <sample_rate>
--max_wav_length <max_wav_length>
Video pre-processing
Extract face landmarks from video using Dlib face detector and face landmark extractor. Files are saved in TXT format (each row has 136 values that represents the flattened x-y values of 68 face landmarks).
av_speech_enhancement.py video_preprocessing
--data_dir <data_dir>
--speaker_ids <spk1> <spk2> <...>
--video_dir <video_dir>
--dest_dir <dest_dir>
--shape_predictor <shape_predictor_file>
--ext <video_file_extension>
<shape_predictor_file>
contains the parameters of the face landmark extractor model. You can download a pre-trained model file here.
If you want to check the result of the face landmark extractor type:
av_speech_enhancement.py show_face_landmarks
--video <video_file>
--fps <fps>
--shape_predictor <shape_predictor_file>
Computing Target Binary Masks
Compute TBMs from clean audio samples. For each speaker Long-Term Average Speech Spectrum (LTASS) is computed and then the threshold is applied to all clean audio samples in <audio_dir>
.
av_speech_enhancement.py tbm_computation
--data_dir <data_dir>
--speaker_ids <spk1> <spk2> <...>
--audio_dir <audio_dir>
--dest_dir <dest_dir>
--sample_rate <sample_rate>
--max_wav_length <max_wav_length>
TFRecords generation
Before training you have to generate TFRecords of mixed-speech dataset. <data_dir>/<mix_dir>
must have three subdirectories named TRAINING_SET
, VALIDATION_SET
and TEST_SET
created with <mixed_speech_generator>
subcommand. Pre-computed spectrogram (NPY format) must be located in the same directory of audio file.
Set <tfrecords_mode>
to "fixed" if samples of the dataset all have the same length (as in GRID corpus), otherwise use "var" (as in TCD-TIMIT corpus).
av_speech_enhancement.py tfrecords_generator
--data_dir <data_dir>
--num_speakers <number_speakers_mixed> {2,3}
--mode <tfrecords_mode> {fixed,var}
--dest_dir <dest_dir>
--base_audio_dir <base_audio_dir>
--video_dir <video_dir>
--tbm_dir <tbm_dir>
--mix_audio_dir <mix_audio_dir>
--delta <delta_video_feat> {0,1,2]
--norm_data_dir <normalization_data_dir>
Training
Train an audio-visual speech enhancement model described. You can choose between VL2M, VL2M_ref, Audio-Visual Concat and Audio-Visual Concat-ref models.
av_speech_enhancement.py training
--data_dir <data_dir>
--train_set <training_set_subdir>
--validation_set <validation_set_subdir>
--exp <experiment_id>
--mode <tfrecords_mode> {fixed,var}
--audio_dim <audio_frame_dimension>
--video_dim <video_frame_dimension>
--num_audio_samples <num_audio_samples>
--model <model_selection> {vl2m,vl2m_ref,av_concat_mask,av_concat_mask_ref}
--opt <optimizer_choice> {sgd,adam,momentum}
--learning_rate <learning_rate>
--updating_step <updating_step>
--learning_decay <learning_decay>
--batch_size <batch_size>
--epochs <num_epochs>
--hidden_units <num_hidden_lstm_units>
--layers <num_lstm_layers>
--dropout <dropout_rate>
--regularization <regularization_weight>
Testing
Test your trained model. Enhanced speech samples and estimated masks are saved in <data_dir>/<output_dir>
. Estimated masks are saved in subdirectories <mask_dir>
of each speaker directory.
av_speech_enhancement.py testing
--data_dir <data_dir>
--test_set <training_set_subdir>
--exp <experiment_id>
--ckp <model_checkpoint>
--mode <tfrecords_mode> {fixed,var}
--audio_dim <audio_frame_dimension>
--video_dim <video_frame_dimension>
--num_audio_samples <num_audio_samples>
--output_dir <output_dir>
--mask_dir <mask_dir>
Reference
If this project is useful for your research, please cite:
@inproceedings{morrone2019face,
title={Face Landmark-based Speaker-Independent Audio-Visual Speech Enhancement in Multi-Talker Environments},
author={Morrone, Giovanni and Bergamaschi, Sonia and Pasa, Luca and Fadiga, Luciano and Tikhanoff, Vadim and Badino, Leonardo},
booktitle={2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
pages={6900-6904},
year={2019},
organization={IEEE}
}