Speech Separation and Extraction via Deep Learning
This repo collects tutorials, datasets, papers, code, and tools for the speech separation and speaker extraction tasks. Pull requests are welcome.
Table of Contents
Tutorials
-
[Speech Separation, Hung-yi Lee, 2020] [Video (Subtitle)] [Video] [Slide]
-
[Advances in End-to-End Neural Source Separation, Yi Luo, 2020] [Video (BiliBili)] [Video] [Slide]
-
[Audio Source Separation and Speech Enhancement, Emmanuel Vincent, 2018] [Book]
-
[Audio Source Separation, Shoji Makino, 2018] [Book]
-
[Overview Papers] [Paper (Daniel Michelsanti)] [Paper (DeLiang Wang)] [Paper (Bo Xu)] [Paper (Zafar Rafii)] [Paper (Sharon Gannot)]
-
[Overview Slides] [Slide (DeLiang Wang)] [Slide (Haizhou Li)] [Slide (Meng Ge)]
-
[Hand Book] [Ongoing]
Datasets
-
[Dataset Introduction] [Pure Speech Dataset Slide (Meng Ge)] [Audio-Visual Dataset Slide (Zexu Pan)]
-
[WSJ0] [Dataset]
-
[WSJ0-2mix] [Script]
-
[WSJ0-2mix-extr] [Script]
-
[WHAM & WHAMR] [Paper (WHAM)] [Paper (WHAMR)] [Dataset]
-
[SparseLibriMix] [Script]
-
[VCTK-2Mix] [Script]
-
[CHiME-5 & CHiME-6 Challenge] [Dataset]
-
[AudioSet] [Dataset]
-
[Microsoft DNS Challenge] [Dataset]
-
[AVSpeech] [Dataset]
-
[LRW] [Dataset]
-
[LRS2] [Dataset]
-
[VoxCeleb] [Dataset]
Papers
Speech Separation based on Brain Studies
-
[Attentional Selection in a Cocktail Party Environment Can Be Decoded from Single-Trial EEG, James O'Sullivan, Cerebral Cortex 2012] [Paper]
-
[Selective cortical representation of attended speaker in multi-talker speech perception, Nima Mesgarani, Nature 2012] [Paper]
-
[Neural decoding of attentional selection in multi-speaker environments without access to clean sources, James O'Sullivan, Journal of Neural Engineering 2017] [Paper]
-
[Speech synthesis from neural decoding of spoken sentences, Gopala K. Anumanchipalli, Nature 2019] [Paper]
-
[Towards reconstructing intelligible speech from the human auditory cortex, Hassan Akbari, Scientific Reports 2019] [Paper] [Code]
Pure Speech Separation
-
[Joint Optimization of Masks and Deep Recurrent Neural Networks for Monaural Source Separation, Po-Sen Huang, TASLP 2015] [Paper] [Code (posenhuang)]
-
[Complex Ratio Masking for Monaural Speech Separation, DS Williamson, TASLP 2015] [Paper]
-
[Deep clustering: Discriminative embeddings for segmentation and separation, JR Hershey, ICASSP 2016] [Paper] [Code (Kai Li)] [Code (Jian Wu)] [Code (asteroid)]
-
[Single-channel multi-speaker separation using deep clustering, Y Isik, Interspeech 2016] [Paper] [Code (Kai Li)] [Code (Jian Wu)]
-
[Permutation invariant training of deep models for speaker-independent multi-talker speech separation, Dong Yu, ICASSP 2017] [Paper] [Code (Kai Li)] [Code (Sining Sun)]
-
[Recognizing Multi-talker Speech with Permutation Invariant Training, Dong Yu, ICASSP 2017] [Paper]
-
[Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks, M Kolbæk, TASLP 2017] [Paper] [Code (Kai Li)]
-
[Deep attractor network for single-microphone speaker separation, Zhuo Chen, ICASSP 2017] [Paper] [Code (Kai Li)]
-
[Alternative Objective Functions for Deep Clustering, Zhong-Qiu Wang, ICASSP 2018] [Paper]
-
[Listen, Think and Listen Again: Capturing Top-down Auditory Attention for Speaker-independent Speech Separation, Jing Shi, IJCAI 2018] [Paper]
-
[End-to-End Speech Separation with Unfolded Iterative Phase Reconstruction, Zhong-Qiu Wang et al. 2018] [Paper]
-
[Modeling Attention and Memory for Auditory Selection in a Cocktail Party Environment, Jiaming Xu, AAAI 2018] [Paper] [Code]
-
[Speaker-independent Speech Separation with Deep Attractor Network, Luo Yi, TASLP 2018] [Paper] [Code (Kai Li)]
-
[Listening to Each Speaker One by One with Recurrent Selective Hearing Networks, Keisuke Kinoshita, ICASSP 2018] [Paper]
-
[Tasnet: time-domain audio separation network for real-time, single-channel speech separation, Luo Yi, ICASSP 2018] [Paper] [Code (Kai Li)] [Code (asteroid)]
-
[Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation, Luo Yi, TASLP 2019] [Paper] [Code (Kai Li)] [Code (asteroid)]
-
[Divide and Conquer: A Deep CASA Approach to Talker-independent Monaural Speaker Separation, Yuzhou Liu, TASLP 2019] [Paper] [Code] [Code]
-
[Improved Speech Separation with Time-and-Frequency Cross-domain Joint Embedding and Clustering, Gene-Ping Yang, Interspeech 2019] [Paper] [Code]
-
[Dual-path RNN: efficient long sequence modeling for time-domain single-channel speech separation, Luo Yi, Arxiv 2019] [Paper] [Code (Kai Li)]
-
[A comprehensive study of speech separation: spectrogram vs waveform separation, Fahimeh Bahmaninezhad, Interspeech 2019] [Paper]
-
[Discriminative Learning for Monaural Speech Separation Using Deep Embedding Features, Cunhang Fan, Interspeech 2019] [Paper]
-
[Interrupted and cascaded permutation invariant training for speech separation, Gene-Ping Yang, ICASSP 2020] [Paper]
-
[FurcaNeXt: End-to-end monaural speech separation with dynamic gated dilated temporal convolutional networks, Liwen Zhang, MMM 2020] [Paper]
-
[Filterbank design for end-to-end speech separation, Manuel Pariente et al., ICASSP 2020] [Paper]
-
[Voice Separation with an Unknown Number of Multiple Speakers, Eliya Nachmani, Arxiv 2020] [Paper] [Demo]
-
[An Empirical Study of Conv-TasNet, Berkan Kadıoğlu, Arxiv 2020] [Paper] [Code]
-
[Wavesplit: End-to-End Speech Separation by Speaker Clustering, Neil Zeghidour et al., Arxiv 2020] [Paper]
-
[La Furca: Iterative Context-Aware End-to-End Monaural Speech Separation Based on Dual-Path Deep Parallel Inter-Intra Bi-LSTM with Attention, Ziqiang Shi, Arxiv 2020] [Paper]
-
[Deep Attention Fusion Feature for Speech Separation with End-to-End Post-filter Method, Cunhang Fan, Arxiv 2020] [Paper]
-
[Identify Speakers in Cocktail Parties with End-to-End Attention, Junzhe Zhu, Arxiv 2018] [Paper] [Code]
-
[Sequence to Multi-Sequence Learning via Conditional Chain Mapping for Mixture Signals, Jing Shi, Arxiv 2020] [Paper] [Code/Demo]
-
[Speaker-Conditional Chain Model for Speech Separation and Extraction, Jing Shi, Arxiv 2020] [Paper] [Code/Demo]
-
[Improving Voice Separation by Incorporating End-to-end Speech Recognition, Naoya Takahashi, ICASSP 2020] [Paper] [Code]
-
[A Multi-Phase Gammatone Filterbank for Speech Separation via TasNet, David Ditter, ICASSP 2020] [Paper] [Code]
-
[Two-Step Sound Source Separation: Training on Learned Latent Targets, Efthymios Tzinis, ICASSP 2020] [Paper] [Code (Asteroid)] [Code (Tzinis)]
-
[Unsupervised Sound Separation Using Mixtures of Mixtures, Scott Wisdom, Arxiv] [Paper]
-
[Speech Separation Based on Multi-Stage Elaborated Dual-Path Deep BiLSTM with Auxiliary Identity Loss, Ziqiang Shi, 2020] [Paper]
Multi-Modal Speech Separation
-
[Deep Audio-Visual Learning: A Survey, Hao Zhu, Arxiv 2020] [Paper]
-
[Audio-Visual Speech Enhancement Using Multimodal Deep Convolutional Neural Networks, Jen-Cheng Hou, TETCI 2017] [Paper] [Code]
-
[The Sound of Pixels, Hang Zhao, ECCV 2018] [Paper/Demo]
-
[Learning to Separate Object Sounds by Watching Unlabeled Video, Ruohan Gao, ECCV 2018] [Paper]
-
[The Conversation: Deep Audio-Visual Speech Enhancement, Triantafyllos Afouras, Interspeech 2018] [Paper]
-
[End-to-end audiovisual speech recognition, Stavros Petridis, ICASSP 2018] [Paper] [Code]
-
[Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation, Ariel Ephrat, ACM Transactions on Graphics 2018] [Paper] [Code]
-
[Time domain audio visual speech separation, Jian Wu, Arxiv 2019] [Paper]
-
[Recursive Visual Sound Separation Using Minus-Plus Net, Xudong Xu, ICCV 2019] [Paper]
-
[The Sound of Motions, Hang Zhao, ICCV 2019] [Paper]
-
[Audio-Visual Speech Separation and Dereverberation with a Two-Stage Multimodal Network, Ke Tan, Arxiv 2019] [Paper]
-
[Co-Separating Sounds of Visual Objects, Ruohan Gao, ICCV 2019] [Paper] [Code]
-
[Face Landmark-based Speaker-Independent Audio-Visual Speech Enhancement in Multi-Talker Environments, Giovanni Morrone, Arxiv 2019] [Paper] [Code]
-
[Music Gesture for Visual Sound Separation, Chuang Gan, CVPR 2020] [Paper]
-
[FaceFilter: Audio-visual speech separation using still images, Soo-Whan Chung, Arxiv 2020] [Paper]
-
[Awesome Audio-Visual, Github, Kranti Kumar Parida] [Github Link]
Multi-channel Speech Separation
-
[FaSNet: Low-latency Adaptive Beamforming for Multi-microphone Audio Processing, Yi Luo, Arxiv 2019] [Paper]
-
[MIMO-SPEECH: End-to-End Multi-Channel Multi-Speaker Speech Recognition, Xuankai Chang et al., ASRU 2019] [Paper]
-
[End-to-end Microphone Permutation and Number Invariant Multi-channel Speech Separation, Yi Luo et al., ICASSP 2020] [Paper] [Code]
-
[Enhancing End-to-End Multi-channel Speech Separation via Spatial Feature Learning, Rongzhi Gu, ICASSP 2020] [Paper]
-
[Multi-modal Multi-channel Target Speech Separation, Rongzhi Gu, J-STSP 2020] [Paper]
Speaker Extraction
-
[Single channel target speaker extraction and recognition with speaker beam, Marc Delcroix, ICASSP 2018] [Paper]
-
[VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking, Quan Wang, INTERSPEECH 2018] [Paper] [Code (Jian Wu)]
-
[Single-Channel Speech Extraction Using Speaker Inventory and Attention Network, Xiong Xiao et al, ICASSP 2019] [Paper]
-
[Optimization of Speaker Extraction Neural Network with Magnitude and Temporal Spectrum Approximation Loss, Chenglin Xu, ICASSP 2019] [Paper] [Code]
-
[Time-domain speaker extraction network, Chenglin Xu, ASRU 2019] [Paper]
-
[SpEx: Multi-Scale Time Domain Speaker Extraction Network, Chenglin Xu, TASLP 2020] [Paper]
-
[Improving speaker discrimination of target speech extraction with time-domain SpeakerBeam, Marc Delcroix, ICASSP 2020] [Paper]
-
[SpEx+: A Complete Time Domain Speaker Extraction Network, Meng Ge, Arxiv 2020] [Paper] [Code]
Tools
System Tools
- [Asteroid: the PyTorch-based audio source separation toolkit for researchers, Manuel Pariente et al., ICASSP 2020] [Tool Link]
- [ESPnet-se: end-to-end speech enhancement and separation toolkit designed for ASR integration, Chenda Li et al., Arxiv] [Paper Link]
Evaluation Tools
-
[Performance measurement in blind audio source separation, Emmanuel Vincent et al., TASLP 2006] [Paper] [Tool Link]
-
[SDR – Half-baked or Well Done?, Jonathan Le Roux, ICASSP 2019] [Paper] [Tool Link]
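As a quick reference for the metrics these tools compute, here is a minimal sketch of scale-invariant SDR (SI-SDR) and its improvement over the unprocessed mixture (SI-SDRi). It is written in plain Python so it has no dependencies; it is not the code from the tools linked above. The function names, the `eps` stabilizer, and the mean removal step are assumptions made for illustration, and a real evaluation should use one of the linked tools.

```python
import math


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB between two equal-length sample sequences."""
    if len(estimate) != len(reference):
        raise ValueError("estimate and reference must have the same length")
    n = len(reference)
    # Assumption: remove the mean first so a DC offset does not affect the score.
    est_mean = sum(estimate) / n
    ref_mean = sum(reference) / n
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]
    # Optimal scaling of the reference: <est, ref> / ||ref||^2.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref)
    scale = dot / (ref_energy + eps)
    # Energy of the scaled target and of the residual (everything else).
    target_energy = sum((scale * r) ** 2 for r in ref)
    noise_energy = sum((e - scale * r) ** 2 for e, r in zip(est, ref))
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


def si_sdr_improvement(estimate, reference, mixture):
    """SI-SDRi: gain of the estimate over the unprocessed mixture, in dB."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)
```

Because the reference is rescaled before the ratio is taken, multiplying the estimate by any non-zero constant leaves the score essentially unchanged, which is the property that distinguishes SI-SDR from plain SNR.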
Results on WSJ0-2mix
Results for speech separation (SS) and speaker extraction (SE) on the WSJ0-2mix dataset (8 kHz sampling rate, "min" mode).
Task | Method | Model Size | SDRi (dB) | SI-SDRi (dB) |
---|---|---|---|---|
SS | DPCL++ | 13.6M | - | 10.8 |
SS | uPIT-BLSTM-ST | 92.7M | 10.0 | - |
SS | DANet | 9.1M | - | 10.5 |
SS | cuPIT-Grid-RD | 53.2M | 10.2 | - |
SS | SDC-G-MTL | 53.9M | 10.5 | - |
SS | CBLDNN-GAT | 39.5M | 11.0 | - |
SS | Chimera++ | 32.9M | 12.0 | 11.5 |
SS | WA-MISI-5 | 32.9M | 13.1 | 12.6 |
SS | BLSTM-TasNet | 23.6M | 13.6 | 13.2 |
SS | Conv-TasNet | 5.1M | 15.6 | 15.3 |
SE | SpEx | 10.8M | 17.0 | 16.6 |
SE | SpEx+ | 11.1M | 17.6 | 17.4 |
SS | DeepCASA | 12.8M | 18.0 | 17.7 |
SS | FurcaNeXt | 51.4M | 18.4 | - |
SS | DPRNN-TasNet | 2.6M | 19.0 | 18.8 |
SS | Wavesplit | - | 19.2 | 19.0 |
SS | Wavesplit + Dynamic mixing | - | 20.6 | 20.4 |
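The "min" setting used for these results means each mixture is truncated to the length of its shorter utterance. The sketch below illustrates that idea together with scaling the second speaker to a chosen signal-to-noise ratio. It is not the official WSJ0-2mix generation script: the function name `mix_min_mode`, its arguments, and the plain mean-power SNR definition are assumptions made for illustration.

```python
import math


def mix_min_mode(source_a, source_b, snr_db):
    """Mix two utterances, truncating both to the shorter one ("min" mode).

    Returns (mixture, a, b), where b has been rescaled so that the power
    ratio of a to b equals snr_db.
    """
    length = min(len(source_a), len(source_b))
    if length == 0:
        raise ValueError("both sources must contain at least one sample")
    a = list(source_a[:length])
    b = list(source_b[:length])
    power_a = sum(x * x for x in a) / length
    power_b = sum(x * x for x in b) / length
    if power_a == 0.0 or power_b == 0.0:
        raise ValueError("sources must not be silent")
    # Gain for b such that power_a / (gain^2 * power_b) == 10^(snr_db / 10).
    gain = math.sqrt(power_a / (power_b * 10.0 ** (snr_db / 10.0)))
    b = [gain * x for x in b]
    mixture = [x + y for x, y in zip(a, b)]
    return mixture, a, b


# Usage: a 10-sample and a 6-sample signal give a 6-sample mixture.
mixture, a, b = mix_min_mode([1.0, -1.0] * 5, [0.5, -0.5] * 3, snr_db=6.0)
print(len(mixture))
```

Returning the truncated, rescaled sources alongside the mixture keeps the references aligned with the mixture, which is what SDRi and SI-SDRi are computed against.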