3D-Speaker is an open-source toolkit for single- and multi-modal speaker verification, speaker recognition, and speaker diarization. All pretrained models are available on ModelScope. We also present a large-scale speech corpus, likewise named 3D-Speaker, to facilitate research on speech representation disentanglement.
Quickstart
Install 3D-Speaker
git clone https://github.com/alibaba-damo-academy/3D-Speaker.git && cd 3D-Speaker
conda create -n 3D-Speaker python=3.8
conda activate 3D-Speaker
pip install -r requirements.txt
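After installation, a quick sanity check can confirm that the intended environment is active. The following is a minimal sketch, not part of 3D-Speaker: `version_matches` and `check_env` are hypothetical helper names, and the check assumes the environment name and Python version used in the commands above.

```shell
#!/usr/bin/env bash
# Hypothetical sanity check (not part of 3D-Speaker).

# Succeeds when the version string equals major.minor or starts with "major.minor.".
version_matches() {
  local actual="$1" expected="$2"
  case "$actual" in
    "$expected"|"$expected".*) return 0 ;;
    *) return 1 ;;
  esac
}

check_env() {
  # CONDA_DEFAULT_ENV is set by `conda activate`.
  if [ "${CONDA_DEFAULT_ENV:-}" != "3D-Speaker" ]; then
    echo "conda env 3D-Speaker is not active" >&2
    return 1
  fi
  local py_version
  py_version="$(python -c 'import platform; print(platform.python_version())')" || return 1
  if ! version_matches "$py_version" "3.8"; then
    echo "expected Python 3.8, found $py_version" >&2
    return 1
  fi
  echo "environment looks OK (Python $py_version)"
}
```

Calling `check_env` in the activated shell prints a confirmation or an error describing the mismatch.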
Running experiments
# Speaker verification: ERes2Net on 3D-Speaker
cd egs/3dspeaker/sv-eres2net/
bash run.sh
# Speaker verification: CAM++ on 3D-Speaker
cd egs/3dspeaker/sv-cam++/
bash run.sh
# Self-supervised speaker verification: RDINO on 3D-Speaker
cd egs/3dspeaker/sv-rdino/
bash run.sh
# Speaker diarization:
cd egs/3dspeaker/speaker-diarization/
bash run.sh
# Language identification
cd egs/3dspeaker/language-idenitfication
bash run.sh
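Each `cd` above is relative to the repository root, so running the commands back to back in one shell would fail after the first recipe. One way around this is to run each recipe in a subshell. The helper below is a sketch under that assumption: `run_recipe`, `REPO_ROOT`, and `DRY_RUN` are hypothetical names introduced here, not part of the toolkit.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of 3D-Speaker): run one recipe from the repo root.
# REPO_ROOT and DRY_RUN are illustrative variables introduced for this sketch.
REPO_ROOT="${REPO_ROOT:-.}"

run_recipe() {
  local recipe_dir="$REPO_ROOT/egs/3dspeaker/$1"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Only print what would be executed.
    echo "cd $recipe_dir && bash run.sh"
    return 0
  fi
  if [ ! -d "$recipe_dir" ]; then
    echo "recipe directory not found: $recipe_dir" >&2
    return 1
  fi
  # The subshell keeps the caller's working directory unchanged.
  (cd "$recipe_dir" && bash run.sh)
}
```

For example, `run_recipe sv-cam++` followed by `run_recipe sv-eres2net` runs both recipes from the same starting directory, and setting `DRY_RUN=1` prints the commands without executing them.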
Inference using pretrained models from ModelScope
All pretrained models are released on ModelScope.
# Install modelscope
pip install modelscope
# CAM++ trained on VoxCeleb
model_id=damo/speech_campplus_sv_en_voxceleb_16k
# CAM++ trained on 200k labeled speakers
model_id=damo/speech_campplus_sv_zh-cn_16k-common
# ERes2Net trained on VoxCeleb
model_id=damo/speech_eres2net_sv_en_voxceleb_16k
# ERes2Net trained on 200k labeled speakers
model_id=damo/speech_eres2net_sv_zh-cn_16k-common
# Run CAM++ or ERes2Net inference
python speakerlab/bin/infer_sv.py --model_id $model_id --wavs $wav_path
# RDINO trained on VoxCeleb
model_id=damo/speech_rdino_ecapa_tdnn_sv_en_voxceleb_16k
# Run RDINO inference
python speakerlab/bin/infer_sv_rdino.py --model_id $model_id --wavs $wav_path
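The commands above leave `$wav_path` undefined and use a different script for RDINO than for CAM++ and ERes2Net. The sketch below wraps that choice in two hypothetical helpers, `infer_script_for` and `run_infer`. Selecting the script by the substring `rdino` in the model ID is an assumption based on the IDs listed above, and `DRY_RUN` is an illustrative variable, not a toolkit option.

```shell
#!/usr/bin/env bash
# Hypothetical helper: pick the inference script for a ModelScope model ID.
# Matching on the substring "rdino" is an assumption based on the IDs listed above.
infer_script_for() {
  case "$1" in
    *rdino*) echo "speakerlab/bin/infer_sv_rdino.py" ;;
    *)       echo "speakerlab/bin/infer_sv.py" ;;
  esac
}

# Print (DRY_RUN=1) or run the inference command for one model and one wav path.
run_infer() {
  local model_id="$1" wav_path="$2"
  local script
  script="$(infer_script_for "$model_id")"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "python $script --model_id $model_id --wavs $wav_path"
  else
    python "$script" --model_id "$model_id" --wavs "$wav_path"
  fi
}
```

Run from the repository root, a call such as `run_infer damo/speech_campplus_sv_en_voxceleb_16k path/to/audio.wav` issues the same command as shown above, where `path/to/audio.wav` is a placeholder for your own recording.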
Overview of Content
- Supervised Speaker Verification
- Self-supervised Speaker Verification
  - RDINO training recipes on VoxCeleb
  - RDINO training recipes on 3D-Speaker
- Speaker Diarization
  - Speaker diarization inference recipes comprising multiple modules: voice activity detection, speech segmentation, speaker embedding extraction, and speaker clustering
- Language Identification
  - Language identification training recipes on 3D-Speaker
- 3D-Speaker Dataset
  - Dataset introduction and download address: 3D-Speaker
  - Related paper address: 3D-Speaker
What's new 🔥
- [2023.8] Releasing CAM++, ERes2Net-Base and ERes2Net-Large benchmarks on CN-Celeb.
- [2023.8] Releasing ERes2Net and CAM++ in language identification for Mandarin and English.
- [2023.7] Releasing CAM++, ERes2Net-Base, ERes2Net-Large pretrained models trained on 3D-Speaker.
- [2023.7] Releasing Dialogue Detection and Semantic Speaker Change Detection in speaker diarization.
- [2023.7] Releasing CAM++ in language identification for Mandarin and English.
- [2023.6] Releasing 3D-Speaker dataset and its corresponding benchmarks including ERes2Net, CAM++ and RDINO.
- [2023.5] ERes2Net pretrained model released, trained on a Mandarin dataset of 200k labeled speakers.
- [2023.4] CAM++ pretrained model released, trained on a Mandarin dataset of 200k labeled speakers.
To be expected 🔥
- [2023.9] Releasing score normalization and large-margin finetune recipes in speaker verification.
- [2023.9] Releasing ECAPA model training and inference recipes for three datasets.
- [2023.9] Releasing RDINO model training and inference recipes for CN-Celeb.
Contact
If you have any comments or questions about 3D-Speaker, please contact us by
- email: {chenyafeng.cyf, zsq174630, tongmu.wh, shuli.cly}@alibaba-inc.com
License
3D-Speaker is released under the Apache License 2.0.
Acknowledgements
3D-Speaker contains third-party components and code modified from some open-source repos, including:
Speechbrain, Wespeaker, D-TDNN, DINO, Vicreg
Citations
If you find this repository useful, please consider giving it a star ⭐ and citing:
@inproceedings{zheng20233d,
title={3D-Speaker: A Large-Scale Multi-Device, Multi-Distance, and Multi-Dialect Corpus for Speech Representation Disentanglement},
author={Zheng, Siqi and Cheng, Luyao and Chen, Yafeng and Wang, Hui and Chen, Qian},
url={https://arxiv.org/pdf/2306.15354.pdf},
year={2023}
}
@inproceedings{chen2023ensemble,
title={SELF-DISTILLATION NETWORK WITH ENSEMBLE PROTOTYPES: LEARNING ROBUST SPEAKER REPRESENTATIONS WITHOUT SUPERVISION},
author={Chen, Yafeng and Zheng, Siqi and Wang, Hui and Cheng, Luyao and Chen, Qian and Zhang, Shiliang},
url={https://arxiv.org/pdf/2308.02774.pdf},
year={2023}
}
@inproceedings{wang2023cam++,
title={CAM++: A Fast and Efficient Network For Speaker Verification Using Context-Aware Masking},
author={Wang, Hui and Zheng, Siqi and Chen, Yafeng and Cheng, Luyao and Chen, Qian},
year={2023},
booktitle={INTERSPEECH}
}
@inproceedings{chen2023enhanced,
title={An Enhanced Res2Net with Local and Global Feature Fusion for Speaker Verification},
author={Chen, Yafeng and Zheng, Siqi and Wang, Hui and Cheng, Luyao and Chen, Qian and Qi, Jiajun},
year={2023},
booktitle={INTERSPEECH}
}
@inproceedings{chen2023pushing,
title={Pushing the limits of self-supervised speaker verification using regularized distillation framework},
author={Chen, Yafeng and Zheng, Siqi and Wang, Hui and Cheng, Luyao and Chen, Qian},
booktitle={ICASSP 2023},
year={2023}
}