Cutting-edge technology built on work from three AI giants:
OpenAI's Whisper, trained on 680,000 hours of multilingual audio
NVIDIA's BigVGAN, anti-aliasing for speech generation
Microsoft's adapter, efficient fine-tuning
Uses a pre-trained model for fine-tuning
Dataset preparation
Necessary pre-processing:
Then put the dataset into the dataset_raw directory using the following file structure:
dataset_raw
├───speaker0
│   ├───000001.wav
│   ├───...
│   └───000xxx.wav
└───speaker1
    ├───000001.wav
    ├───...
    └───000xxx.wav
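Before running any preprocessing, it can help to confirm that dataset_raw really follows this layout. The following is a minimal sketch, not part of the project: the function name `check_dataset_raw` is hypothetical, and it assumes that every entry directly under the root should be a speaker directory containing at least one .wav file.

```python
from pathlib import Path


def check_dataset_raw(root):
    """Return a dict mapping each speaker directory name to its .wav count.

    Raises ValueError if a stray file sits directly under the root or if a
    speaker directory contains no .wav files (both are assumptions about
    what a valid layout looks like).
    """
    root = Path(root)
    counts = {}
    for entry in sorted(root.iterdir()):
        if not entry.is_dir():
            raise ValueError(f"unexpected file at top level: {entry.name}")
        wavs = [p for p in entry.iterdir() if p.suffix.lower() == ".wav"]
        if not wavs:
            raise ValueError(f"speaker directory has no .wav files: {entry.name}")
        counts[entry.name] = len(wavs)
    return counts
```

Calling `check_dataset_raw("dataset_raw")` from the repository root returns the per-speaker file counts, which is a quick way to spot an empty or misplaced speaker folder.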
Install dependencies
- 1. Install the software dependencies:
sudo apt update && sudo apt install ffmpeg
pip install -r requirements.txt
- 2. Download the timbre encoder (Speaker-Encoder by @mueller91) and put best_model.pth.tar into speaker_pretrain/
- 3. Download the Whisper multilingual medium model (make sure you download medium.pt) and put it into whisper_pretrain/
Tip: Whisper is bundled with this project. Do not install it separately, or it will conflict and raise an error.
- 4. Download the pre-trained model maxgan_pretrain_32K.pth and test it:
python svc_inference.py --config configs/maxgan.yaml --model maxgan_pretrain_32K.pth --spk ./configs/singers/singer0001.npy --wave test.wav
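A missing model file is a common reason for the test command to fail, so a quick existence check can save time. This is a rough sketch rather than project code: the helper name `missing_pretrained_files` is hypothetical, and placing maxgan_pretrain_32K.pth in the repository root is an assumption based on how the test command references it.

```python
from pathlib import Path

# Locations taken from the install steps, relative to the repository root.
# The root location of maxgan_pretrain_32K.pth is an assumption.
REQUIRED_FILES = [
    "speaker_pretrain/best_model.pth.tar",
    "whisper_pretrain/medium.pt",
    "maxgan_pretrain_32K.pth",
]


def missing_pretrained_files(repo_root="."):
    """Return the required files that are not present under repo_root."""
    root = Path(repo_root)
    return [rel for rel in REQUIRED_FILES if not (root / rel).is_file()]
```

An empty list from `missing_pretrained_files()` means all three downloads are where the later steps expect them.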
Data preprocessing
- 0. To run all of the preprocessing steps below automatically, use this command:
python3 prepare/easyprocess.py
- 1. Set the working directory:
export PYTHONPATH=$PWD
- 2. Resample the audio
Generate audio with a sampling rate of 16000 Hz in ./data_svc/waves-16k:
python prepare/preprocess_a.py -w ./dataset_raw -o ./data_svc/waves-16k -s 16000
Generate audio with a sampling rate of 32000 Hz in ./data_svc/waves-32k:
python prepare/preprocess_a.py -w ./dataset_raw -o ./data_svc/waves-32k -s 32000
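To confirm that the resampled files actually carry the intended rate, the WAV headers can be inspected with Python's standard `wave` module. The sketch below is illustrative only: `wrong_sample_rate_files` is a hypothetical name, and it assumes the output files are uncompressed PCM WAV, since the `wave` module cannot read other encodings.

```python
import wave
from pathlib import Path


def wrong_sample_rate_files(directory, expected_rate):
    """Return .wav files under directory whose header rate differs from expected_rate."""
    bad = []
    for path in sorted(Path(directory).rglob("*.wav")):
        # wave.open only reads the header here; no audio data is decoded.
        with wave.open(str(path), "rb") as wav:
            if wav.getframerate() != expected_rate:
                bad.append(path)
    return bad
```

For example, `wrong_sample_rate_files("data_svc/waves-16k", 16000)` and `wrong_sample_rate_files("data_svc/waves-32k", 32000)` should both return empty lists after this step.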
- 3. Use the 16 kHz audio to extract pitch. f0_ceil is set to 900 and needs to be adjusted according to the highest pitch in your data:
python prepare/preprocess_f0.py -w data_svc/waves-16k/ -p data_svc/pitch
Or use the CREPE-based script for low-quality audio:
python prepare/preprocess_f0_crepe.py -w data_svc/waves-16k/ -p data_svc/pitch
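To judge whether f0_ceil=900 is high enough, the highest note in the data can be converted to Hz with the standard equal-temperament formula (A4 = MIDI note 69 = 440 Hz). The helpers below are a small sketch with hypothetical names; they only do the arithmetic and do not read or modify the project's settings.

```python
def midi_to_hz(midi_note):
    """Frequency in Hz of a MIDI note number (A4 = 69 = 440 Hz, equal temperament)."""
    return 440.0 * 2.0 ** ((midi_note - 69) / 12.0)


def fits_under_ceiling(highest_midi_note, f0_ceil=900.0):
    """True if the highest note in the data lies at or below the F0 ceiling."""
    return midi_to_hz(highest_midi_note) <= f0_ceil
```

With a ceiling of 900 Hz, A5 (MIDI 81, 880 Hz) still fits, while A#5 (MIDI 82, about 932 Hz) does not, so material that reaches A#5 or higher would call for a larger f0_ceil.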
- 4. Use the 16 kHz audio to extract the PPG:
python prepare/preprocess_ppg.py -w data_svc/waves-16k/ -p data_svc/whisper
- 5. Use the 16 kHz audio to extract the timbre code:
python prepare/preprocess_speaker.py data_svc/waves-16k/ data_svc/speaker
- 6. Extract the average timbre code for inference. It can also replace the per-file timbre codes when generating the training index, serving as the speaker's unified timbre for training:
python prepare/preprocess_speaker_ave.py data_svc/speaker/ data_svc/singer
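Conceptually, an average timbre code is the element-wise mean of a speaker's per-file embedding vectors. The sketch below illustrates that idea with plain Python lists standing in for the arrays stored in the .spk.npy files; whether preprocess_speaker_ave.py computes exactly this arithmetic mean is an assumption, and `average_timbre` is a hypothetical name.

```python
def average_timbre(vectors):
    """Element-wise mean of equally sized embedding vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same length")
    count = len(vectors)
    # Average each dimension across all vectors.
    return [sum(v[i] for v in vectors) / count for i in range(dim)]
```

Averaging smooths out file-to-file variation, which is why a single mean vector per speaker can serve as a stable timbre reference at inference time.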
- 7. Use the 32 kHz audio to generate the training index:
python prepare/preprocess_train.py
- 8. Debug the training files:
python prepare/preprocess_zzz.py -c configs/maxgan.yaml
data_svc/
├── waves-16k
│   ├── speaker0
│   │   ├── 000001.wav
│   │   └── 000xxx.wav
│   └── speaker1
│       ├── 000001.wav
│       └── 000xxx.wav
├── waves-32k
│   ├── speaker0
│   │   ├── 000001.wav
│   │   └── 000xxx.wav
│   └── speaker1
│       ├── 000001.wav
│       └── 000xxx.wav
├── pitch
│   ├── speaker0
│   │   ├── 000001.pit.npy
│   │   └── 000xxx.pit.npy
│   └── speaker1
│       ├── 000001.pit.npy
│       └── 000xxx.pit.npy
├── whisper
│   ├── speaker0
│   │   ├── 000001.ppg.npy
│   │   └── 000xxx.ppg.npy
│   └── speaker1
│       ├── 000001.ppg.npy
│       └── 000xxx.ppg.npy
├── speaker
│   ├── speaker0
│   │   ├── 000001.spk.npy
│   │   └── 000xxx.spk.npy
│   └── speaker1
│       ├── 000001.spk.npy
│       └── 000xxx.spk.npy
└── singer
    ├── speaker0.spk.npy
    └── speaker1.spk.npy
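The directory layout above implies that every 32 kHz wav should have a matching pitch, PPG, and timbre file. A short script can check this before training starts. This is a sketch under that assumption only; `missing_features` is a hypothetical helper, and the folder names and suffixes are copied from the tree.

```python
from pathlib import Path

# Feature directory -> file suffix, as shown in the data_svc tree.
FEATURES = {"pitch": ".pit.npy", "whisper": ".ppg.npy", "speaker": ".spk.npy"}


def missing_features(data_root="data_svc"):
    """List feature files that are absent for any wav in waves-32k."""
    root = Path(data_root)
    missing = []
    for wav in sorted((root / "waves-32k").glob("*/*.wav")):
        speaker, stem = wav.parent.name, wav.stem
        for folder, suffix in FEATURES.items():
            expected = root / folder / speaker / (stem + suffix)
            if not expected.is_file():
                missing.append(expected.relative_to(root).as_posix())
    return missing
```

An empty result from `missing_features()` suggests that steps 3 to 5 covered every file; any listed path points at the step that needs to be rerun.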
Train
- 0. If you are fine-tuning from the pre-trained model, download it first: maxgan_pretrain_32K.pth
Set pretrain: "./maxgan_pretrain_32K.pth" in configs/maxgan.yaml, and adjust the learning rate appropriately, e.g. 1e-5
- 1. Set the working directory:
export PYTHONPATH=$PWD
- 2. Start training:
python svc_trainer.py -c configs/maxgan.yaml -n svc
- 3. Resume training:
python svc_trainer.py -c configs/maxgan.yaml -n svc -p chkpt/svc/***.pth
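When resuming, the `***` placeholder has to be replaced with a real checkpoint name. One simple way to find the newest file is by modification time, sketched below. The helper name `latest_checkpoint` is hypothetical, and picking by modification time is an assumption made here because the checkpoint naming scheme is not described in this document.

```python
from pathlib import Path


def latest_checkpoint(directory="chkpt/svc", pattern="*.pth"):
    """Return the most recently modified checkpoint in directory, or None."""
    candidates = list(Path(directory).glob(pattern))
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

The returned path can then be passed to the `-p` option of the resume command.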
- 4. View the logs:
tensorboard --logdir logs/
Inference
- 0. To use a GUI that runs all of the commands below, use this command:
python3 svcgui.py
- 1. Set the working directory:
export PYTHONPATH=$PWD
- 2. Export the inference model:
python svc_export.py --config configs/maxgan.yaml --checkpoint_path chkpt/svc/***.pt
- 3. Use Whisper to extract the content encoding as a separate step, instead of one-click inference, to reduce GPU memory usage:
python whisper/inference.py -w test.wav -p test.ppg.npy
This generates test.ppg.npy. If no PPG file is specified in the next step, it is generated automatically.
- 4. Extract the F0 parameters to CSV text format. Open the CSV file in Excel and manually correct any wrong F0 values, using Audition or Sonic Visualiser as a reference:
python pitch/inference.py -w test.wav -p test.csv
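Typical F0 extraction errors are octave jumps, where a frame suddenly doubles or halves relative to its neighbours. The sketch below flags such frames so they are easier to find before manual correction. It is an illustration only: `find_octave_jumps` is a hypothetical name, the input is a plain list of F0 values because the CSV column layout is not described here, treating values of 0 or below as unvoiced is an assumption, and the 1.8 ratio threshold is an arbitrary choice.

```python
def find_octave_jumps(f0_values, ratio=1.8):
    """Return indices of voiced frames whose F0 changes by at least `ratio`
    relative to the previous voiced frame.

    Frames with F0 <= 0 are treated as unvoiced and skipped (an assumption).
    """
    suspects = []
    previous = None
    for index, f0 in enumerate(f0_values):
        if f0 <= 0:
            continue
        if previous is not None:
            change = max(f0, previous) / min(f0, previous)
            if change >= ratio:
                suspects.append(index)
        previous = f0
    return suspects
```

Note that both the jump away from the melody and the jump back are reported, so an isolated bad frame shows up as two consecutive indices.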
- 5. Specify the parameters and run inference:
python svc_inference.py --config configs/maxgan.yaml --model maxgan_g.pth --spk ./data_svc/singer/your_singer.spk.npy --wave test.wav --ppg test.ppg.npy --pit test.csv
When --ppg is specified, repeated inference on the same audio avoids re-extracting the content encoding; if it is not specified, it is extracted automatically.
When --pit is specified, the manually tuned F0 parameters are loaded; if it is not specified, F0 is extracted automatically.
The output is written to svc_out.wav in the current directory.
| args | --config | --model | --spk | --wave | --ppg | --pit | --shift |
| --- | --- | --- | --- | --- | --- | --- | --- |
| name | config path | model path | speaker | wave input | wave ppg | wave pitch | pitch shift |
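Because --ppg and --pit are optional, it can be convenient to assemble the inference command programmatically. The following is a minimal sketch with a hypothetical helper name, `build_inference_command`; it only uses flags listed above and leaves out --shift, since the unit of its value is not stated here.

```python
def build_inference_command(wave, spk, model="maxgan_g.pth",
                            config="configs/maxgan.yaml", ppg=None, pit=None):
    """Build the argument list for svc_inference.py.

    --ppg and --pit are appended only when given, so the script falls back
    to automatic extraction otherwise.
    """
    cmd = ["python", "svc_inference.py", "--config", config,
           "--model", model, "--spk", spk, "--wave", wave]
    if ppg is not None:
        cmd += ["--ppg", ppg]
    if pit is not None:
        cmd += ["--pit", pit]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root, for example with `spk="./data_svc/singer/speaker0.spk.npy"` and `wave="test.wav"`.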
Source of code and References
Adapter-Based Extension of Multi-Speaker Text-to-Speech Model for New Speakers
AdaSpeech: Adaptive Text to Speech for Custom Voice
https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts/tree/master/project/01-nsf
https://github.com/mindslab-ai/univnet
https://github.com/openai/whisper/
https://github.com/NVIDIA/BigVGAN