  • Language
    Python
  • License
    MIT License
  • Created about 2 years ago
  • Updated over 1 year ago

Repository Details

Singing voice conversion based on whisper, with LoRA for singing voice cloning

Singing Voice Conversion based on Whisper & neural source-filter BigVGAN

Cutting-edge technology built on work from three giants of artificial intelligence:

OpenAI's whisper, trained on 680,000 hours of multilingual audio

Nvidia's bigvgan, anti-aliasing for speech generation

Microsoft's adapter, for efficient fine-tuning

Use the pretrained model for fine-tuning


Dataset preparation

Necessary pre-processing:

  • 1 Separate the vocals from the accompaniment, for example with UVR
  • 2 Cut the audio into clips shorter than 30 seconds for whisper, for example with slicer

Then put the dataset into the dataset_raw directory using the following file structure:

dataset_raw
├───speaker0
│   ├───000001.wav
│   ├───...
│   └───000xxx.wav
└───speaker1
    ├───000001.wav
    ├───...
    └───000xxx.wav
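
The 30-second limit above can be checked before preprocessing. The following is a minimal sketch, not part of the repository: the function names are hypothetical, and it assumes the clips are uncompressed PCM WAV files, which is what Python's standard wave module can read.

```python
import wave
from pathlib import Path

MAX_SECONDS = 30.0  # clip length limit stated above for whisper


def wav_duration(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def find_long_clips(root="dataset_raw", max_seconds=MAX_SECONDS):
    """List (speaker, file name, duration) for clips at or above the limit.

    Assumes the dataset_raw/<speaker>/<clip>.wav layout shown above.
    """
    too_long = []
    for wav_path in sorted(Path(root).glob("*/*.wav")):
        duration = wav_duration(wav_path)
        if duration >= max_seconds:
            too_long.append((wav_path.parent.name, wav_path.name, duration))
    return too_long
```

Calling find_long_clips() from the repository root returns the clips that still need slicing; an empty list means every clip is under the limit.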

Install dependencies

  • 1 software dependency

    sudo apt update && sudo apt install ffmpeg

    pip install -r requirements.txt

  • 2 Download the timbre encoder, Speaker-Encoder by @mueller91, and put best_model.pth.tar into speaker_pretrain/

  • 3 Download the whisper multilingual medium model; make sure to download medium.pt and put it into whisper_pretrain/

    Tip: whisper is bundled with this repository; do not install it separately, or it will conflict and raise an error

  • 4 Download the pretrained model maxgan_pretrain_32K.pth and run a test

    python svc_inference.py --config configs/maxgan.yaml --model maxgan_pretrain_32K.pth --spk ./configs/singers/singer0001.npy --wave test.wav
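
Before running the test command, it may help to confirm that the downloaded files are where the steps above expect them. This is a minimal sketch: the function name is hypothetical, and placing maxgan_pretrain_32K.pth in the repository root is an assumption inferred from the test command.

```python
from pathlib import Path

# Paths taken from the steps above. The location of the pretrained
# generator (repository root) is an assumption based on the test command.
REQUIRED_FILES = [
    "speaker_pretrain/best_model.pth.tar",
    "whisper_pretrain/medium.pt",
    "maxgan_pretrain_32K.pth",
]


def missing_files(root="."):
    """Return the required files that are not present under root."""
    root = Path(root)
    return [rel for rel in REQUIRED_FILES if not (root / rel).is_file()]
```

An empty list from missing_files() suggests the downloads are in place; any returned path still needs to be downloaded or moved.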

Data preprocessing

  • 0, use this command if you want to automate this:

    python3 prepare/easyprocess.py

  • 1, set working directory:

    export PYTHONPATH=$PWD

  • 2, re-sampling

    generate audio with a sampling rate of 16000 Hz in ./data_svc/waves-16k

    python prepare/preprocess_a.py -w ./dataset_raw -o ./data_svc/waves-16k -s 16000

    generate audio with a sampling rate of 32000 Hz in ./data_svc/waves-32k

    python prepare/preprocess_a.py -w ./dataset_raw -o ./data_svc/waves-32k -s 32000

  • 3, use 16k audio to extract pitch; f0_ceil is set to 900 and should be adjusted to the highest pitch in your data

    python prepare/preprocess_f0.py -w data_svc/waves-16k/ -p data_svc/pitch

    or, for low-quality audio, use the crepe-based script instead

    python prepare/preprocess_f0_crepe.py -w data_svc/waves-16k/ -p data_svc/pitch

  • 4, use 16K audio to extract ppg

    python prepare/preprocess_ppg.py -w data_svc/waves-16k/ -p data_svc/whisper

  • 5, use 16k audio to extract timbre code

    python prepare/preprocess_speaker.py data_svc/waves-16k/ data_svc/speaker

  • 6, average the timbre codes of each speaker for inference; the average can also replace the per-clip timbre code when generating the training index, so that each speaker has one unified timbre for training

    python prepare/preprocess_speaker_ave.py data_svc/speaker/ data_svc/singer

  • 7, use 32k audio to generate training index

    python prepare/preprocess_train.py

  • 8, training file debugging

    python prepare/preprocess_zzz.py -c configs/maxgan.yaml

data_svc/
├── waves-16k
│    ├── speaker0
│    │      ├── 000001.wav
│    │      └── 000xxx.wav
│    └── speaker1
│           ├── 000001.wav
│           └── 000xxx.wav
├── waves-32k
│    ├── speaker0
│    │      ├── 000001.wav
│    │      └── 000xxx.wav
│    └── speaker1
│           ├── 000001.wav
│           └── 000xxx.wav
├── pitch
│    ├── speaker0
│    │      ├── 000001.pit.npy
│    │      └── 000xxx.pit.npy
│    └── speaker1
│           ├── 000001.pit.npy
│           └── 000xxx.pit.npy
├── whisper
│    ├── speaker0
│    │      ├── 000001.ppg.npy
│    │      └── 000xxx.ppg.npy
│    └── speaker1
│           ├── 000001.ppg.npy
│           └── 000xxx.ppg.npy
├── speaker
│    ├── speaker0
│    │      ├── 000001.spk.npy
│    │      └── 000xxx.spk.npy
│    └── speaker1
│           ├── 000001.spk.npy
│           └── 000xxx.spk.npy
└── singer
    ├── speaker0.spk.npy
    └── speaker1.spk.npy
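
Steps 2 to 8 above can also be chained from one small driver. The sketch below only replays the commands listed in this section, in order, with PYTHONPATH set to the working directory as in step 1. The function names are hypothetical, and this is not a description of what prepare/easyprocess.py does internally.

```python
import os
import subprocess
import sys


def build_steps(use_crepe=False):
    """Return the preprocessing commands from steps 2 to 8, in order."""
    f0_script = (
        "prepare/preprocess_f0_crepe.py" if use_crepe else "prepare/preprocess_f0.py"
    )
    return [
        ["prepare/preprocess_a.py", "-w", "./dataset_raw", "-o", "./data_svc/waves-16k", "-s", "16000"],
        ["prepare/preprocess_a.py", "-w", "./dataset_raw", "-o", "./data_svc/waves-32k", "-s", "32000"],
        [f0_script, "-w", "data_svc/waves-16k/", "-p", "data_svc/pitch"],
        ["prepare/preprocess_ppg.py", "-w", "data_svc/waves-16k/", "-p", "data_svc/whisper"],
        ["prepare/preprocess_speaker.py", "data_svc/waves-16k/", "data_svc/speaker"],
        ["prepare/preprocess_speaker_ave.py", "data_svc/speaker/", "data_svc/singer"],
        ["prepare/preprocess_train.py"],
        ["prepare/preprocess_zzz.py", "-c", "configs/maxgan.yaml"],
    ]


def run_all(use_crepe=False):
    """Run every step and stop at the first one that fails."""
    env = dict(os.environ, PYTHONPATH=os.getcwd())  # same effect as step 1
    for step in build_steps(use_crepe):
        subprocess.run([sys.executable, *step], check=True, env=env)
```

Calling run_all() from the repository root runs the steps in sequence; run_all(use_crepe=True) swaps in the crepe-based pitch script for low-quality audio.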

Train

  • 0, if fine-tuning based on the pre-trained model, you need to download the pre-trained model: maxgan_pretrain_32K.pth

    set pretrain: "./maxgan_pretrain_32K.pth" in configs/maxgan.yaml, and adjust the learning rate appropriately, e.g. 1e-5

  • 1, set working directory

    export PYTHONPATH=$PWD

  • 2, start training

    python svc_trainer.py -c configs/maxgan.yaml -n svc

  • 3, resume training

    python svc_trainer.py -c configs/maxgan.yaml -n svc -p chkpt/svc/***.pth

  • 4, view log

    tensorboard --logdir logs/
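
To resume training, a checkpoint from chkpt/svc/ has to be chosen by hand. As a minimal sketch, the helper below picks the most recently modified .pth file and builds the resume command from step 3. The function names are hypothetical, and selecting by modification time is an assumption, since the checkpoint naming scheme is not documented here.

```python
from pathlib import Path


def latest_checkpoint(chkpt_dir="chkpt/svc"):
    """Return the most recently modified .pth file, or None if there is none."""
    chkpt_dir = Path(chkpt_dir)
    if not chkpt_dir.is_dir():
        return None
    candidates = list(chkpt_dir.glob("*.pth"))
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


def resume_command(config="configs/maxgan.yaml", name="svc", chkpt_root="chkpt"):
    """Build the svc_trainer.py command, adding -p when a checkpoint exists."""
    cmd = ["python", "svc_trainer.py", "-c", config, "-n", name]
    ckpt = latest_checkpoint(Path(chkpt_root) / name)
    if ckpt is not None:
        cmd += ["-p", str(ckpt)]
    return cmd
```

Printing " ".join(resume_command()) shows the command to run; without any checkpoint it falls back to the plain training command from step 2.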

Inference

  • 0, use this command if you want a GUI that does all the commands below:

    python3 svcgui.py

  • 1, set working directory

    export PYTHONPATH=$PWD

  • 2, export inference model

    python svc_export.py --config configs/maxgan.yaml --checkpoint_path chkpt/svc/***.pt

  • 3, use whisper to extract the content encoding as a separate step, instead of one-click inference, to reduce GPU memory usage

    python whisper/inference.py -w test.wav -p test.ppg.npy

    this generates test.ppg.npy; if no ppg file is specified in the next step, it is generated automatically

  • 4, extract the F0 parameter to a csv text file, open the csv file in Excel, and manually correct wrong F0 values by checking them against Audition or Sonic Visualiser

    python pitch/inference.py -w test.wav -p test.csv

  • 5, specify parameters and infer

    python svc_inference.py --config configs/maxgan.yaml --model maxgan_g.pth --spk ./data_svc/singer/your_singer.spk.npy --wave test.wav --ppg test.ppg.npy --pit test.csv

    when --ppg is specified, the audio content codes are not re-extracted each time the same audio is inferred; if it is not specified, they are extracted automatically;

    when --pit is specified, the manually tuned F0 parameter is loaded; if it is not specified, F0 is extracted automatically;

    the output file svc_out.wav is generated in the current directory

    args   --config      --model      --spk     --wave       --ppg      --pit        --shift
    name   config path   model path   speaker   wave input   wave ppg   wave pitch   pitch shift
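
The unit of --shift is not stated here. Assuming it is measured in semitones, which is the usual convention for pitch shifting, each semitone scales F0 by a factor of 2^(1/12). The sketch below illustrates that arithmetic only; the function names are hypothetical, treating 0 as an unvoiced frame is an assumption, and this is not the code used by svc_inference.py.

```python
def semitone_ratio(semitones):
    """Frequency ratio for a shift of the given number of semitones."""
    return 2.0 ** (semitones / 12.0)


def shift_f0(f0_values, semitones):
    """Scale voiced F0 values in Hz; 0 is assumed to mark unvoiced frames."""
    ratio = semitone_ratio(semitones)
    return [f0 * ratio if f0 > 0 else 0.0 for f0 in f0_values]


# One octave up (12 semitones) doubles the frequency: 220 Hz becomes 440 Hz.
print(shift_f0([0.0, 220.0], 12))
```

Under this assumption, a positive shift raises the pitch and a negative one lowers it; large upward shifts can push F0 above the f0_ceil value used during pitch extraction.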

Source of code and References

Adapter-Based Extension of Multi-Speaker Text-to-Speech Model for New Speakers

AdaSpeech: Adaptive Text to Speech for Custom Voice

https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts/tree/master/project/01-nsf

https://github.com/mindslab-ai/univnet [paper]

https://github.com/openai/whisper/ [paper]

https://github.com/NVIDIA/BigVGAN [paper]
