Learning the Beauty in Songs: Neural Singing Voice Beautifier
This repository is the official PyTorch implementation of our ACL 2022 paper.
0. Dataset (PopBuTFy) Acquisition
Audio samples
- See apply_form.
- Dataset preview.
Text labels
NeuralSVB does not take text as input, but the ASR model used to extract PPGs does. Thus we also provide the text labels of PopBuTFy.
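For intuition, a PPG (phonetic posteriorgram) is a frame-level posterior distribution over phoneme classes produced by an ASR acoustic model. The sketch below is illustrative only (not the repo's extractor); random logits stand in for real ASR output, and `logits_to_ppg` is a hypothetical helper:

```python
import numpy as np

def logits_to_ppg(logits):
    """Softmax per-frame ASR logits of shape (T, n_phones) into a PPG,
    i.e. one probability distribution over phoneme classes per frame."""
    e = np.exp(logits - logits.max(axis=1, keepdims=True))  # stable softmax
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
logits = rng.standard_normal((100, 40))  # 100 frames, 40 phoneme classes
ppg = logits_to_ppg(logits)              # each row sums to 1
```

In the actual pipeline, the logits come from the pre-trained ASR model trained with the provided text labels.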
1. Preparation
Environment Preparation
WIP.
Data Preparation
- Extract embeddings of vocal timbre:
CUDA_VISIBLE_DEVICES=0 python data_gen/tts/bin/binarize.py --config egs/datasets/audio/PopBuTFy/save_emb.yaml
- Pack the dataset:
CUDA_VISIBLE_DEVICES=0 python data_gen/tts/bin/binarize.py --config egs/datasets/audio/PopBuTFy/para_bin.yaml
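The first step above produces an utterance-level vocal-timbre embedding. A toy sketch of that idea, assuming simple mean pooling of frame-level encoder features (the repo's actual speaker encoder may pool differently):

```python
import numpy as np

def mean_pool_embedding(frame_features):
    """Pool frame-level features of shape (T, D) into one
    L2-normalized utterance-level embedding of shape (D,)."""
    emb = frame_features.mean(axis=0)
    return emb / np.linalg.norm(emb)

rng = np.random.default_rng(0)
frames = rng.standard_normal((200, 256))  # 200 frames of 256-dim features
emb = mean_pool_embedding(frames)
```

The binarization config (`save_emb.yaml`) handles this for the whole dataset; the snippet only shows the pooling idea on fake features.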
Vocoder Preparation
We provide a pre-trained HifiGAN-Singing model, which is specially designed for SVS with the NSF mechanism.
Please unzip the pre-trained vocoder into the checkpoints directory before training your acoustic model.
This singing vocoder is trained on 100+ hours of singing data (including Chinese and English songs).
PPG Extractor Preparation
We provide a pre-trained PPG extractor.
Please unzip the pre-trained PPG extractor into the checkpoints directory before training your acoustic model.
After following the instructions above, the directory structure should be as follows:
.
|--data
|--processed
|--PopBuTFy (unzip PopBuTFy.zip)
|--data
|--directories containing wavs
|--binary
|--PopBuTFyENSpkEM
|--checkpoints
|--1009_pretrain_asr_english
|--
|--config.yaml
|--1012_hifigan_all_songs_nsf
|--
|--config.yaml
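Before launching training, it can help to confirm this layout is in place. The following is an illustrative sanity check (not part of the repo); the paths are copied from the tree above:

```python
from pathlib import Path

# Expected paths, taken from the directory tree in this README.
EXPECTED = [
    "data/processed/PopBuTFy/data",
    "data/binary/PopBuTFyENSpkEM",
    "checkpoints/1009_pretrain_asr_english/config.yaml",
    "checkpoints/1012_hifigan_all_songs_nsf/config.yaml",
]

def missing_paths(root="."):
    """Return the expected paths that do not exist under `root`."""
    return [p for p in EXPECTED if not (Path(root) / p).exists()]

for p in missing_paths():
    print("missing:", p)
```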
2. Training Example
CUDA_VISIBLE_DEVICES=0,1 python tasks/run.py --config egs/datasets/audio/PopBuTFy/vae_global_mle_eng.yaml --exp_name exp_name --reset
3. Inference
Inference from packed test set
CUDA_VISIBLE_DEVICES=0,1 python tasks/run.py --config egs/datasets/audio/PopBuTFy/vae_global_mle_eng.yaml --exp_name exp_name --reset --infer
Inference results will be saved in ./checkpoints/EXP_NAME/generated_ by default.
We will also provide:
- the pre-trained model of NSVB (WIP).
Remember to put the pre-trained models in the checkpoints directory.
Inference from raw inputs
WIP.
Limitations
See Appendix D "Limitations and Solutions" in our paper.
Citation
If this repository helps your research, please cite:
@inproceedings{liu-etal-2022-learning-beauty,
title = "Learning the Beauty in Songs: Neural Singing Voice Beautifier",
author = "Liu, Jinglin and
Li, Chengxi and
Ren, Yi and
Zhu, Zhiying and
Zhao, Zhou",
booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = may,
year = "2022",
address = "Dublin, Ireland",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.acl-long.549",
pages = "7970--7983",
}
Issues
- Before raising an issue, please check our README and existing issues for possible solutions.
- We will try to address your problem promptly, but we cannot guarantee a satisfying solution.
- Please be friendly.
Acknowledgements
- r9y9's wavenet_vocoder
- Po-Hsun-Su's ssim
- descriptinc's melgan
- Official espnet
- Official PyTorch Lightning
The framework of this repository is based on DiffSinger and is a predecessor of NATSpeech.