• Stars: 165
• Rank: 227,548 (Top 5%)
• Language: Python
• License: MIT License
• Created: about 1 year ago
• Updated: 4 months ago



🔊 AudioLDM training, finetuning, inference and evaluation

Prepare the Python environment

# Create conda environment
conda create -n audioldm_train python=3.10
conda activate audioldm_train
# Clone the repo
git clone https://github.com/haoheliu/AudioLDM-training-finetuning.git; cd AudioLDM-training-finetuning
# Install running environment
pip install poetry
poetry install
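After `poetry install` finishes, a quick interpreter check can confirm you are running inside the right environment. A minimal sketch (the 3.10 target mirrors the conda command above; run it with `poetry run python`):

```python
import sys

def version_matches(version=None, expected=(3, 10)):
    """Return True if the interpreter's major.minor matches the env created above."""
    major, minor = (version or sys.version_info)[:2]
    return (major, minor) == expected

if __name__ == "__main__":
    if version_matches():
        print("Python version OK")
    else:
        print("Expected Python 3.10, got", sys.version.split()[0])
```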

Download checkpoints and dataset

  1. Download checkpoints from Google Drive: link. The checkpoints include the pretrained VAE, AudioMAE, CLAP, 16kHz HiFiGAN, and 48kHz HiFiGAN.
  2. Uncompress the checkpoint tar file and place the content into data/checkpoints/
  3. Download the preprocessed AudioCaps from Google Drive: link
  4. Similarly, uncompress the dataset tar file and place the content into data/dataset
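Before running the provided validation script, a quick manual check of the top-level layout can catch a misplaced archive early. A minimal sketch (only the two directories named in the steps above are checked; nothing is assumed about their contents):

```python
from pathlib import Path

def layout_ok(root="."):
    """Return the list of required directories (from steps 2 and 4) that are missing."""
    required = ["data/checkpoints", "data/dataset"]
    return [d for d in required if not (Path(root) / d).is_dir()]

if __name__ == "__main__":
    missing = layout_ok()
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("Top-level layout looks correct; run the validation script to confirm.")
```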

To double-check that the dataset and checkpoints are in place, run the following command:

python3 tests/validate_dataset_checkpoint.py

If the structure is incorrect or partly missing, you will see an error message.

Play around with the code

Train the AudioLDM model

# Train the AudioLDM (latent diffusion part)
python3 audioldm_train/train/latent_diffusion.py -c audioldm_train/config/2023_08_23_reproduce_audioldm/audioldm_original.yaml

# Train the VAE (Optional)
# python3 audioldm_train/train/autoencoder.py -c audioldm_train/config/2023_11_13_vae_autoencoder/16k_64.yaml

The program will perform generation on the evaluation set every 5 epochs of training. After the generated-audio folders (with names starting with val_) appear, you can proceed to the next step, model evaluation.
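Since the evaluation step below operates on those folders, a small sketch like this can list what has been generated so far (the log-directory path passed in is a placeholder assumption; only the val_ prefix comes from the text above):

```python
from pathlib import Path

def find_val_folders(log_dir):
    """Return the names of generated-audio folders (val_*) under a training log directory."""
    return sorted(p.name for p in Path(log_dir).rglob("val_*") if p.is_dir())

if __name__ == "__main__":
    # Hypothetical log location; adjust to wherever your experiment writes its logs.
    for name in find_val_folders("log"):
        print(name)
```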

Finetuning of the pretrained model

You can finetune from either of two pretrained checkpoints. First, download the one you prefer (e.g., using wget):

  1. Medium size AudioLDM: https://zenodo.org/records/7884686/files/audioldm-m-full.ckpt
  2. Small size AudioLDM: https://zenodo.org/records/7884686/files/audioldm-s-full

Place the checkpoint in the data/checkpoints folder.

Then perform finetuning with one of the following commands:

# Medium size AudioLDM
python3 audioldm_train/train/latent_diffusion.py -c audioldm_train/config/2023_08_23_reproduce_audioldm/audioldm_original_medium.yaml --reload_from_ckpt data/checkpoints/audioldm-m-full.ckpt

# Small size AudioLDM
python3 audioldm_train/train/latent_diffusion.py -c audioldm_train/config/2023_08_23_reproduce_audioldm/audioldm_original.yaml --reload_from_ckpt data/checkpoints/audioldm-s-full

You can specify your own dataset following the same format as the provided AudioCaps dataset.

Note that the pretrained AudioLDM checkpoints are under the CC BY-NC 4.0 license, which does not allow commercial use.

Evaluate the model output

Evaluation runs automatically on each folder of generated audio:

# Evaluate all existing generated folder
python3 audioldm_train/eval.py --log_path all

# Evaluate only a specific experiment folder
python3 audioldm_train/eval.py --log_path <path-to-the-experiment-folder>

The evaluation result will be saved in a JSON file at the same level as the audio folder.
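The metric names inside that JSON depend on the evaluation toolbox, so this sketch only locates and prints the result files rather than assuming a schema:

```python
import json
from pathlib import Path

def load_eval_results(folder):
    """Load every JSON result file found directly inside the given folder."""
    results = {}
    for path in sorted(Path(folder).glob("*.json")):
        with open(path) as f:
            results[path.name] = json.load(f)
    return results

if __name__ == "__main__":
    # Hypothetical experiment folder; point this at your own log directory.
    for name, metrics in load_eval_results("log/some_experiment").items():
        print(name, metrics)
```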

Inference with the pretrained model

Use the following syntax:

python3 audioldm_train/infer.py --config_yaml <The-path-to-the-same-config-file-you-use-for-training> --list_inference <the-filelist-you-want-to-generate>

For example:

# Please make sure you have trained the model using audioldm_crossattn_flant5.yaml
# The generated audio will be saved in the same log folder as the pretrained model.
python3 audioldm_train/infer.py --config_yaml audioldm_train/config/2023_08_23_reproduce_audioldm/audioldm_crossattn_flant5.yaml --list_inference tests/captionlist/inference_test.lst

The generated audio files are named after their captions by default. If you would like to specify the filenames instead, see the format of tests/captionlist/inference_test_with_filename.lst.

This repo only supports inference with a model you trained yourself. If you want to use the pretrained models directly, please use these two repos: AudioLDM and AudioLDM2.

Train the model using your own dataset

Super easy, simply follow these steps:

  1. Prepare the metadata with the same format as the provided AudioCaps dataset.
  2. Register the metadata of your dataset in data/dataset/metadata/dataset_root.json
  3. Use your dataset in the YAML file.

You do not need to resample or pre-segment the audio files. The dataloader will do most of the work.
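Step 2 can be scripted. The file path comes from the list above, but the key/value layout written here is a guess for illustration; copy the structure of the provided AudioCaps entry in your copy of the file rather than this sketch:

```python
import json
from pathlib import Path

def register_dataset(metadata_path, name, root):
    """Add (or update) a dataset entry in dataset_root.json.

    The {name: root} mapping used here is hypothetical; mirror the existing
    AudioCaps entry for the real schema.
    """
    path = Path(metadata_path)
    entries = json.loads(path.read_text()) if path.exists() else {}
    entries[name] = root
    path.write_text(json.dumps(entries, indent=2))
    return entries

if __name__ == "__main__":
    register_dataset("data/dataset/metadata/dataset_root.json",
                     "my_dataset", "data/dataset/my_dataset")
```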

Cite this work

If you find this tool useful, please consider citing:

@article{liu2023audioldm,
  title={{AudioLDM}: Text-to-Audio Generation with Latent Diffusion Models},
  author={Liu, Haohe and Chen, Zehua and Yuan, Yi and Mei, Xinhao and Liu, Xubo and Mandic, Danilo and Wang, Wenwu and Plumbley, Mark D},
  journal={Proceedings of the International Conference on Machine Learning},
  year={2023}
}
@article{liu2023audioldm2,
  title={{AudioLDM 2}: Learning Holistic Audio Generation with Self-supervised Pretraining},
  author={Haohe Liu and Qiao Tian and Yi Yuan and Xubo Liu and Xinhao Mei and Qiuqiang Kong and Yuping Wang and Wenwu Wang and Yuxuan Wang and Mark D. Plumbley},
  journal={arXiv preprint arXiv:2308.05734},
  year={2023}
}

Acknowledgement

We greatly appreciate the open-sourcing of the following code bases. Open-source code bases are the real-world infinity stones 💎!

This research was partly supported by the British Broadcasting Corporation Research and Development, Engineering and Physical Sciences Research Council (EPSRC) Grant EP/T019751/1 "AI for Sound", and a PhD scholarship from the Centre for Vision, Speech and Signal Processing (CVSSP), Faculty of Engineering and Physical Science (FEPS), University of Surrey. For the purpose of open access, the authors have applied a Creative Commons Attribution (CC BY) license to any Author Accepted Manuscript version arising. We would like to thank Tang Li, Ke Chen, Yusong Wu, Zehua Chen and Jinhua Liang for their support and discussions.

More Repositories

1. AudioLDM - AudioLDM: Generate speech, sound effects, music and beyond, with text. (Python, 2,310 stars)
2. AudioLDM2 - Text-to-Audio/Music Generation (Python, 2,187 stars)
3. versatile_audio_super_resolution - Versatile audio super resolution (any -> 48kHz) with AudioSR. (Python, 963 stars)
4. voicefixer - General Speech Restoration (Python, 952 stars)
5. audioldm_eval - This toolbox aims to unify audio generation model evaluation for easier comparison. (Python, 275 stars)
6. voicefixer_main - General Speech Restoration (Python, 271 stars)
7. ssr_eval - Evaluation and Benchmarking of Speech Super-resolution Methods (Python, 129 stars)
8. 2021-ISMIR-MSS-Challenge-CWS-PResUNet - Music Source Separation; train, eval, and inference pipelines and pretrained models we used for the 2021 ISMIR MDX Challenge. (Python, 113 stars)
9. SemantiCodec-inference - Ultra-low-bitrate neural audio codec (0.31~1.40 kbps) with better semantics in the latent space. (Python, 111 stars)
10. Subband-Music-Separation - PyTorch: Channel-wise subband (CWS) input for better voice and accompaniment separation (Python, 89 stars)
11. torchsubband - PyTorch implementation of subband decomposition (HTML, 78 stars)
12. SemantiCodec - (HTML, 37 stars)
13. diffres-python - Learning differentiable temporal resolution on time-series data. (Python, 30 stars)
14. DCASE_2022_Task_5 - System that ranked 2nd in DCASE 2022 Challenge Task 5: Few-shot Bioacoustic Event Detection (Python, 27 stars)
15. ontology-aware-audio-tagging - (Python, 13 stars)
16. courseProject_Compiler - Java implementation of the NWPU Compiler Principles course project (pilot class) (Java, 13 stars)
17. Key-word-spotting-DNN-GRU-DSCNN - Keyword spotting with GRU/DNN/DS-CNN (Python, 8 stars)
18. DM_courseProject - KNN and Bayes classifiers for the NWPU Data Mining and Analysis course (Python, 6 stars)
19. netease_downloader - Download NetEase Cloud Music tracks playlist by playlist (Python, 3 stars)
20. Channel-wise-Subband-Input - Demos for the paper: Channel-wise Subband Input for Better Voice and Accompaniment Separation on High Resolution Music (Jupyter Notebook, 2 stars)
21. haoheliu.github.io - (SCSS, 1 star)
22. demopage-NVSR - (HTML, 1 star)
23. deepDecagon - (Python, 1 star)
24. visa-monitor - Monitors visa appointment availability in real time and sends an email notification when an earlier slot appears (Python, 1 star)
25. colab_collection - (Jupyter Notebook, 1 star)
26. SatProj - NWPU applied comprehensive experiment (Python, 1 star)
27. demopage-voicefixer - VoiceFixer is a speech restoration model that handles noise, reverberation, low resolution (2kHz~44.1kHz), and clipping (0.1-1.0 threshold) distortion simultaneously. (HTML, 1 star)
28. mushra_test_2024_April - (1 star)