Text-to-Audio/Music Generation

AudioLDM 2

This repo currently supports Text-to-Audio (including Music) and Text-to-Speech Generation.


Change Log

  • 2023-08-27: Add two new checkpoints!
    • 🌟 48kHz AudioLDM model: Now we support high-fidelity audio generation!
    • 16kHz improved AudioLDM model: Trained with more data and optimized model architecture.

TODO

  • Add the text-to-speech checkpoint
  • Open-source the AudioLDM training code.
  • Support the generation of longer audio (> 10s)
  • Optimize the inference speed of the model.
  • Integration with the Diffusers library (see 🧨 Diffusers)
  • Add the style-transfer and inpainting code for the audioldm_48k checkpoint (PRs welcome, same logic as AudioLDMv1)

Web APP

  1. Prepare the running environment
conda create -n audioldm python=3.8; conda activate audioldm
pip3 install git+https://github.com/haoheliu/AudioLDM2.git
git clone https://github.com/haoheliu/AudioLDM2; cd AudioLDM2
  2. Start the web application (powered by Gradio)
python3 app.py
  3. A link will be printed out. Click the link to open the app in your browser.

Commandline Usage

Installation

Prepare running environment

# Optional
conda create -n audioldm python=3.8; conda activate audioldm
# Install AudioLDM
pip3 install git+https://github.com/haoheliu/AudioLDM2.git

If you plan to use text-to-speech generation, please also make sure espeak is installed. On Linux you can install it with:

sudo apt-get install espeak

Run the model from the command line

  • Generate sound effects or music based on a text prompt
audioldm2 -t "Musical constellations twinkling in the night sky, forming a cosmic melody."
  • Generate sound effects or music based on a list of text prompts
audioldm2 -tl batch.lst
  • Generate speech based on (1) the transcription and (2) the description of the speaker
audioldm2 -t "A female reporter is speaking full of emotion" --transcription "Wish you have a good day"

audioldm2 -t "A female reporter is speaking" --transcription "Wish you have a good day"

Text-to-Speech uses the audioldm2-speech-gigaspeech checkpoint by default. To run TTS with the LJSpeech pretrained checkpoint, set --model_name audioldm2-speech-ljspeech.
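The speech commands above can also be assembled from a script. The following is a minimal sketch, assuming the audioldm2 CLI is installed and on PATH; the helper name tts_command is hypothetical and not part of the package, and only flags shown in this README are used.

```python
import shlex

# Speech checkpoint names as listed in this README.
SPEECH_CHECKPOINTS = (
    "audioldm2-speech-gigaspeech",
    "audioldm2-speech-ljspeech",
)


def tts_command(speaker, transcription, model_name="audioldm2-speech-gigaspeech"):
    """Return the audioldm2 argument list for one text-to-speech request.

    Hypothetical helper: it only assembles arguments and does not run anything.
    """
    if model_name not in SPEECH_CHECKPOINTS:
        raise ValueError(f"not a speech checkpoint: {model_name}")
    return [
        "audioldm2",
        "--model_name", model_name,
        "-t", speaker,                     # description of the speaker
        "--transcription", transcription,  # the words to be spoken
    ]


cmd = tts_command(
    "A female reporter is speaking",
    "Wish you have a good day",
    model_name="audioldm2-speech-ljspeech",
)
# Print a shell-safe version of the command for inspection.
print(shlex.join(cmd))
```

To actually synthesize speech, the list can be passed to subprocess.run(cmd, check=True). Building a list rather than a single string avoids quoting problems when the transcription contains quotes or spaces.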

Random Seed Matters

Sometimes the model may not perform well (the output sounds weird or low quality) when you switch to different hardware. In this case, try adjusting the random seed to find one that works well on your hardware.

audioldm2 --seed 1234 -t "Musical constellations twinkling in the night sky, forming a cosmic melody."
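Since the best seed depends on the hardware, it can help to try several seeds in one go and compare the results by ear. The sketch below is one way to do that, assuming the audioldm2 CLI is on PATH; the function names, the seed values, and the seed_sweep folder are illustrative choices, not part of the tool.

```python
import subprocess
from pathlib import Path

PROMPT = "Musical constellations twinkling in the night sky, forming a cosmic melody."


def seed_sweep_commands(prompt, seeds, out_root="seed_sweep"):
    """Build one audioldm2 command per seed, each saving to its own folder."""
    commands = []
    for seed in seeds:
        # Separate save paths keep the outputs of different seeds apart.
        save_path = Path(out_root) / f"seed_{seed}"
        commands.append(
            ["audioldm2", "--seed", str(seed), "-s", str(save_path), "-t", prompt]
        )
    return commands


def run_sweep(commands):
    """Run the commands one after another; requires the audioldm2 CLI."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


# Build the commands only; call run_sweep(commands) to generate audio.
commands = seed_sweep_commands(PROMPT, seeds=[0, 42, 1234])
```

After run_sweep(commands) finishes, listen to the files under each seed folder and keep the seed that sounds best on your machine.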

Pretrained Models

You can choose a model checkpoint by setting --model_name:

# CUDA
audioldm2 --model_name "audioldm2-full" --device cuda -t "Musical constellations twinkling in the night sky, forming a cosmic melody."

# MPS
audioldm2 --model_name "audioldm2-full" --device mps -t "Musical constellations twinkling in the night sky, forming a cosmic melody."

We have seven checkpoints you can choose from:

  1. audioldm2-full (default): Generates both sound effects and music with the AudioLDM2 architecture.
  2. audioldm_48k: This checkpoint can generate high-fidelity sound effects and music.
  3. audioldm_16k_crossattn_t5: The improved version of AudioLDM 1.0.
  4. audioldm2-full-large-1150k: Larger version of audioldm2-full.
  5. audioldm2-music-665k: Music generation.
  6. audioldm2-speech-gigaspeech (default for TTS): Text-to-Speech, trained on GigaSpeech Dataset.
  7. audioldm2-speech-ljspeech: Text-to-Speech, trained on LJSpeech Dataset.

We currently support 3 devices:

  • cpu
  • cuda
  • mps (note that the computation requires about 20GB of RAM)
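When the device is not given, the CLI picks one based on the environment. The sketch below shows one plausible way to make the same choice in your own scripts. It is an assumption about reasonable behaviour, not the tool's actual selection logic, and pick_device is a hypothetical name.

```python
def pick_device():
    """Return "cuda", "mps" or "cpu", preferring an accelerator when present.

    Hypothetical helper; the audioldm2 CLI may choose differently.
    """
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to probe, so fall back to CPU.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # Older PyTorch builds have no MPS backend, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
print(device)
```

The returned string can be passed straight to the --device option.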

Other options

  usage: audioldm2 [-h] [-t TEXT] [-tl TEXT_LIST] [-s SAVE_PATH]
                 [--model_name {audioldm_48k, audioldm_16k_crossattn_t5, audioldm2-full,audioldm2-music-665k,audioldm2-full-large-1150k,audioldm2-speech-ljspeech,audioldm2-speech-gigaspeech}] [-d DEVICE]
                 [-b BATCHSIZE] [--ddim_steps DDIM_STEPS] [-gs GUIDANCE_SCALE] [-n N_CANDIDATE_GEN_PER_TEXT]
                 [--seed SEED]

  optional arguments:
    -h, --help            show this help message and exit
    -t TEXT, --text TEXT  Text prompt to the model for audio generation
    --transcription TRANSCRIPTION
                        Transcription used for speech synthesis
    -tl TEXT_LIST, --text_list TEXT_LIST
                          A file that contains text prompt to the model for audio generation
    -s SAVE_PATH, --save_path SAVE_PATH
                          The path to save model output
    --model_name {audioldm_48k,audioldm_16k_crossattn_t5,audioldm2-full,audioldm2-music-665k,audioldm2-full-large-1150k,audioldm2-speech-ljspeech,audioldm2-speech-gigaspeech}
                          The checkpoint you gonna use
    -d DEVICE, --device DEVICE
                          The device for computation. If not specified, the script will automatically choose the device based on your environment. [cpu, cuda, mps, auto]
    -b BATCHSIZE, --batchsize BATCHSIZE
                          Generate how many samples at the same time
    --ddim_steps DDIM_STEPS
                          The sampling step for DDIM
    -dur DURATION, --duration DURATION
                          The duration of the samples
    -gs GUIDANCE_SCALE, --guidance_scale GUIDANCE_SCALE
                          Guidance scale (Large => better quality and relavancy to text; Small => better diversity)
    -n N_CANDIDATE_GEN_PER_TEXT, --n_candidate_gen_per_text N_CANDIDATE_GEN_PER_TEXT
                          Automatic quality control. This number control the number of candidates (e.g., generate three audios and choose the best to show you). A Larger value usually lead to better quality with
                          heavier computation
    --seed SEED           Change this value (any integer number) will lead to a different generation result.

Hugging Face 🧨 Diffusers

AudioLDM 2 is available in the Hugging Face 🧨 Diffusers library from v0.21.0 onwards. The official checkpoints can be found on the Hugging Face Hub, alongside documentation and example scripts.

The Diffusers version of the code runs upwards of 3x faster than the native AudioLDM 2 implementation, and supports generating audio of arbitrary length.

To install 🧨 Diffusers and 🤗 Transformers, run:

pip install --upgrade git+https://github.com/huggingface/diffusers.git transformers accelerate

You can then load pre-trained weights into the AudioLDM2 pipeline, and generate text-conditional audio outputs by providing a text prompt:

import scipy.io.wavfile
import torch
from diffusers import AudioLDM2Pipeline

# Load the pretrained pipeline in half precision and move it to the GPU
repo_id = "cvssp/audioldm2"
pipe = AudioLDM2Pipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# Generate 10 seconds of audio from a text prompt
prompt = "Techno music with a strong, upbeat tempo and high melodic riffs."
audio = pipe(prompt, num_inference_steps=200, audio_length_in_s=10.0).audios[0]

# Save the waveform as a 16 kHz WAV file
scipy.io.wavfile.write("techno.wav", rate=16000, data=audio)

Tips for obtaining high-quality generations can be found under the AudioLDM 2 docs, including the use of prompt engineering and negative prompting.
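As a rough illustration of negative prompting, the sketch below passes a negative prompt, a fixed seed, and a request for several candidate waveforms to the Diffusers pipeline. It assumes a CUDA GPU and installed diffusers, torch and scipy; the helper names, the "Low quality." negative prompt, and the candidate count are illustrative choices. As far as I understand the Diffusers documentation, when several waveforms are requested they are ranked and the first one is the best match to the prompt.

```python
def generation_kwargs(prompt, seconds=10.0, steps=200, candidates=3):
    """Collect the pipeline arguments in one place (hypothetical helper)."""
    return {
        "prompt": prompt,
        "negative_prompt": "Low quality.",       # what the audio should avoid
        "num_inference_steps": steps,
        "audio_length_in_s": seconds,
        "num_waveforms_per_prompt": candidates,  # generate several, keep one
    }


def generate(path="techno.wav", seed=0):
    """Run the pipeline and save one waveform; needs a CUDA GPU."""
    # Heavy imports stay inside the function so the module loads without them.
    import scipy.io.wavfile
    import torch
    from diffusers import AudioLDM2Pipeline

    pipe = AudioLDM2Pipeline.from_pretrained(
        "cvssp/audioldm2", torch_dtype=torch.float16
    ).to("cuda")
    # A seeded generator makes the run repeatable on the same hardware.
    generator = torch.Generator("cuda").manual_seed(seed)
    kwargs = generation_kwargs(
        "Techno music with a strong, upbeat tempo and high melodic riffs."
    )
    audio = pipe(generator=generator, **kwargs).audios[0]
    scipy.io.wavfile.write(path, rate=16000, data=audio)
```

Calling generate() on a GPU machine writes techno.wav to the current directory.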

Tips for optimising inference speed can be found in the blog post AudioLDM 2, but faster ⚡️.

Cite this work

If you found this tool useful, please consider citing:

@article{liu2023audioldm2,
  title={{AudioLDM 2}: Learning Holistic Audio Generation with Self-supervised Pretraining},
  author={Haohe Liu and Qiao Tian and Yi Yuan and Xubo Liu and Xinhao Mei and Qiuqiang Kong and Yuping Wang and Wenwu Wang and Yuxuan Wang and Mark D. Plumbley},
  journal={arXiv preprint arXiv:2308.05734},
  year={2023}
}
@article{liu2023audioldm,
  title={{AudioLDM}: Text-to-Audio Generation with Latent Diffusion Models},
  author={Liu, Haohe and Chen, Zehua and Yuan, Yi and Mei, Xinhao and Liu, Xubo and Mandic, Danilo and Wang, Wenwu and Plumbley, Mark D},
  journal={Proceedings of the International Conference on Machine Learning},
  year={2023}
}
