
Code and models for the paper "Text-to-Audio Generation using Instruction Tuned LLM and Latent Diffusion Model"

TANGO: Text to Audio using iNstruction-Guided diffusiOn

Paper | Model | Website and Examples | More Examples | Huggingface Demo | Replicate demo and API

🔥 The TANGO demo is live on Hugging Face Spaces.

📣 We are releasing Tango-Full-FT-AudioCaps, which was first pre-trained on TangoPromptBank, a collection of diverse text-audio pairs, and then fine-tuned on AudioCaps. This checkpoint obtained state-of-the-art results for text-to-audio generation on AudioCaps.

📣 We are excited to share that Oracle Cloud has sponsored the TANGO project.

Tango Model Family

Model Name Model Path
Tango https://huggingface.co/declare-lab/tango
Tango-Full-FT-Audiocaps (state-of-the-art) https://huggingface.co/declare-lab/tango-full-ft-audiocaps
Tango-Full-FT-Audio-Music-Caps https://huggingface.co/declare-lab/tango-full-ft-audio-music-caps
Tango-Full https://huggingface.co/declare-lab/tango-full

Description

TANGO is a latent diffusion model (LDM) for text-to-audio (TTA) generation. TANGO can generate realistic audio, including human sounds, animal sounds, natural and artificial sounds, and sound effects, from textual prompts. We use the frozen instruction-tuned LLM Flan-T5 as the text encoder and train a UNet-based diffusion model for audio generation. We perform comparably to current state-of-the-art models for TTA across both objective and subjective metrics, despite training the LDM on a 63 times smaller dataset. We release our model, training and inference code, and pre-trained checkpoints for the research community.

Quickstart Guide

Download the TANGO model and generate audio from a text prompt:

import IPython
import soundfile as sf
from tango import Tango

tango = Tango("declare-lab/tango")

prompt = "An audience cheering and clapping"
audio = tango.generate(prompt)
sf.write(f"{prompt}.wav", audio, samplerate=16000)
IPython.display.Audio(data=audio, rate=16000)
[Audio sample: CheerClap.webm]

The model will be downloaded automatically and saved to the cache. Subsequent runs will load the model directly from the cache.

The generate function uses 100 steps by default to sample from the latent diffusion model. We recommend using 200 steps for better-quality audio, at the cost of increased run-time.

prompt = "Rolling thunder with lightning strikes"
audio = tango.generate(prompt, steps=200)
IPython.display.Audio(data=audio, rate=16000)
[Audio sample: Thunder.webm]

Use the generate_for_batch function to generate multiple audio samples for a batch of text prompts:

prompts = [
    "A car engine revving",
    "A dog barks and rustles with some clicking",
    "Water flowing and trickling"
]
audios = tango.generate_for_batch(prompts, samples=2)

This will generate two samples for each of the three text prompts.
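If you want to write these batch outputs to disk, a minimal sketch is shown below; it assumes audios comes back as a flat list of waveforms, so check the actual return structure of generate_for_batch (and flatten it first if it is nested per prompt).

import soundfile as sf

# Assumption: audios is a flat list of 16 kHz waveforms from the call above.
for i, audio in enumerate(audios):
    sf.write(f"batch_sample_{i}.wav", audio, samplerate=16000)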

More generated samples are shown here.

Prerequisites

Our code is built on PyTorch version 1.13.1+cu117. We pin torch==1.13.1 in the requirements file, but you might need to install a specific CUDA build of torch depending on your GPU.
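For example, a CUDA 11.7 build can typically be installed with the command below. This is the generic PyTorch wheel-index pattern rather than anything shipped with this repository, so check the official PyTorch installation selector for the exact command for your CUDA version:

pip install torch==1.13.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117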

Install requirements.txt.

git clone https://github.com/declare-lab/tango/
cd tango
pip install -r requirements.txt

You might also need to install libsndfile1 for soundfile to work properly on Linux:

(sudo) apt-get install libsndfile1

Datasets

Follow the instructions given in the AudioCaps repository for downloading the data. The audio locations and corresponding captions are provided in our data directory. The *.json files are used for training and evaluation. Once you have downloaded your copy of the data, you should be able to map the file ids to the file locations provided in our data/*.json files.
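Before remapping, it can help to inspect one record from our json files to see which fields hold the audio location/id and the caption. The snippet below is a minimal sketch that assumes one JSON object per line; adjust the parsing if the file stores a single list instead.

import json

# Print the keys of the first record in one of our data files.
# Assumption: one JSON object per line (JSON Lines); adapt if it is a single list.
with open("data/test_audiocaps_subset.json") as f:
    record = json.loads(f.readline())
print(record.keys())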

Note that we cannot distribute the data because of copyright issues.

How to train?

We use the accelerate package from Hugging Face for multi-GPU training. Run accelerate config from the terminal and set up your run configuration by answering the questions asked.

You can now train TANGO on the AudioCaps dataset using:

accelerate launch train.py \
--text_encoder_name="google/flan-t5-large" \
--scheduler_name="stabilityai/stable-diffusion-2-1" \
--unet_model_config="configs/diffusion_model_config.json" \
--freeze_text_encoder --augment --snr_gamma 5

The argument --augment uses augmented data for training as reported in our paper. We recommend training for at least 40 epochs, which is the default in train.py.

To start training from our released checkpoint, use the --hf_model argument.

accelerate launch train.py \
--hf_model "declare-lab/tango" \
--unet_model_config="configs/diffusion_model_config.json" \
--freeze_text_encoder --augment --snr_gamma 5

Check train.py and train.sh for the full list of arguments and how to use them.

The training script should automatically download the AudioLDM weights from here. However, if the download is slow or you face any other issues, you can: i) download the audioldm-s-full file from here, ii) rename it to audioldm-s-full.ckpt, and iii) place it in the /home/user/.cache/audioldm/ directory.
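If you do place the checkpoint manually, the steps look roughly like this, assuming the downloaded file sits in your current directory (paths are illustrative):

mkdir -p ~/.cache/audioldm
mv audioldm-s-full ~/.cache/audioldm/audioldm-s-full.ckpt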

How to make inferences?

From your trained checkpoints

Checkpoints from training will be saved in the saved/*/ directory.

To perform audio generation and objective evaluation on the AudioCaps test set from your trained checkpoint:

CUDA_VISIBLE_DEVICES=0 python inference.py \
--original_args="saved/*/summary.jsonl" \
--model="saved/*/best/pytorch_model_2.bin" \

Check inference.py and inference.sh for the full list of arguments and how to use them.

From our released checkpoints on the Hugging Face Hub

To perform audio generation and objective evaluation on the AudioCaps test set from our Hugging Face checkpoints:

python inference_hf.py --checkpoint="declare-lab/tango"

Note

We use functionalities from audioldm_eval for objective evaluation in inference.py. It requires the gold reference audio files and the generated audio files to have the same names. You need to create the directory data/audiocaps_test_references/subset and keep the reference audio files there. The files should be named output_0.wav, output_1.wav, and so on, where the indices correspond to the line indices in data/test_audiocaps_subset.json.
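A minimal sketch of how one might populate that directory is shown below. Here resolve_local_path is a hypothetical helper standing in for however you map a record of data/test_audiocaps_subset.json to the corresponding downloaded AudioCaps clip, and the one-record-per-line parsing is an assumption to verify against the actual file.

import json
import os
import shutil

ref_dir = "data/audiocaps_test_references/subset"
os.makedirs(ref_dir, exist_ok=True)

with open("data/test_audiocaps_subset.json") as f:
    for i, line in enumerate(f):
        record = json.loads(line)
        # resolve_local_path is a placeholder for your own record -> local file mapping.
        src = resolve_local_path(record)
        shutil.copy(src, os.path.join(ref_dir, f"output_{i}.wav"))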

We use the term subset as some data instances originally released in AudioCaps have since been removed from YouTube and are no longer available. We thus evaluated our models on all the instances which were available as of 8th April, 2023.

We use wandb to log training and inference results.
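If you have not used wandb on your machine before, you will typically need to authenticate once with the standard wandb CLI (this is generic wandb usage, not specific to this repository):

wandb login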

Experimental Results

Model Datasets Text #Params FD ↓ KL ↓ FAD ↓ OVL ↑ REL ↑
Ground truth − − − − − − 91.61 86.78
DiffSound AS+AC ✓ 400M 47.68 2.52 7.75 − −
AudioGen AS+AC+8 others ✗ 285M − 2.09 3.13 − −
AudioLDM-S AC ✗ 181M 29.48 1.97 2.43 − −
AudioLDM-L AC ✗ 739M 27.12 1.86 2.08 − −
AudioLDM-M-Full-FT‡ AS+AC+2 others ✗ 416M 26.12 1.26 2.57 79.85 76.84
AudioLDM-L-Full‡ AS+AC+2 others ✗ 739M 32.46 1.76 4.18 78.63 62.69
AudioLDM-L-Full-FT AS+AC+2 others ✗ 739M 23.31 1.59 1.96 − −
TANGO AC ✓ 866M 24.52 1.37 1.59 85.94 80.36

Citation

Please consider citing the following article if you found our work useful:

@article{ghosal2023tango,
  title={Text-to-Audio Generation using Instruction Tuned LLM and Latent Diffusion Model},
  author={Ghosal, Deepanway and Majumder, Navonil and Mehrish, Ambuj and Poria, Soujanya},
  journal={arXiv preprint arXiv:2304.13731},
  year={2023}
}

Limitations

TANGO is trained on the small AudioCaps dataset, so it may not generate good audio samples for concepts that it has not seen during training (e.g. singing). For the same reason, TANGO is not always able to finely control its generations through textual prompts. For example, the generations from TANGO for the prompts "Chopping tomatoes on a wooden table" and "Chopping potatoes on a metal table" are very similar. "Chopping vegetables on a table" also produces similar audio samples. Training text-to-audio generation models on larger datasets is thus required for the model to learn the composition of textual concepts and varied text-audio mappings.

We are training another version of TANGO on larger datasets to enhance its generalization, compositional and controllable generation ability.

Acknowledgement

We borrow the code in audioldm and audioldm_eval from the AudioLDM repositories. We thank the AudioLDM team for open-sourcing their code.

More Repositories

1. conv-emotion (Python, 1,336 stars): This repo contains implementations of different architectures for emotion recognition in conversations.
2. MELD (Python, 793 stars): MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversation
3. multimodal-deep-learning (OpenEdge ABL, 730 stars): This repository contains various models targeting multimodal representation learning and multimodal fusion for downstream tasks such as multimodal sentiment analysis.
4. awesome-sentiment-analysis (517 stars): Reading list for Awesome Sentiment Analysis papers
5. instruct-eval (Python, 500 stars): This repository contains code to quantitatively evaluate instruction-tuned models such as Alpaca and Flan-T5 on held-out tasks.
6. flan-alpaca (Python, 343 stars): This repository contains code for extending the Stanford Alpaca synthetic instruction tuning to existing instruction-tuned models such as Flan-T5.
7. awesome-emotion-recognition-in-conversations (252 stars): A comprehensive reading list for Emotion Recognition in Conversations
8. MISA (Python, 192 stars): MISA: Modality-Invariant and -Specific Representations for Multimodal Sentiment Analysis
9. RECCON (Python, 169 stars): This repository contains the dataset and the PyTorch implementations of the models from the paper Recognizing Emotion Cause in Conversations.
10. Multimodal-Infomax (Python, 154 stars): This repository contains the official implementation code of the paper Improving Multimodal Fusion with Hierarchical Mutual Information Maximization for Multimodal Sentiment Analysis, accepted at EMNLP 2021.
11. dialogue-understanding (Python, 123 stars): This repository contains PyTorch implementations for the baseline models from the paper Utterance-level Dialogue Understanding: An Empirical Study
12. RelationPrompt (Python, 123 stars): This repository implements our ACL Findings 2022 research paper RelationPrompt: Leveraging Prompts to Generate Synthetic Data for Zero-Shot Relation Triplet Extraction. The goal of Zero-Shot Relation Triplet Extraction (ZeroRTE) is to extract relation triplets of the format (head entity, tail entity, relation), despite not having annotated data for the test relation labels.
13. contextual-utterance-level-multimodal-sentiment-analysis (Python, 123 stars): Context-Dependent Sentiment Analysis in User-Generated Videos
14. flacuna (Python, 106 stars): Flacuna was developed by fine-tuning Vicuna on Flan-mini, a comprehensive instruction collection encompassing various tasks. Vicuna is already an excellent writing assistant, and the intention behind Flacuna was to enhance Vicuna's problem-solving capabilities. To achieve this, we curated a dedicated instruction dataset called Flan-mini.
15. CASCADE (Python, 86 stars): This repo contains code to detect sarcasm from text in discussion forums using deep learning
16. red-instruct (Python, 75 stars): Codes and datasets of the paper Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment
17. BBFN (Python, 62 stars): This repository contains the implementation of the paper Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis
18. CICERO (Python, 60 stars): The purpose of this repository is to introduce new dialogue-level commonsense inference datasets and tasks. We chose dialogues as the data source because dialogues are known to be complex and rich in commonsense.
19. dialog-HGAT (Python, 57 stars): Dialogue Relation Extraction with Document-level Heterogeneous Graph Attention Networks
20. kingdom (Python, 55 stars): Domain Adaptation using External Knowledge for Sentiment Analysis
21. hfusion (Python, 44 stars): Multimodal sentiment analysis using hierarchical fusion with context modeling
22. MIME (Python, 43 stars): This repository contains PyTorch implementations of the models from the paper MIME: MIMicking Emotions for Empathetic Response Generation.
23. speech-adapters (Python, 40 stars): Codes and datasets for our ICASSP 2023 paper, Evaluating parameter-efficient transfer learning approaches on SURE benchmark for speech understanding
24. LLM-PuzzleTest (Python, 33 stars): This repository is maintained to release datasets and models for multimodal puzzle reasoning.
25. HyperTTS (Python, 33 stars)
26. MSA-Robustness (Python, 31 stars): NAACL 2022 paper on Analyzing Modality Robustness in Multimodal Sentiment Analysis
27. CIDER (Python, 28 stars): This repository contains the dataset and the PyTorch implementations of the models from the paper CIDER: Commonsense Inference for Dialogue Explanation and Reasoning, accepted to appear at SIGDIAL 2021.
28. sentence-ordering (Python, 28 stars): This repository contains the PyTorch implementation of the paper STaCK: Sentence Ordering with Temporal Commonsense Knowledge, appearing at EMNLP 2021.
29. resta (Python, 25 stars): Restore safety in fine-tuned language models through task arithmetic
30. HyperRED (Python, 25 stars): This repository implements our EMNLP 2022 research paper A Dataset for Hyper-Relational Extraction and a Cube-Filling Approach.
31. MM-InstructEval (Python, 24 stars): This repository contains code to evaluate various multimodal large language models using different instructions across multiple multimodal content comprehension tasks.
32. MM-Align (Python, 24 stars): [EMNLP 2022] This repository contains the official implementation of the paper "MM-Align: Learning Optimal Transport-based Alignment Dynamics for Fast and Accurate Inference on Missing Modality Sequences"
33. identifiable-transformers (Python, 22 stars)
34. exemplary-empathy (Python, 22 stars): This repository contains the source code of the paper Exemplars-guided Empathetic Response Generation Controlled by the Elements of Human Communication
35. della (Python, 22 stars): DELLA-Merging: Reducing Interference in Model Merging through Magnitude-Based Sampling
36. TEAM (Python, 21 stars): Our EMNLP 2022 paper on MCQA
37. DoubleMix (Python, 20 stars): Code for the COLING 2022 paper "DoubleMix: Simple Interpolation-Based Data Augmentation for Text Classification"
38. adapter-mix (Python, 15 stars)
39. M2H2-dataset (Python, 15 stars): This repository contains the dataset and baselines explained in the paper M2H2: A Multimodal Multiparty Hindi Dataset for Humor Recognition in Conversations
40. ASTE-RL (Python, 14 stars): This repository contains the source code for the paper "Aspect Sentiment Triplet Extraction using Reinforcement Learning", published at CIKM 2021.
41. SAT (Jupyter Notebook, 13 stars): Code for the EMNLP 2022 Findings short paper "SAT: Improving Semi-Supervised Text Classification with Simple Instance-Adaptive Self-Training"
42. KNOT (Python, 13 stars): This repository contains the implementation of the paper KNOT: Knowledge Distillation using Optimal Transport for Solving NLP Tasks
43. WikiDes (Python, 13 stars): A Wikipedia-based summarization dataset
44. Sealing (Python, 9 stars): [NAACL 2024] Official implementation of the paper "Self-Adaptive Sampling for Efficient Video Question Answering on Image--Text Models"
45. VIP (Python, 8 stars): Our EMNLP 2022 paper on VIP-Based Prompting for Parameter-Efficient Learning
46. domadapter (Python, 8 stars): Code for the EACL'23 paper "Udapter: Efficient Domain Adaptation Using Adapters"
47. RobustMIFT (Jupyter Notebook, 8 stars): [arXiv 2024] Official implementation of the paper "Towards Robust Instruction Tuning on Multimodal Large Language Models"
48. ferret (Python, 8 stars): Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique
49. SANCL (Python, 7 stars): [COLING 2022] This repository contains the code of the paper SANCL: Multimodal Review Helpfulness Prediction with Selective Attention and Natural Contrastive Learning.
50. segue (Python, 6 stars): Codes and checkpoints of the Interspeech paper "Sentence Embedder Guided Utterance Encoder (SEGUE) for Spoken Language Understanding"
51. llm_robustness (Python, 3 stars)
52. NLP-OT (1 star)