• Stars: 340
• Rank: 123,552 (Top 3%)
• Language: Python
• Created: over 1 year ago
• Updated: 5 months ago

Repository Details

Code, Dataset, and Pretrained Models for the Audio and Speech Large Language Model "Listen, Think, and Understand".

Listen, Think, and Understand

[Figure: Illustration of CAV-MAE.]


LTU-AS (Second Generation):

LTU-AS was accepted at ASRU 2023. See you in Taipei!

[Paper] [HuggingFace Space] [ASRU Peer Review] [Compare LTU-1 and LTU-AS]

Authors: Yuan Gong, Alexander H. Liu, Hongyin Luo, Leonid Karlinsky, and James Glass (MIT & MIT-IBM Watson AI Lab)


LTU (First Generation):

[Paper] [HuggingFace Space]

Authors: Yuan Gong, Hongyin Luo, Alexander H. Liu, Leonid Karlinsky, and James Glass (MIT & MIT-IBM Watson AI Lab)


Abstract:

The ability of artificial intelligence (AI) systems to perceive and comprehend audio signals is crucial for many applications. Although significant progress has been made in this area since the development of AudioSet, most existing models are designed to map audio inputs to pre-defined, discrete sound label sets. In contrast, humans can not only classify sounds into coarse-grained categories, but also listen to the details of the sounds, explain the reason for their predictions, think about what the sound implies, and understand the scene and what action needs to be taken. Such capabilities beyond perception are not yet present in existing audio models. On the other hand, modern large language models (LLMs) exhibit emerging reasoning ability, but they lack audio perception capabilities. We therefore ask: can we build an AI model that has both audio perception and reasoning ability?

In this paper, we propose a novel audio foundation model, called LTU (Listen, Think, and Understand). To train LTU, we created a new OpenAQA-5M dataset consisting of 1.9 million closed-ended and 3.7 million open-ended, diverse (audio, question, answer) tuples, and used an autoregressive training framework and a perception-to-understanding curriculum. LTU demonstrates strong performance and generalization ability on conventional audio tasks such as classification and captioning. Moreover, it exhibits remarkable reasoning and comprehension abilities in the audio domain. To the best of our knowledge, LTU is the first audio-enabled large language model that bridges audio perception with advanced reasoning.
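
To make the training recipe above concrete, here is a minimal sketch of what an OpenAQA-style (audio, question, answer) tuple could look like, with one closed-ended and one open-ended example. The field names and sample contents are illustrative assumptions, not the actual OpenAQA-5M schema.

```python
# Minimal sketch of an OpenAQA-5M-style (audio, question, answer) tuple.
# Field names and example contents are illustrative assumptions, not the
# dataset's actual schema.
from dataclasses import dataclass

@dataclass
class AudioQA:
    audio_path: str  # pointer to the audio clip
    question: str    # closed-ended (label-style) or open-ended (reasoning)
    answer: str      # free-text target the model learns to generate
                     # autoregressively, conditioned on audio and question

# Closed-ended: a conventional perception task recast as QA.
closed = AudioQA(
    audio_path="clip_0001.wav",
    question="What sound events can be heard in the audio clip?",
    answer="A dog barking and passing traffic.",
)

# Open-ended: reasoning beyond the label, the "think and understand" part.
open_ended = AudioQA(
    audio_path="clip_0001.wav",
    question="Why might the dog be barking?",
    answer="The barking overlaps with engine noise, so the dog is likely "
           "reacting to a vehicle passing nearby.",
)
```

Under the perception-to-understanding curriculum, training would plausibly start from closed-ended tuples like the first example before introducing open-ended ones like the second.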

What about the code? We plan to release it, but our institute needs to review the software release first; we are preparing for that review.

More Repositories

1. ast (Jupyter Notebook, 1,077 stars)
   Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".

2. ssast (Python, 358 stars)
   Code for the AAAI 2022 paper "SSAST: Self-Supervised Audio Spectrogram Transformer".

3. whisper-at (Python, 303 stars)
   Code and pretrained models for the Interspeech 2023 paper "Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong Audio Event Taggers".

4. cav-mae (Python, 217 stars)
   Code and pretrained models for the ICLR 2023 paper "Contrastive Audio-Visual Masked Autoencoder".

5. gopt (Python, 139 stars)
   Code for the ICASSP 2022 paper "Transformer-Based Multi-Aspect Multi-Granularity Non-native English Speaker Pronunciation Assessment".

6. psla (Python, 131 stars)
   Code for the TASLP paper "PSLA: Improving Audio Tagging With Pretraining, Sampling, Labeling, and Aggregation".

7. vocalsound (Jupyter Notebook, 95 stars)
   Dataset and baseline code for the VocalSound dataset (ICASSP 2022).

8. uavm (Python, 54 stars)
   Code for the IEEE Signal Processing Letters 2022 paper "UAVM: Towards Unifying Audio and Visual Models".

9. python-compute-eer (Python, 37 stars)
   Simple Python script to compute the equal error rate (EER) for machine learning model evaluation; see the EER sketch after this list.

10. ReMASC (Python, 36 stars)
    ReMASC: Realistic Replay Attack Corpus for Voice Controlled Systems.

11. realtime-adversarial-attack (Python, 20 stars)
    Code for the IJCAI 2019 paper "Real-time Adversarial Attack".

12. llm_speech_emotion_challenge (Jupyter Notebook, 11 stars)

13. multichannel-antispoof (Python, 5 stars)
    Code for the SPL paper "Detecting Replay Attacks Using Multi-Channel Audio: A Neural Network-Based Method".

14. efficient-voice-antispoof (Jupyter Notebook, 4 stars)

15. kaldi-abbr (2 stars)
    Notes on Kaldi naming conventions.

16. yuangongnd.github.io (HTML, 2 stars)

17. Speech_DB_Engine (Python, 1 star)
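
Since the python-compute-eer entry above targets a concrete metric, here is a minimal sketch of computing an equal error rate from binary trial labels and scores. It is an independent illustration built on scikit-learn's roc_curve, not that repository's actual script, and the toy labels and scores are made-up inputs.

```python
# Hedged sketch: equal error rate (EER) from binary trial labels and scores.
# Independent illustration (scikit-learn based), not the python-compute-eer
# repository's actual implementation.
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels, scores):
    """EER: the operating point where false-accept rate == false-reject rate."""
    fpr, tpr, _ = roc_curve(labels, scores)  # sweep all score thresholds
    fnr = 1.0 - tpr                          # false-reject (miss) rate
    idx = np.nanargmin(np.abs(fnr - fpr))    # threshold where the rates cross
    return (fpr[idx] + fnr[idx]) / 2.0       # average out the discretization gap

# Toy usage: 1 = target (positive) trial, 0 = impostor (negative) trial.
labels = np.array([1, 1, 1, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.4, 0.5, 0.3, 0.1])
print(f"EER = {compute_eer(labels, scores):.3f}")
```

Averaging the false-accept and false-reject rates at the closest ROC point smooths over the discrete thresholds, a common convention when the two rates do not cross exactly.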