Large-Audio-Models

We keep track of large models in the audio domain, including speech, singing, and music.

Contents

Prompt-based Audio Synthesis

  • TANGO: Text-to-Audio Generation using Instruction Tuned LLM and Latent Diffusion Model (2023), Deepanway Ghosal et al. [PDF]
  • Diverse and Vivid Sound Generation from Text Descriptions (2023), Guangwei Li et al. [PDF]
  • NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers (2023), Kai Shen et al. [PDF]
  • AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models (2023), Yuancheng Wang et al. [PDF]
  • Physics-Driven Diffusion Models for Impact Sound Synthesis from Videos (2023), Kun Su et al. [PDF]
  • FoundationTTS: Text-to-Speech for ASR Customization with Generative Language Model (2023), Ruiqing Xue et al. [PDF]
  • VALL-E X: Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling (2023), Ziqiang Zhang et al. [PDF]
  • Simple and Controllable Music Generation (2023), Jade Copet et al. [PDF]
  • Efficient Neural Music Generation (2023), Max W. Y. Lam et al. [PDF]
  • ERNIE-Music: Text-to-Waveform Music Generation with Diffusion Models (2023), Pengfei Zhu et al. [PDF]
  • Noise2Music: Text-conditioned Music Generation with Diffusion Models (2023), Qingqing Huang et al. [PDF]
  • Spear-TTS: Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision (2023), Eugene Kharitonov et al. [PDF]
  • SingSong: Generating Musical Accompaniments from Singing (2023), Chris Donahue et al. [PDF]
  • MusicLM: Generating Music From Text (2023), Andrea Agostinelli et al. [PDF]
  • InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt (2023), Dongchao Yang et al. [PDF]
  • Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation (2023), Jiawei Huang et al. [PDF]
  • AudioLDM: Text-to-Audio Generation with Latent Diffusion Models (2023), Haohe Liu et al. [PDF]
  • Moûsai: Text-to-Music Generation with Long-Context Latent Diffusion (2023), Flavio Schneider et al. [PDF]
  • Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models (2023), Rongjie Huang et al. [PDF]
  • ArchiSound: Audio Generation with Diffusion (2023), Flavio Schneider. [PDF]
  • VALL-E: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (2023), Chengyi Wang et al. [PDF]
  • PromptTTS: Controllable Text-to-Speech with Text Descriptions (2022), Zhifang Guo et al. [PDF]
  • Diffsound: Discrete Diffusion Model for Text-to-sound Generation (2022), Dongchao Yang et al. [PDF]

Audio Language Models

  • SoundStorm: Efficient Parallel Audio Generation (2023), Zalán Borsos et al. [PDF]
  • AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head (2023), Rongjie Huang et al. [PDF]
  • AudioPaLM: A Large Language Model That Can Speak and Listen (2023), Paul K. Rubenstein et al. [PDF]
  • Pengi: An Audio Language Model for Audio Tasks (2023), Soham Deshmukh et al. [PDF]
  • AudioLM: a Language Modeling Approach to Audio Generation (2022), Zalán Borsos et al. [PDF]

Audio SSL and UL models

  • vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations (2019), Alexei Baevski et al. [PDF]
  • wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations (2020), Alexei Baevski et al. [PDF]
  • W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training (2021), Yu-An Chung et al. [PDF]
  • HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units (2021), Wei-Ning Hsu et al. [PDF]
  • Data2vec: A General Framework for Self-Supervised Learning in Speech, Vision and Language (2022), Alexei Baevski et al. [PDF]
  • ContentVec: An Improved Self-Supervised Speech Representation by Disentangling Speakers (2022), Kaizhi Qian et al. [PDF]
  • MuLan: A Joint Embedding of Music Audio and Natural Language (2022), Qingqing Huang et al. [PDF]