
music2video Overview

A repo for making an AI-generated music video from any song with Wav2CLIP and VQGAN-CLIP.

The base code was derived from VQGAN-CLIP. The CLIP embedding for audio was derived from Wav2CLIP.
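
For reference, the audio side of the pipeline uses the wav2clip package to embed a waveform into the shared CLIP space. Below is a minimal sketch of its published get_model/embed_audio API; loading the file with librosa (not part of this repo's dependency list) and the 16 kHz mono format are assumptions here, not confirmed repo requirements:

import librosa
import wav2clip

# Load the song as a mono float waveform; librosa and 16 kHz are
# assumptions here, not confirmed repo requirements.
audio, sr = librosa.load("imagenet_song.mp3", sr=16000, mono=True)

# Embed the audio into CLIP space (wav2clip's documented API).
model = wav2clip.get_model()
embedding = wav2clip.embed_audio(audio, model)
print(embedding.shape)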

A technical paper describing the mechanism is provided at the following link: Music2Video: Automatic Generation of Music Video with fusion of audio and text

The citation for the technical paper is provided below:

@article{jang2022music2video,
  title={Music2Video: Automatic Generation of Music Video with fusion of audio and text},
  author={Jang, Joel and Shin, Sumin and Kim, Yoonjeon},
  journal={arXiv preprint arXiv:2201.03809},
  year={2022}
}

Sample

A sample music video created with this repository is available at this YouTube link. Here is a sample of snapshots from a generated music video with its lyrics: sample

You can make one with your own song too!

Set up

This example uses Anaconda to manage virtual Python environments.

Create a new virtual Python environment for VQGAN-CLIP:

conda create --name vqgan python=3.9
conda activate vqgan

Install PyTorch in the new environment:

Note: This installs the CUDA version of PyTorch. If you want to use an AMD graphics card, read the AMD section below.

pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
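
To confirm that the CUDA build of PyTorch installed correctly before continuing, a quick sanity check:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"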

Install other required Python packages:

pip install ftfy regex tqdm omegaconf pytorch-lightning IPython kornia imageio imageio-ffmpeg einops torch_optimizer wav2clip

Or use the requirements.txt file, which includes version numbers.

Clone required repositories:

git clone 'https://github.com/nerdyrodent/VQGAN-CLIP'
cd VQGAN-CLIP
git clone 'https://github.com/openai/CLIP'
git clone 'https://github.com/CompVis/taming-transformers'

Note: In my development environment, both CLIP and taming-transformers are cloned into the local directory, and so aren't listed in the requirements.txt or vqgan.yml files.

As an alternative, you can also pip install taming-transformers and CLIP.
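
For example (CLIP is typically installed straight from its GitHub repository; the exact PyPI package name for taming-transformers is an assumption and may differ from the repo name):

pip install taming-transformers
pip install git+https://github.com/openai/CLIP.git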

You will also need at least 1 VQGAN pretrained model. E.g.

mkdir checkpoints

curl -L -o checkpoints/vqgan_imagenet_f16_16384.yaml -C - 'https://heibox.uni-heidelberg.de/d/a7530b09fed84f80a887/files/?p=%2Fconfigs%2Fmodel.yaml&dl=1' #ImageNet 16384
curl -L -o checkpoints/vqgan_imagenet_f16_16384.ckpt -C - 'https://heibox.uni-heidelberg.de/d/a7530b09fed84f80a887/files/?p=%2Fckpts%2Flast.ckpt&dl=1' #ImageNet 16384

Note that users of curl on Microsoft Windows should use double quotes.

The download_models.sh script is an optional way to download a number of models. By default, it will download just 1 model.

See https://github.com/CompVis/taming-transformers#overview-of-pretrained-models for more information about VQGAN pre-trained models, including download links.

By default, the model .yaml and .ckpt files are expected in the checkpoints directory. See https://github.com/CompVis/taming-transformers for more information on datasets and models.

Making the music video

To generate a video from music, specify your music file and use one of the following commands, depending on your needs. A sample music file and lyrics file from Yannic Kilcher's repo are provided.

If you have a lyrics file with time-stamp information, such as the example in 'lyrics/imagenet_song_lyrics.csv', you can make a lyrics- and audio-guided music video with the following command:

python generate.py -vid -o outputs/output.png -ap "imagenet_song.mp3" -lyr "lyrics/imagenet_song_lyrics.csv" -gid 2 -ips 100
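
For reference, a time-stamped lyrics file pairs each lyric line with its timing. The bundled 'lyrics/imagenet_song_lyrics.csv' is the authoritative example; the column layout sketched below is purely illustrative, not the repo's confirmed schema:

start,end,lyric
0.0,4.2,first lyric line
4.2,8.7,second lyric line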

To interpolate between the audio representation and the text representation (which gives more of a "music video" feel), use the following command:

python generate_interpolate.py -vid -ips 100 -o outputs/output.png -ap "imagenet_song.mp3" -lyr "lyrics/imagenet_song_lyrics.csv" -gid 0

If you do not have lyrics information, you can run the following command using only audio prompts:

python generate.py -vid -o outputs/output.png -ap "imagenet_song.mp3" -gid 2 -ips 100

If any of the above commands fails while merging the video segments, use combine_mp4.py to concatenate the video segments from the output directory separately, or download the segments from the output directory and merge them manually with video-editing software.
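
As a manual alternative, ffmpeg's concat demuxer can stitch the segments together without re-encoding; the segment file names below are illustrative, so substitute the actual files from your output directory:

printf "file '%s'\n" outputs/segment_*.mp4 > segments.txt
ffmpeg -f concat -safe 0 -i segments.txt -c copy merged_video.mp4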

Citations

@misc{unpublished2021clip,
    title  = {CLIP: Connecting Text and Images},
    author = {Alec Radford and Ilya Sutskever and Jong Wook Kim and Gretchen Krueger and Sandhini Agarwal},
    year   = {2021}
}
@misc{esser2020taming,
      title={Taming Transformers for High-Resolution Image Synthesis}, 
      author={Patrick Esser and Robin Rombach and Björn Ommer},
      year={2020},
      eprint={2012.09841},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
@article{wu2021wav2clip,
  title={Wav2CLIP: Learning Robust Audio Representations From CLIP},
  author={Wu, Ho-Hsiang and Seetharaman, Prem and Kumar, Kundan and Bello, Juan Pablo},
  journal={arXiv preprint arXiv:2110.11499},
  year={2021}
}
