• Stars
    star
    1,323
  • Rank 35,540 (Top 0.8 %)
  • Language
    Python
  • License
    MIT License
  • Created almost 7 years ago
  • Updated about 2 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

WaveGAN: Learn to synthesize raw audio with generative adversarial networks

WaveGAN (v2)

Official implementation of WaveGAN, a machine learning algorithm which learns to generate raw audio waveforms.

UPDATE (2/2/19): We have made substantial improvements to this repository in response to common requests:

  • Added streaming data loader allowing you to train a WaveGAN on MP3s/WAVs/OGGs/etc. without preprocessing
  • Added ability to train WaveGANs capable of generating longer audio examples (up to 4 seconds at 16kHz)
  • Added support for any audio sample rate
  • Added support for multi-channel audio
  • Compatibility with Python 3 and Tensorflow 1.12.0
  • Old (v1) version still available at this tag

This is the official TensorFlow implementation of WaveGAN (Donahue et al. 2018) (paper) (demo) (sound examples). WaveGAN is a machine learning algorithm which learns to synthesize raw waveform audio by observing many examples of real audio. WaveGAN is comparable to the popular DCGAN approach (Radford et al. 2016) for learning to generate images.

In this repository, we include an implementation of WaveGAN capable of learning to generate up to 4 seconds of audio at 16kHz. For comparison, we also include an implementation of SpecGAN, an approach to audio generation which applies image-generating GANs to image-like audio spectrograms.

WaveGAN is capable of learning to synthesize audio in many different sound domains. In the above figure, we visualize real and WaveGAN-generated audio of speech, bird vocalizations, drum sound effects, and piano excerpts. These sound examples and more can be heard here.

Requirements

pip install tensorflow-gpu==1.12.0
pip install scipy==1.0.0
pip install matplotlib==3.0.2
pip install librosa==0.6.2

Datasets

WaveGAN can now be trained on datasets of arbitrary audio files (previously required preprocessing). You can use any folder containing audio, but here are a few example datasets to help you get started:

Train a WaveGAN

Here is how you would begin (or resume) training a WaveGAN on random clips from a directory containing longer audio, i.e., more than a few seconds per file:

export CUDA_VISIBLE_DEVICES="0"
python train_wavegan.py train ./train \
	--data_dir ./data/dir_with_longer_audio_files

If you are instead training on datasets of short sound effects (e.g., SC09 or drum sound effects), you want to use this command:

export CUDA_VISIBLE_DEVICES="0"
python train_wavegan.py train ./train \
	--data_dir ./data/sc09/train \
	--data_first_slice \
	--data_pad_end \
	--data_fast_wav

Because our codebase buffers audio clips directly from files, it is important to change the data-related command line arguments to be appropriate for your dataset (see [#data-considerations](data considerations below)).

We currently do not support training on multiple GPUs. If your machine has multiple GPUs, make sure to set the CUDA_VISIBLE_DEVICES flag as shown above.

While you can technically train a WaveGAN on CPU, it is prohibitively slow and not recommended. If you do attempt this, add the flag --data_prefetch_gpu_num -1.

Data considerations

The WaveGAN training script is configured out-of-the-box to be appropriate for training on random slices from a directory containing longer audio files (e.g., songs, a common use case). If you want to train WaveGAN on shorter sound effects, you will almost likely want to use the flag --data_first_slice to only extract the first slice from each audio file. If your clips are extremely short (i.e., less than $16384$ samples each as in SC09), you will want to add --data_pad_end so that they get zero padded to fill the slice.

If your dataset consists exclusively of "standard" WAV files (16-bit signed PCM or 32-bit float), you can use the flag --data_fast_wav which will use scipy (faster) to decode your audio instead of librosa. This may slightly increase training speed.

If you want to change the generation length, set --data_slice_len to 16384, 32768, or 65536 to generate that many audio samples. If you choose a larger generation length, you will likely want to reduce the number of model parameters to train more quickly (e.g. --wavegan_dim 32). You can also adjust the sampling rate using --data_sample_rate which will effectively change the generation length.

If you have stereo (or multi-channel) audio, adjust --data_num_channels as needed. If you are modeling more than 2 channels, each audio file must have the exact number of channels specified.

If you want to normalize each audio file before training, set --data_normalize.

Quality considerations

If your results are too noisy, try adding a post-processing filter with --wavegan_genr_pp. You may also want to change the amount of or remove phase shuffle using --wavegan_disc_phaseshuffle 0. Increasing either the model size (--wavegan_dim) or filter length (--wavegan_kernel_len) may improve results but will increase training time.

Monitoring

To run a script that will dump a preview of fixed latent vectors at each checkpoint on the CPU

export CUDA_VISIBLE_DEVICES="-1"
python train_wavegan.py preview ./train

To back up checkpoints every hour (GAN training may occasionally collapse so it's good to have backups)

python backup.py ./train 60

To monitor training via tensorboard, use

tensorboard --logdir=./train

If you are training on the SC09 dataset, this command will (slowly) calculate inception score at each checkpoint

export CUDA_VISIBLE_DEVICES="-1"
python train_wavegan.py incept ./train

Train a SpecGAN

The primary focus of this repository is on WaveGAN, our raw audio generation method. For comparison, we also include an implementation of SpecGAN, an approach to generating audio by applying image-generating GANs on image-like audio spectrograms. This implementation only generates spectrograms of one second in length at 16khz.

Before training a SpecGAN, we must first compute mean and variance of each spectrogram bin to use for normalization. This may take a while(you can also measure these statistics on a subset of the data)

python train_specgan.py moments ./train \
	--data_dir ./data/dir_with_mp3s \
	--data_moments_fp ./train/moments.pkl

To begin (or resume) training on GPU:

python train_specgan.py train ./train \
	--data_dir ./data/dir_with_mp3s \
	--data_moments_fp ./train/moments.pkl

Monitoring

To run a script that will dump a preview of fixed latent vectors at each checkpoint on the CPU

export CUDA_VISIBLE_DEVICES="-1"
python train_specgan.py preview ./train \
	--data_moments_fp ./train/moments.pkl

To back up checkpoints every hour (GAN training will occasionally collapse)

python backup.py ./train 60

To monitor training via tensorboard, use

tensorboard --logdir=./train

If you are training on the SC09 dataset, this command will (slowly) calculate inception score at each checkpoint

export CUDA_VISIBLE_DEVICES="-1"
python train_specgan.py incept ./train \
	--data_moments_fp ./train/moments.pkl

Generation

The training scripts for both WaveGAN and SpecGAN create simple TensorFlow MetaGraphs for generating audio waveforms, located in the training directory. An example usage is below; see this Colab notebook for additional features.

import tensorflow as tf
from IPython.display import display, Audio

# Load the graph
tf.reset_default_graph()
saver = tf.train.import_meta_graph('infer.meta')
graph = tf.get_default_graph()
sess = tf.InteractiveSession()
saver.restore(sess, 'model.ckpt')

# Create 50 random latent vectors z
_z = (np.random.rand(50, 100) * 2.) - 1

# Synthesize G(z)
z = graph.get_tensor_by_name('z:0')
G_z = graph.get_tensor_by_name('G_z:0')
_G_z = sess.run(G_z, {z: _z})

# Play audio in notebook
display(Audio(_G_z[0, :, 0], rate=16000))

Evaluation

Our paper uses Inception score to (roughly) measure model performance. If you plan to directly compare to our reported numbers, you should run this script on a directory of 50,000 16-bit PCM WAV files with 16384 samples each.

python score.py --audio_dir wavs

To reproduce our paper results (9.18 +- 0.04) for the SC09 (download) training dataset, run

python score.py --audio_dir sc09/train  --fix_length --n 18620

Web code

Under web, we also include a JavaScript implementation of WaveGAN (generation only). Using this implementation, we created a procedural drum machine powered by a WaveGAN trained on drum sound effects.

Attribution

If you use this code in your research, cite via the following BibTeX:

@inproceedings{donahue2019wavegan,
  title={Adversarial Audio Synthesis},
  author={Donahue, Chris and McAuley, Julian and Puckette, Miller},
  booktitle={ICLR},
  year={2019}
}

More Repositories

1

nesmdb

The NES Music Database: use machine learning to compose music for the Nintendo Entertainment System!
Python
454
star
2

LakhNES

Generate 8-bit chiptunes with deep learning
Python
332
star
3

sheetsage

Transcribe music into lead sheets!
Python
288
star
4

ddc

Dance Dance Convolution dataset tools and models
Python
211
star
5

ilm

Easily fine tune GPT-2 to fill in missing text
Python
196
star
6

music-cocreation-tutorial

Start-to-finish tutorial for interactive music co-creation in PyTorch and Tensorflow.js
Jupyter Notebook
104
star
7

sdgan

Official implementation of "Semantically Decomposing the Latent Spaces of Generative Adversarial Networks"
Python
95
star
8

opengl_spectrogram

using JUCE to create a 3D spectrogram drawn with OpenGL
C++
39
star
9

neural-loops

Make musical loops in the browser using WaveGAN, GANSynth, and MusicVAE
JavaScript
34
star
10

ddc_onset

Music onset detector from Dance Dance Convolution packaged as a lightweight PyTorch module
Python
31
star
11

midi2key_linux

Simple script to convert MIDI inputs to hotkeys on linux
Shell
13
star
12

fall23-phd-prospectives

Info for prospective PhD students for Chris Donahue's lab at CMU starting Fall 23.
12
star
13

piano-transcribe-batch

Uses Magenta's Onsets and Frames piano transcription model to transcribe a batch of solo piano recordings
JavaScript
11
star
14

piano-genie-research-demo

This is the old Piano Genie demo. For a shiny new one, go to http://piano-genie.glitch.me
TypeScript
10
star
15

gdrive-wget

Generate wget commands for Google Drive links!
JavaScript
5
star
16

ject

JUCE extended convolution techniques GUI
C++
5
star
17

wavegan_examples

Sound examples for WaveGAN
HTML
3
star
18

gpsynth

audio synthesis using genetic programming
C++
2
star
19

chrisdonahue.github.io

Jupyter Notebook
1
star
20

advoc_examples

HTML
1
star
21

sheetsage-lbd

Sound examples for Sheet Sage at ISMIR 2021 late breaking demos https://archives.ismir.net/ismir2021/latebreaking/000049.pdf
TeX
1
star
22

js_audio_examples

repository for open source JS/Web Audio API computer music tutorials
JavaScript
1
star