
WaveNet vocoder

NOTE: This is the development version. If you need a stable version, please check out v0.1.1.

The goal of this repository is to provide an implementation of the WaveNet vocoder, which can generate high-quality raw speech samples conditioned on linguistic or acoustic features.

Audio samples are available at https://r9y9.github.io/wavenet_vocoder/.

News

Online TTS demo

A notebook that can be run on Google Colab (https://colab.research.google.com) is available:

Highlights

  • Focus on local and global conditioning of WaveNet, which is essential for a vocoder.
  • 16-bit raw audio modeling by mixture distributions: mixture of logistics (MoL), mixture of Gaussians, and single-Gaussian distributions are supported (see the sampling sketch after this list).
  • Various audio samples and pre-trained models
  • Fast inference by caching intermediate states in convolutions, similar to arXiv:1611.09482
  • Integration with ESPnet (https://github.com/espnet/espnet)
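
As a rough illustration of the mixture-of-logistics output layer, below is a minimal sketch of sampling from a MoL, assuming a (B, 3*num_mix, T) output layout (mixture logits, means, log-scales); the layout and clamping constants are assumptions and may differ from the actual implementation.

import torch

def sample_from_mol(y, log_scale_min=-7.0):
    # y: (B, 3*num_mix, T) network output: mixture logits, means, log-scales
    num_mix = y.size(1) // 3
    logit_probs = y[:, :num_mix]
    means = y[:, num_mix:2 * num_mix]
    log_scales = torch.clamp(y[:, 2 * num_mix:], min=log_scale_min)

    # Pick a mixture component with the Gumbel-max trick
    u = torch.rand_like(logit_probs).clamp(1e-5, 1 - 1e-5)
    k = (logit_probs - torch.log(-torch.log(u))).argmax(dim=1, keepdim=True)
    mean = means.gather(1, k).squeeze(1)
    log_scale = log_scales.gather(1, k).squeeze(1)

    # Inverse-CDF sampling from the chosen logistic component
    u = torch.rand_like(mean).clamp(1e-5, 1 - 1e-5)
    x = mean + log_scale.exp() * (torch.log(u) - torch.log(1.0 - u))
    return x.clamp(-1.0, 1.0)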

Pre-trained models

Note: This is not itself a text-to-speech (TTS) model. With a pre-trained model provided here, you can synthesize a waveform given a mel spectrogram, not raw text. You will need a mel-spectrogram prediction model (such as Tacotron 2) to use the pre-trained models for TTS.

Note: The pretrained model for LJSpeech was fine-tuned multiple times and trained for more than 1000k steps in total. Please refer to issues #1, #75, and #45 for how the model was trained.

Model URL   Data         Hyper params URL   Git commit   Steps
link        LJSpeech     link               2092a64      1000k~ steps
link        CMU ARCTIC   link               b1a1076      740k steps

To use a pre-trained model, first check out the specific git commit noted above, i.e.:

git checkout ${commit_hash}

Then follow the "Synthesize from a checkpoint" section in the README. Note that old versions of synthesis.py may not accept the --preset=<json> parameter, and you may have to change hparams.py according to the preset (JSON) file.
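
On such old commits, one hedged workaround is to print the preset and copy the values into hparams.py by hand; a minimal sketch (the JSON file name is the one from the example below):

import json

with open("20180510_mixture_lj_checkpoint_step000320000_ema.json") as f:
    preset = json.load(f)
for key, value in preset.items():
    print(f"{key} = {value!r}")  # copy these into hparams.py by hand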

For example, you could try:

# Assuming you have downloaded LJSpeech-1.1 at ~/data/LJSpeech-1.1
# pretrained model (20180510_mixture_lj_checkpoint_step000320000_ema.pth)
# hparams (20180510_mixture_lj_checkpoint_step000320000_ema.json)
git checkout 2092a64
python preprocess.py ljspeech ~/data/LJSpeech-1.1 ./data/ljspeech \
  --preset=20180510_mixture_lj_checkpoint_step000320000_ema.json
python synthesis.py --preset=20180510_mixture_lj_checkpoint_step000320000_ema.json \
  --conditional=./data/ljspeech/ljspeech-mel-00001.npy \
  20180510_mixture_lj_checkpoint_step000320000_ema.pth \
  generated

You can find the generated wav file in the generated directory. Wondering how it works? Take a look at the code :)
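
For a quick sanity check on the output, a small sketch (assuming scipy is installed; the exact file names depend on the input features):

import glob
from scipy.io import wavfile

for path in glob.glob("generated/**/*.wav", recursive=True):
    sr, wav = wavfile.read(path)
    print(path, sr, wav.shape, wav.dtype)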

Repository structure

The repository consists of 1) a PyTorch library, 2) command line tools, and 3) ESPnet-style recipes. The first is a PyTorch library that provides WaveNet functionality. The second is a set of tools to run WaveNet training/inference, data processing, etc. The last is a set of reproducible recipes that combine the WaveNet library and the utility tools. Please take a look at them depending on your purpose. If you want to build a WaveNet on your own dataset (probably the most likely case), the recipes are the way to go.

Requirements

  • Python 3
  • CUDA >= 8.0
  • PyTorch >= v0.4.0

Installation

git clone https://github.com/r9y9/wavenet_vocoder && cd wavenet_vocoder
pip install -e .

If you only need the library part, you can install it from PyPI:

pip install wavenet_vocoder
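
As a minimal sketch of the library part, the snippet below constructs a model and runs a forward pass; the constructor defaults and the (batch, channels, time) one-hot input layout are assumptions and may vary across versions.

import torch
from wavenet_vocoder import WaveNet

model = WaveNet()             # untrained model with default hyperparameters
x = torch.zeros(1, 256, 100)  # (batch, quantized channels, time steps)
x[:, 127, :] = 1.0            # dummy one-hot input (all samples at the middle level)
with torch.no_grad():
    y = model(x)              # per-timestep output distribution parameters
print(y.shape)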

Getting started

Kaldi-style recipes

The repository provides Kaldi-style recipes to make experiments reproducible and easily manageable. Available recipes are as follows:

  • mulaw256: WaveNet that uses a categorical output distribution. The input is an 8-bit mu-law quantized waveform.
  • mol: Mixture of Logistics (MoL) WaveNet. The input is 16-bit raw audio.
  • gaussian: Single-Gaussian WaveNet (a.k.a. the teacher WaveNet of ClariNet). The input is 16-bit raw audio (see the loss sketch after this list).
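
To make the single-Gaussian case concrete, here is a minimal sketch of the per-sample negative log-likelihood such a model minimizes; the (B, 2, T) mean/log-std output layout and the clamping constant are assumptions, not the repository's exact code.

import math
import torch

def gaussian_nll(y_hat, y, log_std_min=-7.0):
    # y_hat: (B, 2, T) predicted mean and log standard deviation per sample
    # y:     (B, T)    target waveform in [-1, 1]
    mean = y_hat[:, 0]
    log_std = torch.clamp(y_hat[:, 1], min=log_std_min)
    nll = 0.5 * math.log(2.0 * math.pi) + log_std \
        + 0.5 * ((y - mean) / log_std.exp()) ** 2
    return nll.mean()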

Every recipe has a run.sh, which specifies all the steps to perform WaveNet training/inference, including data preprocessing. Please see run.sh in the egs directory for details.

NOTICE: Global conditioning for multi-speaker WaveNet is not supported in the above recipes (it shouldn't be difficult to implement, though). Please check v0.1.12 for the feature, or raise an issue if you really need it.

Apply recipe to your own dataset

The recipes are designed to be generic so that they can be used with any dataset. To apply a recipe to your own dataset, you need to put all the wav files in a single flat directory, i.e.:

> tree -L 1 ~/data/LJSpeech-1.1/wavs/ | head
/Users/ryuichi/data/LJSpeech-1.1/wavs/
β”œβ”€β”€ LJ001-0001.wav
β”œβ”€β”€ LJ001-0002.wav
β”œβ”€β”€ LJ001-0003.wav
β”œβ”€β”€ LJ001-0004.wav
β”œβ”€β”€ LJ001-0005.wav
β”œβ”€β”€ LJ001-0006.wav
β”œβ”€β”€ LJ001-0007.wav
β”œβ”€β”€ LJ001-0008.wav
β”œβ”€β”€ LJ001-0009.wav

That's it! The last step is to modify db_root in run.sh or pass db_root as a command line argument to run.sh:

./run.sh --stage 0 --stop-stage 0 --db-root ~/data/LJSpeech-1.1/wavs/

Step-by-step

A recipe typically consists of multiple steps. It is strongly recommended to run the recipe step by step the first time to understand how it works. To do so, specify stage and stop_stage as follows:

./run.sh --stage 0 --stop-stage 0
./run.sh --stage 1 --stop-stage 1
./run.sh --stage 2 --stop-stage 2

In typical situations, you need to specify CUDA devices explicitly, especially for the training step.

CUDA_VISIBLE_DEVICES="0,1" ./run.sh --stage 2 --stop-stage 2

Docs for command line tools

Command line tools are written with docopt. See each docstring for basic usage.

tojson.py

Dump hyperparameters to a JSON file.

Usage:

python tojson.py --hparams="parameters you want to override" <output_json_path>

preprocess.py

Usage:

python preprocess.py wavallin ${dataset_path} ${out_dir} --preset=<json>

train.py

Note: for multi-GPU training, make sure that batch_size % num_gpu == 0.

Usage:

python train.py --dump-root=${dump_root} --preset=<json> \
  --hparams="parameters you want to override"

evaluate.py

Given a directory that contains local conditioning features, synthesize waveforms for them.

Usage:

python evaluate.py ${dump_root} ${checkpoint} ${output_dir} --dump-root="data location" \
    --preset=<json> --hparams="parameters you want to override"

Options:

  • --num-utterances=<N>: Number of utterances to be generated. If not specified, all utterances are generated. This is useful for debugging.

synthesis.py

NOTICE: This is probably not working now. Please use evaluate.py instead.

Synthesize a waveform given a conditioning feature.

Usage:

python synthesis.py ${checkpoint_path} ${output_dir} --preset=<json> --hparams="parameters you want to override"

Important options:

  • --conditional=<path>: (Required for conditional WaveNet) Path of local conditioning features (.npy). If this is specified, the number of time steps to generate is determined by the size of the conditioning feature (see the sketch below).
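
To see how many samples that implies, a small sketch that inspects a conditioning feature (the path is the example from above; the frames-by-mel-bins layout and the hop-size relation are assumptions):

import numpy as np

c = np.load("./data/ljspeech/ljspeech-mel-00001.npy")
print(c.shape)  # e.g. (num_frames, num_mels)
# With hop size H from the preset, roughly num_frames * H samples are generated.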

Training scenarios

Training un-conditional WaveNet

NOTICE: This is probably not working now. Please check v0.1.1 for the working version.

python train.py --dump-root=./data/cmu_arctic/ \
    --hparams="cin_channels=-1,gin_channels=-1"

You have to disable global and local conditioning by setting gin_channels and cin_channels to negative values.
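
As a hedged illustration of what these hyperparameters control, the same channel settings appear in the library's WaveNet constructor (argument names assumed from the defaults):

from wavenet_vocoder import WaveNet

# Unconditional model: negative values disable the local/global conditioning networks.
model = WaveNet(cin_channels=-1, gin_channels=-1)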

Training WaveNet conditioned on mel-spectrogram

python train.py --dump-root=./data/cmu_arctic/ --speaker-id=0 \
    --hparams="cin_channels=80,gin_channels=-1"

Training WaveNet conditioned on mel-spectrogram and speaker embedding

NOTICE: This is probably not working now. Please check v0.1.1 for the working version.

python train.py --dump-root=./data/cmu_arctic/ \
    --hparams="cin_channels=80,gin_channels=16,n_speakers=7"

Misc

Monitor with Tensorboard

Logs are dumped to the ./log directory by default. You can monitor them with TensorBoard:

tensorboard --logdir=log

List of papers that used the repository

Thank you very much!! If you find a new one, please submit a PR.

