VCTK multi-speaker tacotron for ICASSP 2020

multi-speaker-tacotron

This is an implementation of our paper from ICASSP 2020:
"Zero-Shot Multi-Speaker Text-To-Speech with State-of-the-art Neural Speaker Embeddings," by Erica Cooper, Cheng-I Lai, Yusuke Yasuda, Fuming Fang, Xin Wang, Nanxin Chen, and Junichi Yamagishi.
https://arxiv.org/abs/1910.10838
Please cite this paper if you use this code.

Audio samples can be found here: https://nii-yamagishilab.github.io/samples-multi-speaker-tacotron/

News:

  • 2022-03-29: Migrated data from Dropbox to Zenodo.
  • 2021-06-21: Added scripts for creating tfrecords to synthesize new texts using pretrained models. See directory synthesize_new_texts and its README.
  • 2020-08-10: Added example scripts for our new paper accepted to Interspeech 2020, "Can Speaker Augmentation Improve Multi-Speaker End-to-End TTS?" See directory is20 and please also update your copies of tacotron2 and self-attention-tacotron repositories as these contain some necessary changes.

Dependencies:

It is recommended to set up a miniconda environment for using Tacotron. https://repo.anaconda.com

conda create -n taco python=3.6.8
conda activate taco
conda install tensorflow-gpu scipy matplotlib docopt hypothesis pyspark unidecode
conda install -c conda-forge librosa
pip install inflect pysptk
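
After installing, it can be useful to confirm that the packages are visible from inside the taco environment. The following is a minimal sketch, not part of this repository; the helper name missing_modules is hypothetical, and the list assumes each package imports under the module name shown (for example, tensorflow-gpu imports as tensorflow).

```python
import importlib.util

# Module names assumed to correspond to the packages installed above.
REQUIRED_MODULES = [
    "tensorflow", "scipy", "matplotlib", "docopt", "hypothesis",
    "pyspark", "unidecode", "librosa", "inflect", "pysptk",
]


def missing_modules(names):
    """Return the top-level module names that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

Run it with the taco environment activated; any module it reports as missing can be installed with the commands above.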

Install this repository

git clone https://github.com/nii-yamagishilab/multi-speaker-tacotron.git external/multi_speaker_tacotron

Install Tacotron dependencies if you don't have them already:

mkdir -p external
git clone https://github.com/nii-yamagishilab/tacotron2.git external/tacotron2
git clone https://github.com/nii-yamagishilab/self-attention-tacotron.git external/self_attention_tacotron

Note that the hyphens in the repository names are replaced with underscores in the directory names; this is necessary because "-" is not a valid character in Python package and module names.
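
The naming constraint can be checked directly in Python. This is a small illustrative sketch (the helper name is_importable_name is hypothetical) that uses the standard str.isidentifier and keyword.iskeyword checks to test whether a directory name could appear in an import statement.

```python
import keyword


def is_importable_name(dirname):
    """True if dirname is a valid Python identifier and not a reserved word."""
    return dirname.isidentifier() and not keyword.iskeyword(dirname)


for name in ["self-attention-tacotron", "self_attention_tacotron", "tacotron2"]:
    print(name, is_importable_name(name))
```

The hyphenated name is rejected, while the underscore form and tacotron2 are accepted.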

Next, download the project data and models from Zenodo: https://zenodo.org/record/6349897#.YkKR-C8Rr0o (The data was originally hosted in a Dropbox folder and has since been moved to Zenodo.)

  • Preprocessed VCTK data: in the data directory
  • VCTK Tacotron models: in the tacotron-models directory
  • VCTK Wavenet models: in the wavenet-models directory
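
Before training, you may want to confirm that the download unpacked as expected. The sketch below assumes that the three directories listed above sit side by side under a single download root; that layout, and the helper name missing_dirs, are assumptions made for illustration.

```python
from pathlib import Path

# Directory names taken from the list above.
EXPECTED_DIRS = ["data", "tacotron-models", "wavenet-models"]


def missing_dirs(download_root, expected=EXPECTED_DIRS):
    """Return the expected directory names not present under download_root."""
    root = Path(download_root)
    return [name for name in expected if not (root / name).is_dir()]


if __name__ == "__main__":
    # "." is a placeholder; point this at wherever you unpacked the download.
    print("Missing directories:", missing_dirs("."))
```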

Training from scratch using only the VCTK data is possible with the script train_from_scratch.sh; this does not require the Nancy pre-trained model, which we are unable to share due to licensing restrictions.

To use our pre-trained WaveNet models, you will also need our WaveNet implementation, which can be found here: https://github.com/nii-yamagishilab/project-CURRENNT-scripts

To obtain embeddings for new samples, you will need the neural speaker embedding code, which can be found here: https://github.com/jefflai108/pytorch-kaldi-neural-speaker-embeddings

How to use

See the scripts warmup.sh (warm-start training), train_from_scratch.sh (training on VCTK data only), and predictmel.sh (prediction). The scripts assume a SLURM-type computing environment. You will need to change the paths to match your environment and point to your data. Here are the parameters relevant to multi-speaker TTS:

  • source-data-root and target-data-root: path to your source and target preprocessed data
  • selected-list-dir: train/eval/test set definitions
  • batch_size: if you get OOM errors, try reducing the batch size
  • use_external_speaker_embedding=True: use speaker embeddings that you provide from a file (see the files in the speaker_embeddings directory)
  • embedding_file: path to the file containing your speaker embeddings
  • speaker_embedding_dim: dimension should match the dimension in your embedding file
  • speaker_embedding_projection_out_dim=64: We found experimentally that projecting the speaker embedding to a lower dimension helped to reduce overfitting. You can try different values, but to use our pretrained multi-speaker models you will have to use 64.
  • speaker_embedding_offset: must match the ID of your first speaker.
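
One way to read the speaker_embedding_offset requirement is that speaker IDs are shifted so that the first speaker maps to row 0 of the embedding table. The sketch below illustrates that interpretation only; it is an assumption about how the offset is used rather than code from this repository, and the function name embedding_row and the example IDs are hypothetical.

```python
def embedding_row(speaker_id, speaker_embedding_offset, num_speakers):
    """Map a speaker ID to a zero-based row index (assumed: row = ID - offset)."""
    row = speaker_id - speaker_embedding_offset
    if not 0 <= row < num_speakers:
        raise ValueError(
            f"speaker_id {speaker_id} maps to row {row}, outside 0..{num_speakers - 1}; "
            "check that speaker_embedding_offset matches the ID of your first speaker"
        )
    return row


if __name__ == "__main__":
    # Hypothetical example: the first speaker has ID 225 and there are 10 speakers.
    print(embedding_row(225, 225, 10))
    print(embedding_row(227, 225, 10))
```

Under this reading, an offset that does not match the first speaker ID would shift every lookup or push some speakers out of range, which is why the two values must agree.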

The scripts are set up with embedding_file="vctk-x-vector.txt",speaker_embedding_dim='200', which selects the default x-vectors. To use the LDE embeddings from our best system, change this to embedding_file="vctk-lde-3.txt",speaker_embedding_dim='512'.
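
Because the embedding file and its dimension must always be changed together, it may help to keep the two known pairings in one place. The following sketch is not part of the repository: the preset names and the helper embedding_hparams are hypothetical, while the file names and dimensions are the ones given above.

```python
# File name and dimension for each embedding type described above.
EMBEDDING_PRESETS = {
    "x-vector": ("vctk-x-vector.txt", 200),
    "lde": ("vctk-lde-3.txt", 512),
}


def embedding_hparams(preset):
    """Return the hyperparameter fragment for the chosen embedding type."""
    embedding_file, dim = EMBEDDING_PRESETS[preset]
    return f"embedding_file=\"{embedding_file}\",speaker_embedding_dim='{dim}'"


if __name__ == "__main__":
    for preset in EMBEDDING_PRESETS:
        print(embedding_hparams(preset))
```

The printed fragments use the same quoting as the strings shown above, so they can be pasted into the hyperparameter settings of the scripts.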

Acknowledgments

This work was partially supported by a JST CREST Grant (JPMJCR18A6, VoicePersonae project), Japan, and by MEXT KAKENHI Grants (16H06302, 17H04687, 18H04120, 18H04112, 18KT0051, 19K24372), Japan. The numerical calculations were carried out on the TSUBAME 3.0 supercomputer at the Tokyo Institute of Technology.

Licence

BSD 3-Clause License

Copyright (c) 2020, Yamagishi Laboratory, National Institute of Informatics
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  • Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

  • Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

  • Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
