DailyTalk: Spoken Dialogue Dataset for Conversational Text-to-Speech

Keon Lee*, Kyumin Park*, Daeyoung Kim

Official repository of DailyTalk: Spoken Dialogue Dataset for Conversational Text-to-Speech, ICASSP 2023.

In our paper, we introduce DailyTalk, a high-quality conversational speech dataset designed for Text-to-Speech.

Abstract: The majority of current Text-to-Speech (TTS) datasets, which are collections of individual utterances, contain few conversational aspects. In this paper, we introduce DailyTalk, a high-quality conversational speech dataset designed for conversational TTS. We sampled, modified, and recorded 2,541 dialogues from the open-domain dialogue dataset DailyDialog inheriting its annotated attributes. On top of our dataset, we extend prior work as our baseline, where a non-autoregressive TTS is conditioned on historical information in a dialogue. From the baseline experiment with both general and our novel metrics, we show that DailyTalk can be used as a general TTS dataset, and more than that, our baseline can represent contextual information from DailyTalk. The DailyTalk dataset and baseline code are freely available for academic use with CC-BY-SA 4.0 license.

Dataset

You can download our dataset. Please see Statistic Details for more information.

Pretrained Models

You can download our pretrained models. There are two different directories: 'history_none' and 'history_guo'. The former uses no history encodings, so it is not a conversational context-aware model; the latter encodes conversation history following Conversational End-to-End TTS for Voice Agent (Guo et al., 2020).

Toggle the type of history encoding by

# In the model.yaml
history_encoder:
  type: "Guo" # ["none", "Guo"]

Quickstart

Dependencies

You can install the Python dependencies with

pip3 install -r requirements.txt

A Dockerfile is also provided for Docker users.
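
For example, a typical build-and-run cycle might look like the following; the image name and mount point are placeholders, not taken from this repository.

docker build -t dailytalk .
docker run --rm --gpus all -it -v "$(pwd)":/workspace dailytalk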

Inference

First, download our dataset. Then download the pretrained models and put them in output/ckpt/DailyTalk/. Also, unzip generator_LJSpeech.pth.tar or generator_universal.pth.tar in the hifigan/ folder. The models are trained with unsupervised duration modeling under the transformer building block, one for each history encoding type.

Only batch inference is supported, since generating a turn may require the contextual history of the conversation. Try

python3 synthesize.py --source preprocessed_data/DailyTalk/val_*.txt --restore_step RESTORE_STEP --mode batch --dataset DailyTalk

to synthesize all utterances in preprocessed_data/DailyTalk/val_*.txt.
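
If you prefer to drive this from Python, a thin wrapper over the same CLI might look like the sketch below; only the synthesize.py flags shown above are assumed, and the checkpoint step is a placeholder.

import glob
import subprocess

RESTORE_STEP = "300000"  # placeholder; use your actual checkpoint step

# Run batch synthesis once per validation split file,
# mirroring the documented synthesize.py invocation.
for source in sorted(glob.glob("preprocessed_data/DailyTalk/val_*.txt")):
    subprocess.run(
        [
            "python3", "synthesize.py",
            "--source", source,
            "--restore_step", RESTORE_STEP,
            "--mode", "batch",
            "--dataset", "DailyTalk",
        ],
        check=True,
    )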

Training

Preprocessing

  • For multi-speaker TTS with an external speaker embedder, download the ResCNN Softmax+Triplet pretrained model of philipperemy's DeepSpeaker for the speaker embedding and place it in ./deepspeaker/pretrained_models/. Please note that our pretrained models are not trained with this (they are trained with speaker_embedder: "none").

  • Run

    python3 prepare_align.py --dataset DailyTalk
    

    for some preparations.

    For the forced alignment, Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences. Pre-extracted alignments for the dataset are provided here. You have to unzip the files into preprocessed_data/DailyTalk/TextGrid/. Alternatively, you can run the aligner yourself; a sketch of the command follows this list. Please note that our pretrained models are not trained with supervised duration modeling (they are trained with learn_alignment: True).

    After that, run the preprocessing script by

    python3 preprocess.py --dataset DailyTalk
    

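If you run the aligner yourself (as mentioned above), the alignment command typically has the following shape; this is a sketch assuming MFA 2.x, and the lexicon and acoustic model names are placeholders, not taken from this repository.

mfa align raw_data/DailyTalk lexicon.txt english_us_arpa preprocessed_data/DailyTalk/TextGrid
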
Training

Train your model with

python3 train.py --dataset DailyTalk

Useful options:

  • To use Automatic Mixed Precision (AMP), append the --use_amp argument to the above command.
  • The trainer assumes single-node multi-GPU training. To use specific GPUs, specify CUDA_VISIBLE_DEVICES=<GPU_IDs> at the beginning of the above command.
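
For example, to train on GPUs 0 and 1 with mixed precision enabled:

CUDA_VISIBLE_DEVICES=0,1 python3 train.py --dataset DailyTalk --use_amp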

TensorBoard

Use

tensorboard --logdir output/log

to serve TensorBoard on your localhost. Loss curves, synthesized mel-spectrograms, and audio samples are shown.

Notes

  • Convolutional embedding is used, as in StyleSpeech, for the phoneme-level variance in unsupervised duration modeling. Otherwise, bucket-based embedding is used, as in FastSpeech2.
  • Unsupervised duration modeling at the phoneme level takes longer than at the frame level, since the additional computation of the phoneme-level variance is activated at runtime.
  • There are two embedding options for the multi-speaker TTS setting: training a speaker embedder from scratch or using philipperemy's pretrained DeepSpeaker model (as STYLER did). You can toggle between them in the config (between 'none' and 'DeepSpeaker'); see the snippet after this list.
  • HiFi-GAN is used as the vocoder for all experiments in our paper.
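
For example, the toggle mirrors the history_encoder snippet above; this is a sketch, and the exact file and section holding speaker_embedder in this repository's configs may differ.

# In the config
speaker_embedder: "DeepSpeaker" # ["none", "DeepSpeaker"]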

Citation

If you would like to use our dataset and code or refer to our paper, please cite as follows.

@misc{lee2022dailytalk,
    title={DailyTalk: Spoken Dialogue Dataset for Conversational Text-to-Speech},
    author={Keon Lee and Kyumin Park and Daeyoung Kim},
    year={2022},
    eprint={2207.01063},
    archivePrefix={arXiv},
    primaryClass={eess.AS}
}

License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
