DailyTalk: Spoken Dialogue Dataset for Conversational Text-to-Speech
Keon Lee*, Kyumin Park*, Daeyoung Kim
In our paper, we introduce DailyTalk, a high-quality conversational speech dataset designed for Text-to-Speech.
Abstract: The majority of current Text-to-Speech (TTS) datasets, which are collections of individual utterances, contain few conversational aspects. In this paper, we introduce DailyTalk, a high-quality conversational speech dataset designed for conversational TTS. We sampled, modified, and recorded 2,541 dialogues from the open-domain dialogue dataset DailyDialog inheriting its annotated attributes. On top of our dataset, we extend prior work as our baseline, where a non-autoregressive TTS is conditioned on historical information in a dialogue. From the baseline experiment with both general and our novel metrics, we show that DailyTalk can be used as a general TTS dataset, and more than that, our baseline can represent contextual information from DailyTalk. The DailyTalk dataset and baseline code are freely available for academic use with CC-BY-SA 4.0 license.
Dataset
You can download our dataset. Please refer to the Statistic Details section for more information.
Pretrained Models
You can download our pretrained models. There are two different directories: 'history_none' and 'history_guo'. The former has no history encodings and is therefore not a conversational context-aware model; the latter uses history encodings following Conversational End-to-End TTS for Voice Agent (Guo et al., 2020).
Toggle the type of history encodings by
# In the model.yaml
history_encoder:
  type: "Guo" # ["none", "Guo"]
Quickstart
Dependencies
You can install the Python dependencies with
pip3 install -r requirements.txt
Also, a Dockerfile is provided for Docker users.
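For Docker users, a minimal sketch of building and running the image is shown below; the image tag dailytalk, the mount point /workspace, and the working directory are illustrative assumptions, not names fixed by the repository.
# Build the image from the provided Dockerfile (the tag name is arbitrary)
docker build -t dailytalk .
# Run with GPU access and the repository mounted into the container
docker run --gpus all -it -v "$(pwd)":/workspace -w /workspace dailytalk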
Inference
You have to download both our dataset and the pretrained models. Put the pretrained models in output/ckpt/DailyTalk/, and unzip generator_LJSpeech.pth.tar or generator_universal.pth.tar in the hifigan folder. The models were trained with unsupervised duration modeling on the transformer building block, one per history encoding type.
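As a rough sketch of the expected layout (the source paths and the zip archive name below are placeholders, assuming the vocoder weights are distributed as zip archives as in the parent FastSpeech2 codebase):
# Put the downloaded DailyTalk checkpoint under output/ckpt/DailyTalk/
mkdir -p output/ckpt/DailyTalk
cp /path/to/downloaded_checkpoint.pth.tar output/ckpt/DailyTalk/
# Unzip the HiFi-GAN generator weights into the hifigan/ folder
unzip /path/to/generator_LJSpeech.pth.tar.zip -d hifigan/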
Only batch inference is supported, since generating a turn may require the contextual history of the conversation. Try
python3 synthesize.py --source preprocessed_data/DailyTalk/val_*.txt --restore_step RESTORE_STEP --mode batch --dataset DailyTalk
to synthesize all utterances in preprocessed_data/DailyTalk/val_*.txt.
Training
Preprocessing
- For multi-speaker TTS with an external speaker embedder, download the ResCNN Softmax+Triplet pretrained model of philipperemy's DeepSpeaker for the speaker embedding and place it in ./deepspeaker/pretrained_models/. Please note that our pretrained models are not trained with this (they are trained with speaker_embedder: "none").
- Run
python3 prepare_align.py --dataset DailyTalk
for some preparations.
For the forced alignment, Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences. Pre-extracted alignments for the dataset are provided here. You have to unzip the files into
preprocessed_data/DailyTalk/TextGrid/
. Alternatively, you can run the aligner yourself. Please note that our pretrained models are not trained with supervised duration modeling (they are trained with learn_alignment: True).
After that, run the preprocessing script by
python3 preprocess.py --dataset DailyTalk
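Putting the preprocessing steps together, a minimal end-to-end sketch might look like the following; the name of the downloaded TextGrid archive is a placeholder.
# 1. Prepare texts and audio for alignment and preprocessing
python3 prepare_align.py --dataset DailyTalk
# 2. Place the pre-extracted alignments (or your own MFA output)
mkdir -p preprocessed_data/DailyTalk/TextGrid
unzip /path/to/DailyTalk_TextGrid.zip -d preprocessed_data/DailyTalk/TextGrid/
# 3. Run the preprocessing script
python3 preprocess.py --dataset DailyTalk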
Training
Train your model with
python3 train.py --dataset DailyTalk
Useful options:
- To use Automatic Mixed Precision, append the
--use_amp
argument to the above command.
- The trainer assumes single-node multi-GPU training. To use specific GPUs, specify
CUDA_VISIBLE_DEVICES=<GPU_IDs>
at the beginning of the above command. A combined example is shown after this list.
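For example, a run that pins the job to two GPUs and enables mixed precision would combine both options (the GPU indices here are placeholders):
CUDA_VISIBLE_DEVICES=0,1 python3 train.py --dataset DailyTalk --use_amp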
TensorBoard
Use
tensorboard --logdir output/log
to serve TensorBoard on your localhost. The loss curves, synthesized mel-spectrograms, and audio samples are shown.
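If you train on a remote machine, one option (a sketch; the port number is arbitrary) is to bind TensorBoard to all interfaces, or alternatively forward the port over SSH:
tensorboard --logdir output/log --host 0.0.0.0 --port 6006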
Notes
- Convolutional embedding is used, as in StyleSpeech, for the phoneme-level variance in unsupervised duration modeling. Otherwise, bucket-based embedding is used, as in FastSpeech2.
- Unsupervised duration modeling at the phoneme level takes longer than at the frame level, since the additional computation of the phoneme-level variance is activated at runtime.
- There are two options for the speaker embedding in the multi-speaker TTS setting: training a speaker embedder from scratch or using philipperemy's pre-trained DeepSpeaker model (as STYLER did). You can toggle between them by setting the config (between 'none' and 'DeepSpeaker'); see the config sketch after this list.
- For the vocoder, HiFi-GAN is used for all experiments in our paper.
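The configuration keys mentioned in this README live in the model config. A sketch of the relevant entries is below; the exact nesting in model.yaml may differ, and this only illustrates the options discussed above.
# In the model.yaml
learn_alignment: True     # unsupervised duration modeling (used for our pretrained models)
speaker_embedder: "none"  # ["none", "DeepSpeaker"]
history_encoder:
  type: "Guo"             # ["none", "Guo"]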
Citation
If you would like to use our dataset and code or refer to our paper, please cite as follows.
@misc{lee2022dailytalk,
title={DailyTalk: Spoken Dialogue Dataset for Conversational Text-to-Speech},
author={Keon Lee and Kyumin Park and Daeyoung Kim},
year={2022},
eprint={2207.01063},
archivePrefix={arXiv},
primaryClass={eess.AS}
}
License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.