  • Language
    Python
  • License
    Apache License 2.0



Scripts for computing the Intelligibility and CLVP scores for evaluating TTS models

TTS Scores - Better evaluation metrics for text to speech models

TTS quality is difficult to measure. Distance-based metrics are poor measurements because they only measure similarity to the test set, not the realism of the generated speech. For this reason, most TTS papers rely on Mean Opinion Scores (MOS) to report model quality. Computing MOS involves humans in the loop, meaning it is costly and time consuming. More importantly, it cannot be used during training to evaluate a model's performance in real time.

The field of image generation has settled on the Fréchet Inception Distance (FID) and Inception Score (IS) metrics to measure live performance. They are quite successful. I think we should take a page out of their book, but we can modernize this a little.

Installation

tts-scores is available on PyPI:

pip install tts-scores

Contrastive Language-Voice Pretrained model (CLVP)

To this end, I trained a CLIP-like architecture with a twist: instead of measuring the similarity of text and images, it measures the similarity of text and voice clips. I call this model CLVP. I believe such a model is an exceptional candidate for synthesizing a quality metric for Text->Voice models, much in the way that the Inception model is used for FID and IS scores.

This repo contains the source code for CLVP and scripts that allow you to use it. I have built two metrics:

CLVP Score

The CLVP score measures the distance predicted by CLVP between text and an audio clip where that text is spoken. A lower score is better. It can be obtained by:

from tts_scores.clvp import CLVPMetric
cv_metric = CLVPMetric(device='cuda')
score = cv_metric.compute_clvp('<path_to_your_tsv>', 'D:\\tmp\\tortoise-tts-eval\\real')

Note: the format of the TSV file is described in the "TSV format" section below.
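
To give some intuition for what a "distance between text and audio" can look like in a CLIP-like model, here is a minimal sketch. It assumes the text and the audio clip have each been projected into a shared embedding space and compares them with cosine distance. The function name and the plain-list embeddings are hypothetical, and the actual scoring inside CLVPMetric may differ.

```python
import math


def cosine_distance(text_emb, audio_emb):
    """Return 1 - cosine similarity of two equal-length embedding vectors.

    Illustrative only: this is an assumed stand-in for a CLIP-style
    text/audio distance, not the scoring used by tts-scores itself.
    """
    if len(text_emb) != len(audio_emb):
        raise ValueError("embeddings must have the same length")
    dot = sum(t * a for t, a in zip(text_emb, audio_emb))
    norm_t = math.sqrt(sum(t * t for t in text_emb))
    norm_a = math.sqrt(sum(a * a for a in audio_emb))
    if norm_t == 0 or norm_a == 0:
        raise ValueError("embeddings must be non-zero")
    return 1.0 - dot / (norm_t * norm_a)
```

Under this sketch, embeddings that point in the same direction get a distance near 0 and unrelated ones a distance near 1, which matches the "lower is better" convention above.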

CLVP Frechet Distance

Similar to FID, this metric compares the distribution of real spoken text with whatever your TTS model generates. It is particularly useful if you have a bunch of spoken text that you want to compare against but do not have the transcriptions for that text. For example, this is a good fit for measuring the performance of vocoders.

It works by computing the Fréchet distance between the outputs of the last layer of the CLVP model when fed data from both distributions. Similar to FID, a lower score is better. It can be obtained by:

from tts_scores.clvp import CLVPMetric
cv_metric = CLVPMetric(device='cuda')
score = cv_metric.compute_fd('<path_to_your_generated_audio>', '<path_to_your_real_audio>')
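
For readers unfamiliar with the Fréchet distance, the following is a minimal NumPy sketch of the squared form used by FID, computed between two sets of feature vectors under the usual assumption that each set is Gaussian. The function name and the input arrays are hypothetical. It illustrates the formula only and is not necessarily how compute_fd is implemented.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Squared Frechet distance between Gaussians fitted to two feature sets.

    feats_a, feats_b: arrays of shape (num_samples, feature_dim).
    Formula: |mu_a - mu_b|^2 + Tr(C_a + C_b - 2 (C_a C_b)^(1/2)).
    """
    feats_a = np.asarray(feats_a, dtype=np.float64)
    feats_b = np.asarray(feats_b, dtype=np.float64)
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(feats_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(feats_b, rowvar=False))
    # Tr((C_a C_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of C_a @ C_b, which are real and non-negative in theory.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)
```

Two identical feature sets give a distance of approximately zero, and shifting one set by a constant vector increases the distance by the squared length of that shift.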

TSV format

The TSV input is a tab-separated values file. Each line must contain a transcript, followed by a tab, followed by a filename. More tab-separated values may follow, but only the first two are used:

<transcript1><|tab|><filename1><|tab|>....
<transcript2><|tab|><filename2><|tab|>....
...
<transcriptN><|tab|><filenameN><|tab|>....
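
To sanity-check a TSV file before scoring, a small reader like the one below can help. This is a sketch using only the Python standard library. The function name read_tsv_pairs is hypothetical and is not part of tts-scores, and the choice to skip empty lines and reject lines with fewer than two fields is an assumption.

```python
import csv


def read_tsv_pairs(tsv_path):
    """Return a list of (transcript, filename) tuples from a TSV file.

    Only the first two tab-separated fields of each line are kept.
    """
    pairs = []
    with open(tsv_path, newline="", encoding="utf-8") as f:
        # QUOTE_NONE keeps quotation marks in transcripts as literal text.
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for line_no, row in enumerate(reader, start=1):
            if not row:
                continue  # skip completely empty lines
            if len(row) < 2:
                raise ValueError(f"line {line_no}: expected transcript<TAB>filename")
            pairs.append((row[0], row[1]))
    return pairs
```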

wav2vec2 Intelligibility Score

One rather obvious way to measure the performance of a TTS system, which I have not seen used before, is to leverage an ASR system. If the goal is to produce intelligible speech, why not use a speech recognition system to measure that intelligibility?

The intelligibility score packaged in this repo does exactly that. It takes in a list of generated and real audio files and their transcriptions, and feeds everything through a pre-trained wav2vec2 model. The raw losses are returned. The score is the difference between the wav2vec2 losses for the fake/generated samples and the real samples.

While CLVP scores take things like voice quality, voice diversity and prosody into account, the intelligibility score only considers whether or not the speech your TTS model generates maps coherently to the text you put into it. For some use cases, this will be the most important score. For others, all of the scores are important.

from tts_scores.intelligibility import IntelligibilityMetric
is_metric = IntelligibilityMetric(device='cuda')
score = is_metric.compute_intelligibility('<path_to_your_tsv>', '<path_to_your_real_audio>')
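
The arithmetic behind "the difference between the losses" can be sketched as follows. This assumes the per-clip wav2vec2 losses are averaged within each group before subtracting, which the description above does not specify. The function name and the example loss values are hypothetical.

```python
from statistics import fmean


def intelligibility_gap(generated_losses, real_losses):
    """Mean ASR loss on generated clips minus mean ASR loss on real clips.

    Averaging per group is an assumption made for illustration; a value
    near zero means the generated speech is about as easy for the ASR
    model to transcribe as the real recordings.
    """
    return fmean(generated_losses) - fmean(real_losses)


# Made-up loss values, for illustration only.
gap = intelligibility_gap([2.0, 3.0], [1.0, 2.0])
```

Subtracting the loss on real samples removes the baseline error of the ASR model itself, so the result reflects how much harder the generated speech is to recognize.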

Scores from common models

A metric is only good if there are benchmarks which can be used as points of comparison. To this end, I computed all of the scores in this repo on two high-performance TTS models:

  1. Tacotron2+waveglow from NVIDIA's repo
  2. FastSpeech2+hifigan from ming024's repo

See the scores below:

Citations

Please cite this repo if you use it in your repo:

@software{TTS-scores,
  author = {Betker, James},
  month = {4},
  title = {{TTS-scores}},
  url = {https://github.com/neonbjb/tts-scores},
  version = {1.0.0},
  year = {2022}
}
