• Stars
    star
    349
  • Rank 121,528 (Top 3 %)
  • Language
    Python
  • License
    MIT License
  • Created over 6 years ago
  • Updated about 3 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Tensorflow implementation of "Generalized End-to-End Loss for Speaker Verification"

Speaker_Verification

Tensorflow implementation of Generalized End-to-End Loss for Speaker Verification (Kaggle, paperswithcode). This paper is based on the previous work End-to-End Text-Dependent Speaker Verification.

Speaker Verification

  • Speaker verification does 1-1 check between the enrolled voice and the new voice. This task requires to achieve the higher accuracy than speaker identification which does N-1 check between the N enrolled voices and a new voice.
  • There are two types of speaker verification: 1) Text dependent speaker verification (TD-SV). 2) Text independent speaker verification (TI-SV). The former uses the text-specific utterances for enrollment and verification, whereas the latter uses text-independent utterances.
  • At each forward step of the method, the utterance similarity matrix is calculated and the integrated loss is used for the objective function. (see Section 2.1 of the paper)

Files

  • configuration.py : Argument parsing
  • data_preprocess.py : Extracts noise and performs STFT on raw audio. For each raw audio, the voice activity detection is performed via librosa library.
  • utils.py : Contains various util functions for training and test.
  • model.py : Contains train and test functions.
  • main.py : After the dataset is prepared, run
python main.py --train True --model_path where_you_want_to_save                 # training
python main.py --train False --model_path model_path used at training phase     # test

Data

  • Note, The authors of the paper used their own private dataset, and I could not obtain it.
  • In this implementation, I used VTCK public dataset, CSTR VCTK Corpus and noise added VTCK dataset (from "Noisy speech database for training speech enhancement algorithms and TTS models").
  • The VCTK dataset includes speech data uttered by 109 native English speakers with various accents.
  • For TD-SV, I used the first audio file of each speaker, which is speaking "Call Stella". For the each training and test data, I added random noise extracted from the noise added VTCK dataset.
  • For TD-SI, I used randomly selected utterances from each speaker. The blanks of raw audio files are trimmed, and then slicing is performed.

Results

I trained the model with my notebook CPU. The model hyperpameters are following the paper:

  • 3 LSTM layers with 128 hidden nodes, 64 projection nodes (Total 210434 variables)
  • 0.01 lr sgd with 0.5 decay
  • l2 norm clipping by 3

To finish training and test in time, I used smaller batch (4 speakers x 5 utterances) than the paper. I used the first 85% of the dataset as training set and used the remained parts as the testset. In the below, I used softmax loss (however, the contrastive loss is also implemented in this code). On my environment, it takes less than 1s for calculating 40 utterances embedding.

  1. TD-SV
    For each utterance, random noise is added at each forward step. I tested a model after 60000 iteration. As a result, Equal Error Rate (EER) is 0, and we can see the model performs well with a small population.

The figure below contains a similarity matrix and its EER, FAR, and FRR. Here, each matrix corresponds to each speaker. If we call the first matrix as A (5x4), then A[i,j] means the cosine similarity between the first speaker's i^th vertification utterance and the j^th speaker's enrollment.

  1. TI-SV
    Randomly selected utterances are used. I tested the model after 60000 iteration. Here, Equal Error Rate (EER) is 0.09.

LICENSE

MIT License