whisper-timestamped

Multilingual Automatic Speech Recognition with word-level timestamps and confidence.

Description

Whisper is a set of robust multilingual speech recognition models, trained by OpenAI, that achieve state-of-the-art results in many languages. Whisper models were trained to predict approximate timestamps on speech segments (most of the time with 1-second accuracy), but they cannot natively predict word timestamps. This repository proposes an implementation that predicts word timestamps and gives a more accurate estimate of speech segment boundaries when transcribing with Whisper models. In addition, a confidence score is assigned to each word and each segment (both computed as "exp(mean(log probas))" on the probabilities of subword tokens).
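
The confidence formula quoted above is the geometric mean of the subword token probabilities. As a minimal sketch (the function name and the probabilities are made up for illustration; this is not the code used in the repository):

```python
import math

def confidence(token_probs):
    """Geometric mean of token probabilities: exp(mean(log p))."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    return math.exp(sum(math.log(p) for p in token_probs) / len(token_probs))

# Hypothetical probabilities for a word split into two subword tokens.
word_confidence = confidence([0.9, 0.4])
print(round(word_confidence, 3))
```

Compared with an arithmetic mean, the geometric mean is pulled down more strongly by a single low-probability token, so one uncertain subword is enough to lower the score of the whole word.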

The approach is based on Dynamic Time Warping (DTW) applied to cross-attention weights, as done in this notebook by Jong Wook Kim. This implementation adds the following to that notebook:

  • The start/end estimation is more accurate.
  • Confidence scores are assigned to each word.
  • When possible (that is, without beam search...), no additional inference steps are required to predict word timestamps (word alignment is done on the fly, after each speech segment is decoded).
  • Special care is taken with memory usage: whisper-timestamped can process long files with little additional memory compared to regular use of the Whisper model.
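
To give an idea of what DTW does on an attention-like matrix, here is a textbook sketch in plain Python. It is not the repository's implementation, and the toy matrix is invented: rows stand for tokens, columns for audio frames, and the cost is taken as 1 minus a hypothetical attention weight.

```python
def dtw_path(cost):
    """Return the minimum-cost monotonic path through a cost matrix.

    cost[i][j] is the cost of aligning token i with audio frame j.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    # acc[i][j] = minimal accumulated cost to reach cell (i, j).
    acc = [[inf] * m for _ in range(n)]
    acc[0][0] = cost[0][0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best_prev = min(
                acc[i - 1][j] if i > 0 else inf,
                acc[i][j - 1] if j > 0 else inf,
                acc[i - 1][j - 1] if i > 0 and j > 0 else inf,
            )
            acc[i][j] = cost[i][j] + best_prev
    # Backtrack from the last cell to the first one.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((acc[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((acc[i - 1][j], i - 1, j))
        if j > 0:
            candidates.append((acc[i][j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    return path[::-1]

# Hypothetical attention weights: token 0 attends to frames 0-1, token 1 to frames 2-3.
attention = [[0.9, 0.8, 0.1, 0.1],
             [0.1, 0.2, 0.9, 0.9]]
cost = [[1.0 - a for a in row] for row in attention]
path = dtw_path(cost)
print(path)
```

In this toy case the path switches from token 0 to token 1 between frames 1 and 2, which is where a word boundary would be placed.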

whisper-timestamped is an extension of the openai-whisper Python package and is meant to be compatible with any version of openai-whisper.

Notes on other approaches

A relevant alternative approach to recovering word-level timestamps is to use wav2vec models that predict characters, as successfully implemented in whisperX. But these approaches have several drawbacks that approaches based on cross-attention weights, such as whisper_timestamped, do not have. These drawbacks are:

  • The need to find one wav2vec model per language to support, which scales poorly to the multilingual capabilities of Whisper.
  • The need to handle (at least) one additional neural network (the wav2vec model), which consumes memory.
  • The need to normalize characters in the Whisper transcription to match the character set of the wav2vec model. This involves awkward language-dependent conversions, like converting numbers to words ("2" -> "two"), symbols to words ("%" -> "percent", "€" -> "euro(s)")...
  • The lack of robustness around speech disfluencies (fillers, hesitations, repeated words...) that are usually removed by Whisper.
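
The normalization drawback listed above can be illustrated with a small sketch. The lookup tables and the function name are hypothetical and English-only; they are not taken from whisperX or from any wav2vec tooling.

```python
import re

# Hypothetical, English-only lookup tables. A real system would need one
# set of rules per language, which is the scaling problem described above.
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}
SYMBOL_WORDS = {"%": "percent", "€": "euros"}

def normalize_for_character_model(text):
    """Spell out digits and symbols, then keep only letters, apostrophes and spaces."""
    for symbol, word in SYMBOL_WORDS.items():
        text = text.replace(symbol, f" {word} ")
    # Digit-by-digit spelling: "12" becomes "one two", which is already wrong
    # for most uses and shows how quickly such rules get complicated.
    text = re.sub(r"\d", lambda m: f" {DIGIT_WORDS[m.group()]} ", text)
    text = re.sub(r"[^a-z' ]", " ", text.lower())
    return " ".join(text.split())

normalized = normalize_for_character_model("I paid 2 € for 3% more.")
print(normalized)
```

Even this tiny example has to choose between "euro" and "euros" and cannot read multi-digit numbers properly, which hints at why such conversions are awkward to maintain across languages.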

Another alternative, which does not require an additional model, is to look at the probabilities of timestamp tokens estimated by the Whisper model after each (sub)word token is predicted. It was implemented, for instance, in whisper.cpp and stable-ts. But this approach lacks robustness, because Whisper models have not been trained to output meaningful timestamps after each word. Whisper models tend to predict timestamps only after a certain number of words have been predicted (typically at the end of a sentence), and the probability distribution of timestamps outside this condition may be inaccurate. In practice, these methods can produce results that are completely out of sync over some periods of time (we observed this especially when there is jingle music). Also, the timestamp precision of Whisper models tends to be rounded to 1 second (as in many video subtitles), which is too coarse for words, and reaching better accuracy is tricky.

Installation

First installation

Requirements:

  • python3 (version 3.7 or higher; at least 3.9 is recommended)
  • ffmpeg (see the installation instructions in the whisper repository)

You can install whisper-timestamped either by using pip:

pip3 install git+https://github.com/linto-ai/whisper-timestamped

or by cloning this repository and running installation:

git clone https://github.com/linto-ai/whisper-timestamped
cd whisper-timestamped/
python3 setup.py install

Additional packages that might be needed

If you want to plot the alignment between audio timestamps and words (as in the "Plot of word alignment" section), you also need matplotlib:

pip3 install matplotlib

If you want to use the VAD option (Voice Activity Detection before running the Whisper model), you also need torchaudio and onnxruntime:

pip3 install onnxruntime torchaudio

If you want to use finetuned Whisper models from the Hugging Face Hub, you also need transformers

pip3 install transformers

Docker

A docker image of about 9GB can be built using:

git clone https://github.com/linto-ai/whisper-timestamped
cd whisper-timestamped/
docker build -t whisper_timestamped:latest .

Light installation for CPU

If you don't have a GPU (or don't want to use it), you don't need to install the CUDA dependencies. In that case, install a light version of torch before installing whisper-timestamped, for instance as follows:

pip3 install \
     torch==1.13.1+cpu \
     torchaudio==0.13.1+cpu \
     -f https://download.pytorch.org/whl/torch_stable.html

A specific docker image of about 3.5GB can also be built using:

git clone https://github.com/linto-ai/whisper-timestamped
cd whisper-timestamped/
docker build -t whisper_timestamped_cpu:latest -f Dockerfile.cpu .

Upgrade to the latest version

When using pip, the library can be updated to the latest version using

pip3 install --upgrade --no-deps --force-reinstall git+https://github.com/linto-ai/whisper-timestamped

A specific version of openai-whisper can be used by running, for example:

pip3 install openai-whisper==20230124

Usage

Python

In Python, you can use the function whisper_timestamped.transcribe(), which is similar to the function whisper.transcribe():

import whisper_timestamped
help(whisper_timestamped.transcribe)

The main difference with whisper.transcribe() is that the output includes a key "words" for every segment, with the start and end position of each word. Note that words include their punctuation. See the example below.

In addition, the default decoding options are different, in order to favour efficient decoding (greedy decoding instead of beam search, and no temperature sampling fallback). To get the same defaults as in whisper, use beam_size=5, best_of=5, temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0).

There are also additional options related to word alignment.

In general, importing whisper_timestamped instead of whisper in your Python script should do the job, provided you call transcribe(model, ...) instead of model.transcribe(...):

import whisper_timestamped as whisper

audio = whisper.load_audio("AUDIO.wav")

model = whisper.load_model("tiny", device="cpu")

result = whisper.transcribe(model, audio, language="fr")

import json
print(json.dumps(result, indent = 2, ensure_ascii = False))
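
Once the result is available, the word entries can be walked with a small helper. The sketch below assumes only the keys visible in this README's example output ("segments", "words", "text", "start", "end", "confidence"); the helper name and the stand-in dictionary are illustrative.

```python
def iter_words(result):
    """Yield (text, start, end, confidence) for every word of every segment."""
    for segment in result["segments"]:
        for word in segment["words"]:
            # "confidence" may be missing if its computation was disabled.
            yield word["text"], word["start"], word["end"], word.get("confidence")

# Trimmed-down stand-in for the dictionary returned by transcribe().
sample_result = {
    "segments": [
        {"words": [{"text": "Bonjour!", "start": 0.5, "end": 1.2, "confidence": 0.51}]},
    ]
}

for text, start, end, conf in iter_words(sample_result):
    print(f"{start:6.2f} -> {end:6.2f}  {text}  ({conf})")
```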

Note that you can use a fine-tuned Whisper model from Hugging Face or from a local folder, by using the load_model method of whisper_timestamped. For instance, to use https://huggingface.co/NbAiLab/whisper-large-v2-nob, you can simply do:

import whisper_timestamped as whisper

model = whisper.load_model("NbAiLab/whisper-large-v2-nob", device="cpu")

# ...

Command line

You can also use whisper_timestamped on the command line, similarly to whisper. See help with:

whisper_timestamped --help

The main differences with whisper CLI are:

  • Output files:
    • The output JSON contains word timestamps and confidence scores. See example below.
    • There is an additional CSV output format
    • For SRT, VTT, TSV formats, there will be additional files saved with word timestamps
  • Some default options are different:
    • By default, no output folder is set: use --output_dir . to get the Whisper default.
    • By default, verbose mode is off: use --verbose True to get the Whisper default.
    • By default, beam search decoding and temperature sampling fallback are disabled, to favour efficient decoding. To use the same settings as the Whisper default, pass --accurate (an alias for --beam_size 5 --temperature_increment_on_fallback 0.2 --best_of 5).
  • There are some additional specific options:
    • --compute_confidence to enable/disable the computation of confidence scores for each word.
    • --punctuations_with_words to decide whether punctuation marks should be included or not with preceding words.

An example command line to process several files with the tiny model and write the results to the current folder, as whisper would do by default:

whisper_timestamped audio1.flac audio2.mp3 audio3.wav --model tiny --output_dir .

Note that you can use a fine-tuned Whisper model from Hugging Face or from a local folder. For instance, to use https://huggingface.co/NbAiLab/whisper-large-v2-nob, you can simply do:

whisper_timestamped --model NbAiLab/whisper-large-v2-nob <...>

Plot of word alignment

Note that you can use the plot_word_alignment option of the Python function whisper_timestamped.transcribe(), or the --plot option of the whisper_timestamped CLI, to see the word alignment for each segment.

Example alignment

  • The upper plot represents the transformation of cross-attention weights that is used for the alignment with Dynamic Time Warping. The abscissa represents time and the ordinate represents the predicted tokens, with special timestamp tokens at the beginning and at the end, and (sub)words and punctuation marks in between.
  • The lower plot is a MFCC representation of the input signal (features used by Whisper, based on Mel-frequency cepstrum).
  • The vertical dotted red lines show where the word boundaries are found (with punctuation marks "glued" with the previous word).

Example output

Here is an example output of whisper_timestamped.transcribe(), which can be obtained with the CLI:

whisper_timestamped AUDIO_FILE.wav --model tiny --language fr
{
  "text": " Bonjour! Est-ce que vous allez bien?",
  "segments": [
    {
      "id": 0,
      "seek": 0,
      "start": 0.5,
      "end": 1.2,
      "text": " Bonjour!",
      "tokens": [ 25431, 2298 ],
      "temperature": 0.0,
      "avg_logprob": -0.6674491882324218,
      "compression_ratio": 0.8181818181818182,
      "no_speech_prob": 0.10241222381591797,
      "confidence": 0.51,
      "words": [
        {
          "text": "Bonjour!",
          "start": 0.5,
          "end": 1.2,
          "confidence": 0.51
        }
      ]
    },
    {
      "id": 1,
      "seek": 200,
      "start": 2.02,
      "end": 4.48,
      "text": " Est-ce que vous allez bien?",
      "tokens": [ 50364, 4410, 12, 384, 631, 2630, 18146, 3610, 2506, 50464 ],
      "temperature": 0.0,
      "avg_logprob": -0.43492694334550336,
      "compression_ratio": 0.7714285714285715,
      "no_speech_prob": 0.06502953916788101,
      "confidence": 0.595,
      "words": [
        {
          "text": "Est-ce",
          "start": 2.02,
          "end": 3.78,
          "confidence": 0.441
        },
        {
          "text": "que",
          "start": 3.78,
          "end": 3.84,
          "confidence": 0.948
        },
        {
          "text": "vous",
          "start": 3.84,
          "end": 4.0,
          "confidence": 0.935
        },
        {
          "text": "allez",
          "start": 4.0,
          "end": 4.14,
          "confidence": 0.347
        },
        {
          "text": "bien?",
          "start": 4.14,
          "end": 4.48,
          "confidence": 0.998
        }
      ]
    }
  ],
  "language": "fr"
}
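
One possible use of the per-word confidence is to flag words that may deserve a manual check. The sketch below works on the word entries of the example output above (other keys omitted); the helper name and the 0.5 threshold are arbitrary choices for illustration, not recommendations from the library.

```python
def low_confidence_words(result, threshold=0.5):
    """Return (start, end, text, confidence) for words scoring below threshold."""
    flagged = []
    for segment in result["segments"]:
        for word in segment["words"]:
            if word["confidence"] < threshold:
                flagged.append((word["start"], word["end"], word["text"], word["confidence"]))
    return flagged

# Word entries copied from the example output above.
example = {
    "segments": [
        {"words": [
            {"text": "Bonjour!", "start": 0.5, "end": 1.2, "confidence": 0.51},
        ]},
        {"words": [
            {"text": "Est-ce", "start": 2.02, "end": 3.78, "confidence": 0.441},
            {"text": "que", "start": 3.78, "end": 3.84, "confidence": 0.948},
            {"text": "vous", "start": 3.84, "end": 4.0, "confidence": 0.935},
            {"text": "allez", "start": 4.0, "end": 4.14, "confidence": 0.347},
            {"text": "bien?", "start": 4.14, "end": 4.48, "confidence": 0.998},
        ]},
    ]
}

flagged = low_confidence_words(example)
for start, end, text, conf in flagged:
    print(f"{start:.2f}-{end:.2f}  {text}  {conf}")
```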

Options that may improve results

Here are some options that are not enabled by default but might improve results.

Accurate Whisper transcription

As mentioned before, some decoding options are disabled by default for better efficiency. However, this can affect the quality of the transcription. To run with the options that have the best chance of producing a good transcription, use the following options.

  • In python:
results = whisper_timestamped.transcribe(model, audio, beam_size=5, best_of=5, temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0), ...)
  • In the command line:
whisper_timestamped --accurate ...

Running Voice Activity Detection (VAD) before sending to Whisper

Whisper models can "hallucinate" text when a segment without speech is given. This can be avoided by running VAD and gluing speech segments together before transcribing with the Whisper model. This is possible in whisper-timestamped.

  • In python:
results = whisper_timestamped.transcribe(model, audio, vad=True, ...)
  • In the command line:
whisper_timestamped --vad True ...
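
Gluing speech segments together implies some bookkeeping: a time measured on the glued, speech-only audio has to be mapped back to the original recording. The sketch below illustrates that idea under the assumption that the kept speech regions are known as sorted (start, end) pairs; the function name and the numbers are hypothetical, and this is not the library's internal code.

```python
def to_original_time(t, speech_segments):
    """Map a time t on the glued (speech-only) audio back to the original audio.

    speech_segments: sorted, non-overlapping (start, end) pairs, in seconds,
    of the speech regions that were kept and concatenated.
    """
    offset = 0.0  # duration of glued audio covered by previous segments
    for start, end in speech_segments:
        duration = end - start
        if t <= offset + duration:
            return start + (t - offset)
        offset += duration
    # Past the end of the glued audio: clamp to the end of the last segment.
    return speech_segments[-1][1]

# Hypothetical VAD result: speech from 1 s to 3 s and from 10 s to 12 s.
speech = [(1.0, 3.0), (10.0, 12.0)]
print(to_original_time(0.5, speech))  # falls in the first speech region
print(to_original_time(2.5, speech))  # falls in the second speech region
```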

Detecting disfluencies

Whisper models tend to remove speech disfluencies (filler words, hesitations, repetitions, ...). Without precautions, the disfluencies that are not transcribed will influence the timestamp of the word that follows: the start timestamp of that word will actually be the start of the disfluency. whisper-timestamped can apply some heuristics to avoid that.

  • In python:
results = whisper_timestamped.transcribe(model, audio, detect_disfluencies=True, ...)
  • In the command line:
whisper_timestamped --detect_disfluencies True ...

Important: when using this option, possible disfluencies appear in the transcription as a special "[*]" word.
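
If the "[*]" markers are not wanted in downstream text, they can be filtered out of the word list while keeping their timing separately. A minimal sketch, assuming the marker appears as the "text" of a word entry; the helper name and the sample words are invented:

```python
DISFLUENCY_MARK = "[*]"

def split_disfluencies(words):
    """Separate regular words from "[*]" disfluency markers."""
    spoken = [w for w in words if w["text"] != DISFLUENCY_MARK]
    marks = [w for w in words if w["text"] == DISFLUENCY_MARK]
    return spoken, marks

# Hypothetical word list, shaped like the "words" entries of the output.
words = [
    {"text": "[*]", "start": 0.0, "end": 0.6},
    {"text": "Bonjour!", "start": 0.6, "end": 1.2},
]
spoken, marks = split_disfluencies(words)
print(" ".join(w["text"] for w in spoken))
```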

Acknowledgment

  • whisper: Whisper speech recognition (License MIT).
  • dtw-python: Dynamic Time Warping (License GPL v3).

Citations

If you use this in your research, just cite the repo,

@misc{lintoai2023whispertimestamped,
  title={whisper-timestamped},
  author={Louradour, J{\'e}r{\^o}me},
  journal={GitHub repository},
  year={2023},
  publisher={GitHub},
  howpublished = {\url{https://github.com/linto-ai/whisper-timestamped}}
}

as well as the OpenAI Whisper paper,

@article{radford2022robust,
  title={Robust speech recognition via large-scale weak supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal={arXiv preprint arXiv:2212.04356},
  year={2022}
}

and this paper for Dynamic Time Warping:

@article{JSSv031i07,
  title={Computing and Visualizing Dynamic Time Warping Alignments in R: The dtw Package},
  author={Giorgino, Toni},
  journal={Journal of Statistical Software},
  year={2009},
  volume={31},
  number={7},
  doi={10.18637/jss.v031.i07}
}
