Minimal extension of OpenAI's Whisper adding speaker diarization with special tokens

tinydiarize 🐥🗣️

  • Speaker diarization labels who said what in a transcript (e.g. Speaker A, Speaker B …). It is essential for conversation transcripts like meetings or podcasts.
  • tinydiarize aims to be a minimal, interpretable extension of OpenAI's Whisper models that adds speaker diarization with few extra dependencies (inspired by minGPT).
  • This uses a finetuned model that adds special tokens to mark speaker changes [1,2,3,4]. It can use both voice and semantic context to tell speakers apart, which is a unique benefit of this approach.
  • It needs only a tiny change to the inference code (<50 lines) and runs with minimal extra cost. This makes it easy to add to ports like whisper.cpp that run on consumer hardware like MacBooks and iPhones.
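To make the idea concrete, here is a minimal sketch of what post-processing a token-marked transcript could look like. The marker string below is an assumption for illustration (check the actual special token in the finetuned tokenizer); the point is simply that alternating speaker labels fall out of splitting on the marker:

```python
from string import ascii_uppercase

# Assumed speaker-change marker; the real finetuned model emits its own
# special token, so check the tokenizer before relying on this string.
SPEAKER_TURN = "<|speakerturn|>"

def label_turns(transcript):
    """Split a token-marked transcript on the speaker-change marker and
    assign alternating labels (local diarization only, no clustering)."""
    turns = [t.strip() for t in transcript.split(SPEAKER_TURN) if t.strip()]
    return [(f"Speaker {ascii_uppercase[i % 26]}", text)
            for i, text in enumerate(turns)]

example = ("Thanks for joining the call. <|speakerturn|> Happy to be here. "
           "<|speakerturn|> Let's start with the quarterly numbers.")
for speaker, text in label_turns(example):
    print(f"{speaker}: {text}")
```

Note that without a global clustering step, labels only mark turn changes; the same person speaking twice gets two different labels.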

Demo

(demo video: demo_video-trim.mp4)

You can try it out on other such clips from YouTube using this notebook (Open In Colab).

Quickstart

Install ffmpeg following the original repo, then run:

pip install -e .
whisper --model small.en-tdrz AUDIO 

The only change is the small.en-tdrz model instead of small.en. That's it! 🎉

What's included?

  • Finetuned checkpoint for the small.en-tdrz model (located here) and example inference code (relevant edits in [#4] [#11]). This has the same dependencies as the original whisper repo.
  • Tools for comparison and analysis (under /tdrz_dev):
    • A scoring tool to measure and compare accuracy on your own data in an easy-to-interpret way.
    • A reference script to run and compare various diarization pipelines.
    • A Jupyter notebook to compare and understand performance in detail.
  • Finetuning code will also be made available shortly.

We aim to provide a starting point enabling anyone (or even OpenAI themselves!) to improve performance and extend support (multilingual, speech translation etc.).

Performance

| metric             | small.en | small.en-tdrz |
|--------------------|----------|---------------|
| spk_turn_precision | -        | 97.7          |
| spk_turn_recall    | -        | 70.8          |
| wer_overall        | 11.0     | 10.3          |
| wer_speaker_switch | 15.0     | 15.5          |

On a (tiny) benchmark set of 3 earnings calls, tdrz achieves near-perfect speaker-turn precision at fairly decent recall, while WER stays close to the original model's. Not too shabby for a tiny finetuning setup, and <10% extra inference cost!
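The actual scoring in tdrz_dev is word-aligned (via fstalign), but the metrics themselves are easy to illustrate. Here is a simplified sketch where a predicted turn boundary counts as a hit if it falls within a tolerance of an unmatched reference boundary; all timestamps below are hypothetical:

```python
def turn_precision_recall(ref, hyp, tol=0.5):
    """Simplified speaker-turn precision/recall: a hypothesis boundary is
    a hit if it lies within `tol` seconds of an unmatched reference
    boundary. The real scoring in tdrz_dev is word-aligned instead."""
    matched = set()
    hits = 0
    for h in hyp:
        best = None
        for i, r in enumerate(ref):
            if i not in matched and abs(h - r) <= tol:
                if best is None or abs(h - r) < abs(h - ref[best]):
                    best = i
        if best is not None:
            matched.add(best)
            hits += 1
    precision = hits / len(hyp) if hyp else 0.0
    recall = hits / len(ref) if ref else 0.0
    return precision, recall

# Hypothetical boundary timestamps (seconds):
ref_turns = [3.2, 10.5, 18.0, 25.4]
hyp_turns = [3.4, 17.8, 25.1]
p, r = turn_precision_recall(ref_turns, hyp_turns)
print(f"precision={p:.2f} recall={r:.2f}")  # → precision=1.00 recall=0.75
```

This also shows why the table's numbers can diverge: the model rarely inserts a spurious turn (high precision) but misses some real ones (lower recall).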

Refer to tdrz_dev for details on performance analysis and comparisons.

More info

  • Whisper small.en checkpoints were finetuned on ~100hrs of AMI meetings using HuggingFace Transformers and Datasets.
  • With some tricks, this can be done relatively cheaply: just 30 minutes of training on a single GPU starts to produce decent results. Tiny indeed 😊.
  • We used helpful tools from pyannote (the OG open-source diarization toolkit) for finetuning data preparation, and also to analyze its performance.
  • We make use of the excellent open-source revdotcom/fstalign tool for scoring and analysis.
  • Stay tuned for details in an upcoming blog post! 📺

Gotchas

Note that this is still an early proof of concept, and there are a few things to be aware of:

  • Only the small.en English model has been finetuned.
  • Word error rate (WER) is close to the original models', although not yet extensively tested. Ad-hoc inspection does show some differences in timestamp behavior (longer segments) and deletion errors. See the notebook under tdrz_dev for details.
  • Given a pretty tiny finetuning setup, there's likely a lot of room for further accuracy improvements.
  • Only local diarization (segmentation into speaker turns) is handled so far. Extension with global diarization (speaker clustering) is planned for later.
  • Stuff is still hacky and subject to change, so hold your horses just yet! 🐎
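Since only local diarization is handled so far, here is a toy sketch of what the planned global step adds: grouping per-turn speaker embeddings into identities. Real pipelines use proper clustering (e.g. the NME-SC mentioned in the roadmap) over learned embeddings; everything below — the greedy threshold rule and the 2-D "embeddings" — is hypothetical and purely illustrative:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (given as lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def greedy_cluster(embeddings, threshold=0.9):
    """Assign each turn to the first existing cluster whose anchor (its
    first member) is similar enough, else open a new cluster."""
    anchors, labels = [], []
    for emb in embeddings:
        for k, anchor in enumerate(anchors):
            if cosine(emb, anchor) >= threshold:
                labels.append(k)
                break
        else:
            anchors.append(emb)
            labels.append(len(anchors) - 1)
    return labels

# Toy per-turn embeddings: turns 0 and 2 sound alike, turn 1 differs.
turns = [[1.0, 0.0], [0.0, 1.0], [0.95, 0.05]]
print(greedy_cluster(turns))  # → [0, 1, 0]
```

With a step like this on top of the turn segmentation, "Speaker A … Speaker B … Speaker A" becomes recoverable instead of every turn getting a fresh label.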

Roadmap

  • inference code & demo
  • scoring and analysis tools
  • whisper.cpp integration
  • reproducible dataprep + finetuning*
  • blog post explainer*
  • HuggingFace integration
  • better LoRA-based small.en checkpoint
  • possibly clustering with NME-SC?
  • possibly large-v2 checkpoint?

The asterisk (*) marks where I am at the moment. Contributions will be easier to make after this point, and are most welcome!

References

[1] Joint Speech Recognition and Speaker Diarization via Sequence Transduction
[2] Serialized Output Training for End-to-End Overlapped Speech Recognition
[3] Turn-to-Diarize: Online Speaker Diarization Constrained by Transformer Transducer Speaker Turn Detection
[4] Adapting Multi-Lingual ASR Models for Handling Multiple Talkers

For information on the underlying Whisper model, please refer to the original documentation (release 20230308).

License

Code and model weights are released under the MIT License. See LICENSE for further details.