Minimal extension of OpenAI's Whisper adding speaker diarization with special tokens

tinydiarize 🐥🗣️

  • Speaker diarization labels who said what in a transcript (e.g. Speaker A, Speaker B …). It is essential for conversation transcripts like meetings or podcasts.
  • tinydiarize aims to be a minimal, interpretable extension of OpenAI's Whisper models that adds speaker diarization with few extra dependencies (inspired by minGPT).
  • This uses a finetuned model that adds special tokens to mark speaker changes [1,2,3,4]. It can use both voice and semantic context to tell speakers apart, which is a unique benefit of this approach.
  • It needs only a tiny change to the inference code (<50 lines) and runs with minimal extra cost. This makes it easy to add to ports like whisper.cpp that run on consumer hardware like MacBooks and iPhones.
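To make the idea concrete, here is a minimal sketch of what post-processing a token-marked transcript could look like. The marker string below is an assumption for illustration (check the actual special token in the finetuned tokenizer); the point is simply that alternating speaker labels fall out of splitting on the marker:

```python
from string import ascii_uppercase

# Assumed speaker-change marker; the real finetuned model emits its own
# special token, so check the tokenizer before relying on this string.
SPEAKER_TURN = "<|speakerturn|>"

def label_turns(transcript):
    """Split a token-marked transcript on the speaker-change marker and
    assign alternating labels (local diarization only, no clustering)."""
    turns = [t.strip() for t in transcript.split(SPEAKER_TURN) if t.strip()]
    return [(f"Speaker {ascii_uppercase[i % 26]}", text)
            for i, text in enumerate(turns)]

example = ("Thanks for joining the call. <|speakerturn|> Happy to be here. "
           "<|speakerturn|> Let's start with the quarterly numbers.")
for speaker, text in label_turns(example):
    print(f"{speaker}: {text}")
```

Note that without a global clustering step, labels only mark turn changes; the same person speaking twice gets two different labels.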

Demo

(demo video: demo_video-trim.mp4)

You can try it out on other such clips from YouTube using this notebook (Open In Colab).

Quickstart

Install ffmpeg following the original repo, then run:

pip install -e .
whisper --model small.en-tdrz AUDIO 

The only change is the small.en-tdrz model instead of small.en. That's it! 🎉

What's included?

  • Finetuned checkpoint for the small.en-tdrz model (located here) and example inference code (relevant edits in [#4] [#11]). This has the same dependencies as the original whisper repo.
  • Tools for comparison and analysis (under /tdrz_dev):
    • A scoring tool to measure and compare accuracy on your own data in an easy-to-interpret way.
    • A reference script to run and compare various diarization pipelines.
    • A Jupyter notebook to compare and understand performance in detail.
  • Finetuning code will also be made available shortly.

We aim to provide a starting point enabling anyone (or even OpenAI themselves!) to improve performance and extend support (multilingual, speech translation etc.).

Performance

| metric             | small.en | small.en-tdrz |
|--------------------|----------|---------------|
| spk_turn_precision | -        | 97.7          |
| spk_turn_recall    | -        | 70.8          |
| wer_overall        | 11.0     | 10.3          |
| wer_speaker_switch | 15.0     | 15.5          |

On a (tiny) benchmark set of 3 earnings calls, tdrz achieves near-perfect speaker-turn precision at fairly decent recall, while WER stays close to the original model's. Not too shabby for a tiny finetuning setup, and <10% extra inference cost!
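The actual scoring in tdrz_dev is word-aligned (via fstalign), but the metrics themselves are easy to illustrate. Here is a simplified sketch where a predicted turn boundary counts as a hit if it falls within a tolerance of an unmatched reference boundary; all timestamps below are hypothetical:

```python
def turn_precision_recall(ref, hyp, tol=0.5):
    """Simplified speaker-turn precision/recall: a hypothesis boundary is
    a hit if it lies within `tol` seconds of an unmatched reference
    boundary. The real scoring in tdrz_dev is word-aligned instead."""
    matched = set()
    hits = 0
    for h in hyp:
        best = None
        for i, r in enumerate(ref):
            if i not in matched and abs(h - r) <= tol:
                if best is None or abs(h - r) < abs(h - ref[best]):
                    best = i
        if best is not None:
            matched.add(best)
            hits += 1
    precision = hits / len(hyp) if hyp else 0.0
    recall = hits / len(ref) if ref else 0.0
    return precision, recall

# Hypothetical boundary timestamps (seconds):
ref_turns = [3.2, 10.5, 18.0, 25.4]
hyp_turns = [3.4, 17.8, 25.1]
p, r = turn_precision_recall(ref_turns, hyp_turns)
print(f"precision={p:.2f} recall={r:.2f}")  # → precision=1.00 recall=0.75
```

This also shows why the table's numbers can diverge: the model rarely inserts a spurious turn (high precision) but misses some real ones (lower recall).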

Refer to tdrz_dev for details on performance analysis and comparisons.

More info

  • Whisper small.en checkpoints were finetuned on ~100hrs of AMI meetings using HuggingFace Transformers and Datasets.
  • With some tricks, this can be done relatively cheaply: just 30 minutes of training on a single GPU starts to produce decent results. Tiny indeed 😊.
  • We used helpful tools from pyannote (the OG open-source diarization toolkit) for finetuning data preparation, and also to analyze its performance.
  • We make use of the excellent open-source revdotcom/fstalign tool for scoring and analysis.
  • Stay tuned for details in an upcoming blog post! 📺

Gotchas

Note that this is still an early proof of concept, and there are a few things to be aware of:

  • Only the small.en English model has been finetuned.
  • Word error rate (WER) is close to the original models', although not yet extensively tested. Ad-hoc inspection does show some differences in timestamp behavior (longer segments) and deletion errors. See the notebook under tdrz_dev for details.
  • Given a pretty tiny finetuning setup, there's likely a lot of room for further accuracy improvements.
  • Only local diarization (segmentation into speaker turns) is handled so far. Extension with global diarization (speaker clustering) is planned for later.
  • Stuff is still hacky and subject to change, so hold your horses just yet! 🐎
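Since only local diarization is handled so far, here is a toy sketch of what the planned global step adds: grouping per-turn speaker embeddings into identities. Real pipelines use proper clustering (e.g. the NME-SC mentioned in the roadmap) over learned embeddings; everything below — the greedy threshold rule and the 2-D "embeddings" — is hypothetical and purely illustrative:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (given as lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def greedy_cluster(embeddings, threshold=0.9):
    """Assign each turn to the first existing cluster whose anchor (its
    first member) is similar enough, else open a new cluster."""
    anchors, labels = [], []
    for emb in embeddings:
        for k, anchor in enumerate(anchors):
            if cosine(emb, anchor) >= threshold:
                labels.append(k)
                break
        else:
            anchors.append(emb)
            labels.append(len(anchors) - 1)
    return labels

# Toy per-turn embeddings: turns 0 and 2 sound alike, turn 1 differs.
turns = [[1.0, 0.0], [0.0, 1.0], [0.95, 0.05]]
print(greedy_cluster(turns))  # → [0, 1, 0]
```

With a step like this on top of the turn segmentation, "Speaker A … Speaker B … Speaker A" becomes recoverable instead of every turn getting a fresh label.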

Roadmap

  • inference code & demo
  • scoring and analysis tools
  • whisper.cpp integration
  • reproducible dataprep + finetuning*
  • blog post explainer*
  • HuggingFace integration
  • better LoRA-based small.en checkpoint
  • possibly clustering with NME-SC?
  • possibly large-v2 checkpoint?

The asterisk (*) marks where I am at the moment. Contributions will be easier to make after this point, and are most welcome!

References

[1] Joint Speech Recognition and Speaker Diarization via Sequence Transduction
[2] Serialized Output Training for End-to-End Overlapped Speech Recognition
[3] Turn-to-Diarize: Online Speaker Diarization Constrained by Transformer Transducer Speaker Turn Detection
[4] Adapting Multi-Lingual ASR Models for Handling Multiple Talkers

For information on the underlying Whisper model, please refer to the original documentation (release 20230308).

License

Code and model weights are released under the MIT License. See LICENSE for further details.