whisper_streaming

Whisper realtime streaming for long speech-to-text transcription and translation

Turning Whisper into Real-Time Transcription System

Demonstration paper, by Dominik Macháček, Raj Dabre, Ondřej Bojar, 2023

Abstract: Whisper is one of the recent state-of-the-art multilingual speech recognition and translation models, however, it is not designed for real time transcription. In this paper, we build on top of Whisper and create Whisper-Streaming, an implementation of real-time speech transcription and translation of Whisper-like models. Whisper-Streaming uses local agreement policy with self-adaptive latency to enable streaming transcription. We show that Whisper-Streaming achieves high quality and 3.3 seconds latency on unsegmented long-form speech transcription test set, and we demonstrate its robustness and practical usability as a component in live transcription service at a multilingual conference.

Paper PDF: https://aclanthology.org/2023.ijcnlp-demo.3.pdf

Demo video: https://player.vimeo.com/video/840442741

Slides -- 15 minutes oral presentation at IJCNLP-AACL 2023

Please, cite us. ACL Anthology, Bibtex citation:

@inproceedings{machacek-etal-2023-turning,
    title = "Turning Whisper into Real-Time Transcription System",
    author = "Mach{\'a}{\v{c}}ek, Dominik  and
      Dabre, Raj  and
      Bojar, Ond{\v{r}}ej",
    editor = "Saha, Sriparna  and
      Sujaini, Herry",
    booktitle = "Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: System Demonstrations",
    month = nov,
    year = "2023",
    address = "Bali, Indonesia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.ijcnlp-demo.3",
    pages = "17--24",
}

Installation

  1. pip install librosa soundfile -- audio processing library

  2. Whisper backend.

Several alternative backends are integrated. The most recommended one is faster-whisper with GPU support. Follow their instructions for NVIDIA libraries -- we succeeded with CUDNN 8.5.0 and CUDA 11.7. Install with pip install faster-whisper.

An alternative backend, less restrictive but slower, is whisper-timestamped: pip install git+https://github.com/linto-ai/whisper-timestamped

Thirdly, it is also possible to run this software with the OpenAI Whisper API. This solution is fast and requires no GPU; a small VM will suffice, but you will need to pay OpenAI for API access. Also note that, since each audio fragment is processed multiple times, the price will be higher than the pricing page suggests, so keep an eye on costs. Setting a higher chunk size will reduce costs significantly. Install with: pip install openai

For running with the openai-api backend, make sure that your OpenAI API key is set in the OPENAI_API_KEY environment variable. For example, before running, do: export OPENAI_API_KEY=sk-xxx with sk-xxx replaced with your API key.

The backend is loaded only when chosen. The unused one does not have to be installed.

  3. Optional, not recommended: sentence segmenter (aka sentence tokenizer)

Two buffer trimming options are integrated and evaluated. They affect both quality and latency. The default "segment" option performs better according to our tests and does not require any sentence segmenter to be installed.

The other option, "sentence" -- trimming at the end of confirmed sentences, requires a sentence segmenter to be installed. It splits punctuated text into sentences at full stops, avoiding the dots that are not full stops. The segmenters are language specific. The unused ones do not have to be installed. We integrate the following segmenters, but suggestions for better alternatives are welcome.

  • pip install opus-fast-mosestokenizer for the languages with codes as bn ca cs de el en es et fi fr ga gu hi hu is it kn lt lv ml mni mr nl or pa pl pt ro ru sk sl sv ta te yue zh

  • pip install tokenize_uk for Ukrainian -- uk

  • for other languages, we integrate a well-performing multilingual model of wtpsplit. It requires pip install torch wtpsplit, and its neural model wtp-canine-s-12l-no-adapters. It is downloaded to the default huggingface cache during the first use.

  • we did not find a segmenter for languages such as ba bo br bs fo haw hr ht jw lb ln lo mi nn oc sa sd sn so su sw tk tl tt that are supported by Whisper and not by wtpsplit. The default fallback option for them is wtpsplit with unspecified language. Alternative suggestions welcome.

In case of installation issues of opus-fast-mosestokenizer, especially on Windows and Mac, we recommend using only the "segment" option that does not require it.
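The language-to-segmenter mapping described above can be sketched as a small lookup. This is only an illustration: pick_segmenter is a hypothetical helper, not a function of this project, and it returns a descriptive label rather than loading any library. The language codes are copied from the lists above.

```python
# Language codes copied from the lists above.
MOSES_LANGS = set(
    "bn ca cs de el en es et fi fr ga gu hi hu is it kn lt lv ml mni mr nl "
    "or pa pl pt ro ru sk sl sv ta te yue zh".split()
)
# Supported by Whisper but not by wtpsplit.
NO_WTP_LANGS = set(
    "ba bo br bs fo haw hr ht jw lb ln lo mi nn oc sa sd sn so su sw tk tl tt".split()
)


def pick_segmenter(lan: str) -> str:
    """Return a label naming the segmenter to install for a language code.

    Hypothetical helper for illustration only.
    """
    if lan in MOSES_LANGS:
        return "opus-fast-mosestokenizer"
    if lan == "uk":
        return "tokenize_uk"
    if lan in NO_WTP_LANGS:
        return "wtpsplit (unspecified language)"
    return "wtpsplit"


if __name__ == "__main__":
    for code in ("en", "uk", "hr", "ja"):
        print(code, "->", pick_segmenter(code))
```

With the default "segment" trimming option, none of these segmenters is needed.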

Usage

Real-time simulation from audio file

usage: whisper_online.py [-h] [--min-chunk-size MIN_CHUNK_SIZE] [--model {tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large-v3,large}] [--model_cache_dir MODEL_CACHE_DIR] [--model_dir MODEL_DIR] [--lan LAN] [--task {transcribe,translate}]
                         [--backend {faster-whisper,whisper_timestamped,openai-api}] [--vad] [--buffer_trimming {sentence,segment}] [--buffer_trimming_sec BUFFER_TRIMMING_SEC] [--start_at START_AT] [--offline] [--comp_unaware]
                         audio_path

positional arguments:
  audio_path            Filename of 16kHz mono channel wav, on which live streaming is simulated.

options:
  -h, --help            show this help message and exit
  --min-chunk-size MIN_CHUNK_SIZE
                        Minimum audio chunk size in seconds. It waits up to this time to do processing. If the processing takes shorter time, it waits, otherwise it processes the whole segment that was received by this time.
  --model {tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large-v3,large}
                        Name size of the Whisper model to use (default: large-v2). The model is automatically downloaded from the model hub if not present in model cache dir.
  --model_cache_dir MODEL_CACHE_DIR
                        Overriding the default model cache dir where models downloaded from the hub are saved
  --model_dir MODEL_DIR
                        Dir where Whisper model.bin and other files are saved. This option overrides --model and --model_cache_dir parameter.
  --lan LAN, --language LAN
                        Source language code, e.g. en,de,cs, or 'auto' for language detection.
  --task {transcribe,translate}
                        Transcribe or translate.
  --backend {faster-whisper,whisper_timestamped,openai-api}
                        Load only this backend for Whisper processing.
  --vad                 Use VAD = voice activity detection, with the default parameters.
  --buffer_trimming {sentence,segment}
                        Buffer trimming strategy -- trim completed sentences marked with punctuation mark and detected by sentence segmenter, or the completed segments returned by Whisper. Sentence segmenter must be installed for "sentence" option.
  --buffer_trimming_sec BUFFER_TRIMMING_SEC
                        Buffer trimming length threshold in seconds. If buffer length is longer, trimming sentence/segment is triggered.
  --start_at START_AT   Start processing audio at this time.
  --offline             Offline mode.
  --comp_unaware        Computationally unaware simulation.

Example:

It simulates real-time processing from a pre-recorded mono 16 kHz WAV file.

python3 whisper_online.py en-demo16.wav --language en --min-chunk-size 1 > out.txt
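Since audio_path must be a 16 kHz mono WAV file, it can be worth checking the file before starting a simulation. The following is a minimal sketch using only the Python standard library; check_wav is a hypothetical helper and not part of whisper_online.py.

```python
import wave


def check_wav(path: str) -> list[str]:
    """Return a list of format problems; an empty list means 16 kHz mono."""
    problems = []
    with wave.open(path, "rb") as w:
        if w.getframerate() != 16000:
            problems.append(f"sample rate is {w.getframerate()} Hz, expected 16000")
        if w.getnchannels() != 1:
            problems.append(f"{w.getnchannels()} channels, expected 1 (mono)")
    return problems
```

For example, check_wav("en-demo16.wav") should return an empty list if the demo file matches the expected format.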

Simulation modes:

  • default mode, no special option: real-time simulation from file, computationally aware. The chunk size is MIN_CHUNK_SIZE or larger, if more audio arrived during last update computation.

  • --comp_unaware option: computationally unaware simulation. It means that the timer that counts the emission times "stops" when the model is computing. The chunk size is always MIN_CHUNK_SIZE. The latency is caused only by the model being unable to confirm the output, e.g. because of language ambiguity etc., and not because of slow hardware or suboptimal implementation. We implement this feature for finding the lower bound for latency.

  • --start_at START_AT: Start processing audio at this time. The first update receives the whole audio up to START_AT. It is useful for debugging, e.g. when we observe a bug at a specific time in the audio file and want to reproduce it quickly, without a long wait.

  • --offline option: It processes the whole audio file at once, in offline mode. We implement it to find out the lowest possible WER on given audio file.
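The difference between the computationally aware and unaware modes can be illustrated with a toy simulation of chunk boundaries. This is a sketch under stated assumptions, not the code of whisper_online.py: chunk_ends is a hypothetical function, and proc_times is a made-up list of per-update processing times in seconds that stands in for the real model.

```python
from itertools import cycle


def chunk_ends(duration, min_chunk, proc_times, comp_aware=True):
    """Return the audio time (s) up to which each simulated update has received audio."""
    ends = []
    times = cycle(proc_times)  # simulated processing time of each update
    now = 0.0  # simulated wall clock
    end = 0.0  # audio consumed so far
    while end < duration:
        if comp_aware:
            # Wait until at least min_chunk of new audio has arrived,
            # then take everything received so far.
            now = max(now, end + min_chunk)
            end = min(now, duration)
            # The clock keeps running while the model computes.
            now += next(times)
        else:
            # The clock "stops" during computation: always exactly min_chunk.
            end = min(end + min_chunk, duration)
        ends.append(end)
    return ends


if __name__ == "__main__":
    print(chunk_ends(5, 1, [0.5]))                    # fast model: 1 s chunks
    print(chunk_ends(5, 1, [2.5]))                    # slow model: chunks grow
    print(chunk_ends(5, 1, [2.5], comp_aware=False))  # unaware: always 1 s
```

When the simulated model is slower than real time, the aware mode receives larger chunks, while the unaware mode keeps the chunk size at MIN_CHUNK_SIZE regardless of the processing time.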

Output format

2691.4399 300 1380 Chairman, thank you.
6914.5501 1940 4940 If the debate today had a
9019.0277 5160 7160 the subject the situation in
10065.1274 7180 7480 Gaza
11058.3558 7480 9460 Strip, I might
12224.3731 9460 9760 have
13555.1929 9760 11060 joined Mrs.
14928.5479 11140 12240 De Kaiser and all the
16588.0787 12240 12560 other
18324.9285 12560 14420 colleagues across the

See description here
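The output lines can be read back with a few lines of Python. The sketch below assumes, and this is not stated in the text above, that the three numeric columns are the emission time, the start timestamp and the end timestamp of the text segment, all in milliseconds, followed by the text. OutputLine and parse_line are hypothetical names.

```python
from typing import NamedTuple


class OutputLine(NamedTuple):
    emitted_ms: float  # assumed: emission time since processing began
    start_ms: float    # assumed: start timestamp of the text segment
    end_ms: float      # assumed: end timestamp of the text segment
    text: str


def parse_line(line: str) -> OutputLine:
    """Split one output line into its three numeric columns and the text."""
    emitted, start, end, text = line.rstrip("\n").split(" ", 3)
    return OutputLine(float(emitted), float(start), float(end), text)


if __name__ == "__main__":
    row = parse_line("2691.4399 300 1380 Chairman, thank you.")
    print(row.text)
    # Under the column assumption above, this is a rough per-segment latency.
    print(row.emitted_ms - row.end_ms)
```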

As a module

TL;DR: use OnlineASRProcessor object and its methods insert_audio_chunk and process_iter.

The code in whisper_online.py is thoroughly commented; read it as the full documentation.

This pseudocode describes the interface that we suggest for your implementation. You can implement any features that you need for your application.

from whisper_online import *

src_lan = "en"  # source language
tgt_lan = "en"  # target language -- same as source for ASR, "en" if the translate task is used

asr = FasterWhisperASR(src_lan, "large-v2")  # loads and wraps the Whisper model
# set options:
# asr.set_translate_task()  # it will translate from src_lan into English
# asr.use_vad()  # set using VAD

online = OnlineASRProcessor(asr)  # create processing object with default buffer trimming option

# audio_has_not_ended() and receive_audio_chunk() are placeholders for your application's audio source
while audio_has_not_ended():   # processing loop:
    a = receive_audio_chunk()  # receive new audio chunk (and e.g. wait for min_chunk_size seconds first, ...)
    online.insert_audio_chunk(a)
    o = online.process_iter()
    print(o)  # do something with current partial output
# at the end of this audio processing
o = online.finish()
print(o)  # do something with the last output

online.init()  # refresh if you're going to re-use the object for the next audio

Server -- real-time from mic

whisper_online_server.py has the same model options as whisper_online.py, plus --host and --port of the TCP connection. See help message (-h option).

Client example:

arecord -f S16_LE -c1 -r 16000 -t raw -D default | nc localhost 43001
  • arecord sends real-time audio from a sound device (e.g. mic), in raw audio format -- 16000 Hz sampling rate, mono channel, S16_LE -- signed 16-bit integer, little-endian. (Use the alternative to arecord that works for you.)

  • nc is netcat with server's host and port
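Where arecord and nc are not available, a client can also be sketched in Python. The example below converts floating-point samples to the raw S16_LE format described above and sends them over TCP. to_s16le and send_audio are hypothetical helpers, the default host and port are taken from the nc example, and samples are assumed to be 16 kHz mono floats in the range [-1.0, 1.0].

```python
import socket
import struct


def to_s16le(samples):
    """Convert floats in [-1.0, 1.0] to raw signed 16-bit little-endian PCM bytes."""
    # Clip to the valid range, then scale to the 16-bit integer range.
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


def send_audio(chunks, host="localhost", port=43001):
    """Stream an iterable of float-sample chunks to the server over TCP."""
    with socket.create_connection((host, port)) as sock:
        for chunk in chunks:
            sock.sendall(to_s16le(chunk))
```

A real client would read the chunks from a microphone library of your choice and pace them in real time; that part is left out here.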

Background

Default Whisper is intended for audio chunks of at most 30 seconds that contain one full sentence. Longer audio files must be split into shorter chunks and merged with an "init prompt". In low-latency simultaneous streaming mode, simple and naive chunking into fixed-size windows does not work well because it can split a word in the middle. It is also necessary to know when the transcript is stable, should be confirmed ("committed") and followed up, and when future content will make the transcript clearer.

For that, there is LocalAgreement-n policy: if n consecutive updates, each with a newly available audio stream chunk, agree on a prefix transcript, it is confirmed. (Reference: CUNI-KIT at IWSLT 2022 etc.)
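A minimal sketch of the LocalAgreement-n idea follows. It is a simplification for illustration, not the implementation in whisper_online.py: it assumes that each update yields a full hypothesis as a list of words starting at the same point, and that confirmed words are never revised. agreed_prefix and LocalAgreement are hypothetical names.

```python
def agreed_prefix(hypotheses):
    """Longest common word prefix of the given hypotheses (lists of words)."""
    prefix = []
    for words in zip(*hypotheses):
        if any(w != words[0] for w in words):
            break
        prefix.append(words[0])
    return prefix


class LocalAgreement:
    """Confirm a prefix once n consecutive updates agree on it."""

    def __init__(self, n=2):
        self.n = n
        self.recent = []     # the last n hypotheses
        self.committed = []  # words confirmed so far (assumed never revised)

    def update(self, hypothesis):
        """Feed the hypothesis of the current update; return newly confirmed words."""
        self.recent = (self.recent + [hypothesis])[-self.n:]
        if len(self.recent) < self.n:
            return []
        prefix = agreed_prefix(self.recent)
        new = prefix[len(self.committed):]
        self.committed.extend(new)
        return new


if __name__ == "__main__":
    policy = LocalAgreement(n=2)
    for hyp in ("if the debate", "if the debate today hat", "if the debate today had a"):
        print(policy.update(hyp.split()))
```

In this toy run, "today" is confirmed only by the third update, because the second and third hypotheses are the first pair to agree on it.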

In this project, we re-use the idea of Peter Polák from this demo: https://github.com/pe-trik/transformers/blob/online_decode/examples/pytorch/online-decoding/whisper-online-demo.py That demo does not do any sentence segmentation, but Whisper produces punctuation, and the libraries faster-whisper and whisper-timestamped produce word-level timestamps. In short: we consecutively process new audio chunks, emit the transcripts that are confirmed by 2 iterations, and scroll the audio processing buffer on a timestamp of a confirmed complete sentence. The processing audio buffer is not too long and the processing is fast.

In more detail: we use the init prompt, we handle the inaccurate timestamps, we re-process confirmed sentence prefixes and skip them, making sure they don't overlap, and we limit the processing buffer window.
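Limiting the processing buffer window can be sketched as choosing a cut point among the segment end timestamps. This is a simplified illustration of the "segment" trimming idea, not the project's code: trim_time is a hypothetical function, all times are in seconds, and the default threshold of 15.0 is an arbitrary value chosen for the example.

```python
def trim_time(segment_ends, committed_end, buffer_start, buffer_end, threshold=15.0):
    """Return the timestamp to cut the audio buffer at, or None to keep it whole.

    segment_ends:  end timestamps of the segments in the current buffer
    committed_end: end timestamp of the last confirmed word
    """
    if buffer_end - buffer_start <= threshold:
        return None  # buffer is still short enough
    # Ignore the last segment, since it may still be incomplete, and only
    # cut at a segment end that lies within the confirmed transcript.
    candidates = [e for e in segment_ends[:-1] if buffer_start < e <= committed_end]
    return max(candidates) if candidates else None


if __name__ == "__main__":
    print(trim_time([4.0, 9.5, 16.0, 18.2], committed_end=12.0,
                    buffer_start=0.0, buffer_end=18.2))
```

Audio before the returned timestamp can then be dropped from the buffer, since its transcript is already confirmed.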

Contributions are welcome.

Performance evaluation

See the paper.

Contact

Dominik Macháček, [email protected]
