⚠️ EXPERIMENTAL: Transcribe audio to any language w/ 🤗 Transformers

Whisper, released by OpenAI in late 2022, still delivers near-state-of-the-art performance on English and multilingual benchmarks.

The model was trained to do two key speech recognition tasks:

  1. Transcribe audio in its original language, i.e. take audio in language "X" and transcribe it in "X".
  2. Translate audio directly into English, i.e. take audio in language "X" and produce English text.
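
Under the hood, Whisper picks between these two tasks through special tokens placed at the start of the decoder sequence. The sketch below builds that prompt as plain strings, with no model or tokenizer involved; the helper name `build_decoder_prompt` is made up for illustration.

```python
def build_decoder_prompt(language: str, task: str) -> list[str]:
    """Return the special tokens that open a Whisper decoder sequence.

    `language` is a two-letter code such as "en" or "de" (the language of
    the audio); `task` is either "transcribe" or "translate".
    """
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task!r}")
    return [
        "<|startoftranscript|>",
        f"<|{language}|>",    # language tag
        f"<|{task}|>",        # task tag
        "<|notimestamps|>",   # plain text, no timestamp tokens
    ]


# Task 1: Hindi audio -> Hindi text
print(build_decoder_prompt("hi", "transcribe"))
# Task 2: Hindi audio -> English text
print(build_decoder_prompt("hi", "translate"))
```

The only difference between the two prompts is the task token, which is what makes the language token an interesting knob to play with.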

As the world grows more connected, the need for high-quality content keeps increasing. One way to make content (especially audio) more accessible is to transcribe it into different languages, so that the knowledge can spread further. ⚡️

The typical workflow for transcribing from audio in language "X" to another language is as follows:

  1. Transcribe and translate the audio in language "X" into English. (Base Whisper behaviour)
  2. Translate the English transcription into the target language. (Typically done with an LLM, for example GPT-3.5/4)

This works great! However, as with any other process, the more steps you run, the more chances there are for errors to creep in.
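
To make the error-creep argument concrete, here is a back-of-the-envelope sketch. It assumes each step succeeds independently with a fixed probability, which is a simplification, and the numbers are purely illustrative, not measurements.

```python
from math import prod


def chain_success_rate(step_rates: list[float]) -> float:
    """Probability that every step in a cascade succeeds,
    assuming the steps fail independently of each other."""
    return prod(step_rates)


# Illustrative numbers only: a 95%-reliable speech step
# followed by a 90%-reliable text translation step.
two_step = chain_success_rate([0.95, 0.90])
one_step = chain_success_rate([0.95])
print(f"two steps: {two_step:.3f}, one step: {one_step:.3f}")
```

Under these assumptions the cascade can never be more reliable than its weakest step, which is the motivation for trying a single-step approach.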

Could we transcribe from language "X" to "Y" in one step?

TL;DR - Yes! Keep in mind that this is a hack, but it seems to work pretty well in our tests! These notes describe how to do it, but any serious use of the technique would need much more thorough validation, because the model wasn't trained on the task we'll use it for and the results may not be as reliable.

Alright, let's get to it! To demonstrate how this works, let's try to transcribe English (en) audio into German (de), Italian (it), Spanish (es), Dutch (nl) and French (fr).

For a more interactive experience you can follow along with this colab!

Note: This tutorial assumes that you have run huggingface-cli login or used notebook_login() to authenticate with the Hub. This is only needed to access Common Voice; you can safely skip this step if you run inference on your own audio files or public datasets.

!pip -q install transformers datasets huggingface_hub

Let's instantiate our speech recognition pipeline! For the purposes of the Colab demo, we'll use the Whisper large-v2 checkpoint in half-precision (fp16). If you have access to a GPU with more VRAM, you can remove the torch_dtype argument 🤗

import torch
from transformers import pipeline

whisper_asr = pipeline(
    "automatic-speech-recognition", 
    model="openai/whisper-large-v2",
    torch_dtype=torch.float16,
    device="cuda:0"
    )

To keep things simple, we'll use the Common Voice dataset from the 🤗 Hub in streaming mode and resample the audio to 16 kHz, as expected by Whisper.

from datasets import load_dataset
from datasets import Audio

common_voice_en = load_dataset("mozilla-foundation/common_voice_11_0", "en",
                               revision="streaming",
                               split="test",
                               streaming=True,
                               use_auth_token=True)

common_voice_en = common_voice_en.cast_column("audio",
                                              Audio(sampling_rate=16000))
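
Each Common Voice example stores its decoded audio as a dictionary with `array` and `sampling_rate` keys. Since the pipeline calls in this tutorial pass only the raw array, a small guard like the one below can catch audio that was not resampled. This is an optional sketch; the helper name `check_sampling_rate` is hypothetical.

```python
WHISPER_SAMPLING_RATE = 16_000  # Hz, the rate Whisper expects


def check_sampling_rate(audio: dict, expected: int = WHISPER_SAMPLING_RATE) -> dict:
    """Raise if a decoded audio dict is not at the expected sampling rate."""
    actual = audio["sampling_rate"]
    if actual != expected:
        raise ValueError(f"expected {expected} Hz audio, got {actual} Hz")
    return audio


# Stand-in for `next(iter(common_voice_en))["audio"]`
fake_audio = {"array": [0.0, 0.1, -0.1], "sampling_rate": 16_000}
print(check_sampling_rate(fake_audio)["sampling_rate"])
```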

Since we cannot render audio in markdown, let's take a look at the reference transcription of the first example instead.

next(iter(common_voice_en))["sentence"]

output:

Reading metadata...: 16354it [00:00, 31433.60it/s]
'Joe Keaton disapproved of films, and Buster also had reservations about the medium.'

Let's create a wee list of languages to transcribe to.

list_of_languages = ["de", "it", "es", "nl", "fr"]

Magic sauce 🍝

We essentially force Whisper to decode in one specific language. Because Whisper was trained on roughly 680,000 hours of audio, it is able to do so fairly well.

So the only change needed is to keep the task set to transcribe and change the target language.

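for lang in list_of_languages:
    # Force the decoder prompt to the target language while keeping the task as "transcribe"
    whisper_asr.model.config.forced_decoder_ids = (
        whisper_asr.tokenizer.get_decoder_prompt_ids(
            language=lang,
            task="transcribe"
        )
    )
    # Decode the first Common Voice example (English audio) in that language
    print(whisper_asr(next(iter(common_voice_en))["audio"]["array"])["text"])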

output:

Reading metadata...: 16354it [00:00, 33718.24it/s]
 Joe Keaton hat Filme verabschiedet und Buster hatte auch Reservations über die Medien.
Reading metadata...: 16354it [00:00, 27172.67it/s]
 Joe Keaton ha disapprovato i film e Buster ha anche delle riservazioni sui media.
Reading metadata...: 16354it [00:00, 41110.13it/s]
 Joe Keaton disaproveció de los filmes y Buster también tenía reservaciones sobre el medio.
Reading metadata...: 16354it [00:00, 39696.06it/s]
 Joe Keaton onverstaanbaar van de films en Buster had ook bewaarschuwingen over de media.
Reading metadata...: 16354it [00:00, 39813.59it/s]
 Joe Keaton a dénoncé les films et Buster avait des réservations sur le médium.

Voilà, it works! We successfully transcribed English audio into other languages.
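
The loop above calls `next(iter(common_voice_en))` on every pass, which is why the metadata is re-read for each language. A possible refactor, sketched below, fetches the audio once and wraps the same `forced_decoder_ids` trick in a function. The name `transcribe_into` is hypothetical, and the function only assumes that `asr` behaves like the `whisper_asr` pipeline used here.

```python
def transcribe_into(asr, audio_array, languages, generate_kwargs=None):
    """Decode one audio clip once per target language.

    `asr` is expected to look like the `whisper_asr` pipeline above:
    callable, with `.model.config` and `.tokenizer` attributes.
    Returns a dict mapping language code to decoded text.
    """
    results = {}
    for lang in languages:
        # Same trick as in the tutorial: force the language token
        # while keeping the task set to "transcribe".
        asr.model.config.forced_decoder_ids = (
            asr.tokenizer.get_decoder_prompt_ids(language=lang, task="transcribe")
        )
        kwargs = {"generate_kwargs": generate_kwargs} if generate_kwargs else {}
        results[lang] = asr(audio_array, **kwargs)["text"].strip()
    return results


# Usage with the objects defined earlier (commented out: needs the model and dataset):
# sample = next(iter(common_voice_en))            # fetch the clip once
# texts = transcribe_into(whisper_asr, sample["audio"]["array"], list_of_languages)
# for lang, text in texts.items():
#     print(lang, text)
```

Returning a dictionary instead of printing makes it easier to compare outputs across languages or decoding settings later on.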

Some of these translations are a bit off, but we can try to improve them with some neat generation techniques like contrastive search!
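
For intuition, contrastive search picks the next token from the top-k candidates by trading off model confidence against similarity to what has already been generated, with `penalty_alpha` setting the balance. The toy sketch below scores candidates that way using hand-made probabilities and two-dimensional vectors that stand in for the model's hidden states. It illustrates the scoring rule only and is not how 🤗 Transformers implements it internally.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def contrastive_pick(candidates, context_vectors, penalty_alpha=0.6):
    """Pick the candidate with the best contrastive score.

    `candidates` is a list of (token, probability, vector) tuples, already
    restricted to the top-k tokens. `context_vectors` holds the vectors of
    previously generated tokens.
    score = (1 - alpha) * probability - alpha * max similarity to context
    """
    def score(candidate):
        _, prob, vec = candidate
        degeneration = max(cosine(vec, ctx) for ctx in context_vectors)
        return (1 - penalty_alpha) * prob - penalty_alpha * degeneration

    return max(candidates, key=score)[0]


# Toy data: "films" is the most likely token but repeats the context.
context = [[1.0, 0.0]]
top_k = [
    ("films", 0.6, [1.0, 0.0]),   # identical to context, heavily penalised
    ("medium", 0.4, [0.0, 1.0]),  # orthogonal to context, no penalty
]
print(contrastive_pick(top_k, context))                     # contrastive choice
print(contrastive_pick(top_k, context, penalty_alpha=0.0))  # plain greedy choice
```

With `penalty_alpha=0.0` the penalty term vanishes and the rule falls back to picking the most probable token.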

You can use contrastive search by providing penalty_alpha and top_k to the generate_kwargs in the pipeline. You can read more about it here. 🤗

for lang in list_of_languages:
    whisper_asr.model.config.forced_decoder_ids = (
        whisper_asr.tokenizer.get_decoder_prompt_ids(
            language=lang,
            task="transcribe"
        )
    )
    print(whisper_asr(
        next(iter(common_voice_en))["audio"]["array"],
        generate_kwargs={
            "penalty_alpha": 0.6,
            "top_k": 5,
        }
    )["text"])

output:

Reading metadata...: 16354it [00:00, 39409.04it/s]
 Joe Keaton verabschiedete sich von Filmen und Buster hatte Regeltäusen über die Medien.
Reading metadata...: 16354it [00:00, 34203.76it/s]
 Joe Keaton disapprovò i film e Buster aveva anche riservazioni sui media.
Reading metadata...: 16354it [00:00, 24372.39it/s]
 Joe Keaton aprovechó de los filmes y Buster también tenía reservaciones sobre el medio.
Reading metadata...: 16354it [00:00, 41170.46it/s]
 Joe Keaton onvoldoende films en Buster had ook besluitingen over het medium.
Reading metadata...: 16354it [00:00, 23721.35it/s]
 Joe Keaton n'approuve pas les films et Buster avait également des préjugés sur le media.

Notice the subtle differences in the transcriptions, and how the model still gets some things wrong here and there. For your actual use case, I'd recommend tuning these parameters a bit or using one of the fine-tuned models on the Hub.
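
One way to approach that tuning is a small grid over `penalty_alpha` and `top_k`, comparing the outputs by eye or with a metric of your choice. The sketch below is one possible shape for it. `sweep_contrastive` is a hypothetical helper, the grid values are arbitrary starting points rather than recommendations, and it assumes the target language has already been forced on the model as shown earlier.

```python
from itertools import product


def sweep_contrastive(asr, audio_array, alphas, top_ks):
    """Run the pipeline once per (penalty_alpha, top_k) pair.

    `asr` is assumed to accept `generate_kwargs` like the `whisper_asr`
    pipeline above. Returns {(alpha, top_k): decoded text}.
    """
    outputs = {}
    for alpha, k in product(alphas, top_ks):
        result = asr(
            audio_array,
            generate_kwargs={"penalty_alpha": alpha, "top_k": k},
        )
        outputs[(alpha, k)] = result["text"].strip()
    return outputs


# Usage with the objects defined earlier (commented out: needs the model and dataset):
# sample = next(iter(common_voice_en))
# grid = sweep_contrastive(whisper_asr, sample["audio"]["array"],
#                          alphas=[0.4, 0.6], top_ks=[4, 5, 8])
# for params, text in grid.items():
#     print(params, text)
```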

Good luck! 🤝

Next steps

  1. Contrastive search often results in over-generation; find strategies to reduce it.
  2. Run a benchmark on the FLORES dataset.
  3. Run the same benchmark on fine-tuned Whisper models.
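
As a possible starting point for the first item, over-generation could be flagged by comparing the length of the output with the length of the source sentence. This is only a rough heuristic sketched under assumptions: the helper names and the 1.5 threshold are invented for illustration and have not been validated.

```python
def length_ratio(source_text: str, output_text: str) -> float:
    """Word-count ratio of the model output to the source sentence."""
    source_words = source_text.split()
    if not source_words:
        raise ValueError("source_text must contain at least one word")
    return len(output_text.split()) / len(source_words)


def looks_overgenerated(source_text: str, output_text: str, max_ratio: float = 1.5) -> bool:
    """Flag outputs that are much longer than the source sentence.

    The 1.5 threshold is an arbitrary assumption; languages differ in
    verbosity, so it would need tuning per language pair.
    """
    return length_ratio(source_text, output_text) > max_ratio


source = "Joe Keaton disapproved of films, and Buster also had reservations about the medium."
output = "Joe Keaton a dénoncé les films et Buster avait des réservations sur le médium."
print(looks_overgenerated(source, output))
```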

Help is more than welcome! Just open an issue or PR and we can work together on this! 🤗
