Using IBM Watson's Speech-to-Text API to do Multi-Threaded Transcription of Really Long and Talky Videos, Such as Presidential Debates
A demonstration of how to use Python and IBM Watson's Speech-to-Text API to do some decently accurate transcription of real-world video and audio, at amazingly fast speeds.
Note: I'm just spit-balling code here, not making a user-friendly package. I'm focused on making an automated workflow to create fun supercuts of "The Wire"...and will polish the scripts and implementation later. These notes and scripts (and data files) are merely for your reference.
tl;dr
IBM Watson offers a REST-based Speech to Text API that allows free usage for the first 1,000 minutes each month (and $0.02 for each additional minute):
Watson Speech to Text can be used anywhere there is a need to bridge the gap between the spoken word and its written form. This easy-to-use service uses machine intelligence to combine information about grammar and language structure with knowledge of the composition of an audio signal to generate an accurate transcription. It uses IBM's speech recognition capabilities to convert speech in multiple languages into text. The transcription of incoming audio is continuously sent back to the client with minimal delay, and it is corrected as more speech is heard.
In my preliminary tests, it's not quite as good as Google Translate in terms of pure accuracy, but it's more than good enough for finding key words, whether they be relatively common verbs like "fight", "death", "kill" or proper nouns, such as Obama and countries of the world.
It does less well on very common (and aurally-ambiguous) short words such as pronouns and articles. But because Watson provides a confidence level for each word, it's possible to write scripts to programmatically filter out ambiguous words.
Here's a YouTube playlist of some automated supercuts I've created from the U.S. presidential primary debates. My favorite is probably this supercut of Senator Sanders and Secretary Clinton saying fighting words.
Here's the JSON returned from Watson, which includes word-by-word timestamps and confidence levels. Here's a simplified version of it, in which the JSON is just a flat list of words.
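As a rough sketch of what "flattening" the response could look like: the function below walks a Watson-style response and emits one dict per word. The `results` / `alternatives` / `timestamps` / `word_confidence` layout follows the JSON shown in this README, but the output keys (`text`, `start`, `end`, `confidence`) and the sample data are my own assumptions for illustration, not necessarily what the simplified file in this repo uses.

```python
def flatten_words(response):
    """Flatten a Watson-style response into a flat list of per-word dicts.

    Output key names are hypothetical, chosen for this sketch only.
    """
    words = []
    for result in response.get("results", []):
        best = result["alternatives"][0]
        timestamps = best.get("timestamps", [])
        confidences = best.get("word_confidence", [])
        # timestamps entries look like [word, start, end];
        # word_confidence entries look like [word, confidence]
        for (text, start, end), (_, conf) in zip(timestamps, confidences):
            words.append(
                {"text": text, "start": start, "end": end, "confidence": conf}
            )
    return words


# Made-up sample data, only to show the shape of the input
sample = {
    "results": [
        {
            "alternatives": [
                {
                    "transcript": "live free or die ",
                    "timestamps": [
                        ["live", 1.0, 1.3],
                        ["free", 1.3, 1.6],
                        ["or", 1.6, 1.7],
                        ["die", 1.7, 2.1],
                    ],
                    "word_confidence": [
                        ["live", 0.99],
                        ["free", 0.98],
                        ["or", 0.61],
                        ["die", 0.97],
                    ],
                }
            ]
        }
    ]
}

flat = flatten_words(sample)
print(flat[0])
```

A flat list like this is much easier to grep or filter than the nested response.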
IBM Watson's API is robust enough to accept many concurrent requests. In the sample scripts I've included in this repo, I was able to break up a 90 minute debate into 5 minute segments and send them up to Watson simultaneously...resulting in a 6 to 7 minute processing time for the entire 90 minutes.
Some non-presidential examples:
- Attempting to transcribe the profanities in The Wire's "Old Cases" episode -- (youtube supercut, obviously nsfw)
- Attempting to transcribe a ProPublica podcast
Quick *nix check!
Before you look at the scary Python framework I've built for myself, you should first see if you can work with movie/audio files and connect to Watson using nothing but Unix tools, ffmpeg and good ol' curl. Check out this brief walkthrough.
Supercut fun
You probably want to see the final product. I'm too lazy to document all the code and haven't organized it yet, but here's one result: making supercuts by grepping the Watson Speech to Text data for certain words. For example, to find all "fighting words", e.g. war, wars, warriors, fight, bomb, kill, threat, terror, death, murder, torture:
python supercut.py republican-debate-sc-2016-02-13 '\bwar(?:riors?|s)?\b|fight|bomb|kill|threat|terror|death|murder|tortur'
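To give a feel for what that regex does and doesn't catch, here's a minimal sketch that applies the same pattern to a flat list of words. The word list, its keys, and the `find_matches` helper are invented for illustration. This is not the actual code in supercut.py.

```python
import re

# Same pattern as passed to supercut.py above
PATTERN = re.compile(
    r"\bwar(?:riors?|s)?\b|fight|bomb|kill|threat|terror|death|murder|tortur"
)


def find_matches(words, pattern):
    """Return the word entries whose text matches the pattern."""
    return [w for w in words if pattern.search(w["text"])]


# Made-up word entries (text plus start/end times in seconds)
words = [
    {"text": "war", "start": 12.1, "end": 12.4},
    {"text": "warm", "start": 20.0, "end": 20.3},
    {"text": "fighting", "start": 31.5, "end": 32.0},
    {"text": "award", "start": 40.2, "end": 40.6},
    {"text": "terrorists", "start": 55.0, "end": 55.8},
]

hits = find_matches(words, PATTERN)
hit_texts = [w["text"] for w in hits]
print(hit_texts)
```

The `\b` word boundaries around the "war" alternative keep it from matching "warm" or "award", while the other alternatives deliberately match as substrings, so "fight" also catches "fighting".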
Here's a playlist of sample supercuts of presidential people:
Republican Debate, South Carolina, 2016-02-13:
- PEOPLE
- America, and other geopolitical words
- Obama, Clinton, and Bush
- Fighting words (fight, bomb, kill)
- Negative words (wrong, bad)
- Positive words (good, best)
Democratic Debate, Wisconsin, 2016-02-11:
- PEOPLE
- America, and other geopolitical words
- Obama, Clinton, and Bush
- Fighting words (fight, bomb, kill)
- But, Why, Not, Yes, Now
Obama weekly address
The technical details
How it works
After you've downloaded a video file to disk, the assorted scripts and commands in this repo will:
- Convert the file to mp4 if necessary
- Create a project subfolder to store the video file and all derived audio and transcript files
- Extract the audio as 16-bit, 16 kHz WAV files
- Split the audio into segments (300 seconds each, by default)
- Send each of those segments to Watson's API to be analyzed and transcribed.
- Save the raw responses from Watson's API for each audio file
- Compile all of the resulting responses into one data file, as if you had sent the entire audio file to be analyzed in a single go.
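The splitting and compiling steps above can be sketched roughly as follows. The zero-padded start-end filename pattern is inferred from the sample files in this repo (e.g. 00000-00190.wav). The function names and the idea of shifting each segment's word timestamps by its start offset are my assumptions about how the compile step could work, not a copy of the repo's scripts.

```python
def segment_bounds(total_seconds, segment_seconds=300):
    """Return (start, end) pairs, in whole seconds, covering the audio."""
    bounds = []
    start = 0
    while start < total_seconds:
        end = min(start + segment_seconds, total_seconds)
        bounds.append((start, end))
        start = end
    return bounds


def segment_name(start, end):
    """Build a filename like 00000-00190.wav (pattern assumed from samples)."""
    return "{:05d}-{:05d}.wav".format(start, end)


def merge_segments(segments):
    """Merge per-segment word timestamps into one timeline.

    segments: list of (offset_seconds, words), where each word is
    [text, start, end] measured from the start of its own segment.
    """
    merged = []
    for offset, words in sorted(segments, key=lambda s: s[0]):
        for text, start, end in words:
            # Shift segment-relative times to full-audio times
            merged.append([text, round(start + offset, 2), round(end + offset, 2)])
    return merged


bounds = segment_bounds(700)
names = [segment_name(s, e) for s, e in bounds]
timeline = merge_segments(
    [
        (300, [["b", 1.5, 2.0]]),
        (0, [["a", 0.5, 1.0]]),
    ]
)
print(names)
print(timeline)
```

Shifting by the segment offset is what makes the compiled file look "as if" the whole track had been sent in one request.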
The advantage of splitting up the audio is that it allows the transcription to be done in parallel. An hour-long audio track would probably take an hour to get a response back (if your internet connection doesn't fail), whereas 60 parallel requests to analyze 1 minute each will take roughly...1 minute to complete.
I haven't tested the upper bound on concurrent requests to Watson's API, though I was able to send around thirty 5-minute requests all at once without getting any errors.
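One way to fire off the requests concurrently is a thread pool from the standard library. This is a minimal sketch, not the repo's actual implementation: `transcribe_all` is a hypothetical helper, and `fake_transcribe` is a stand-in for whatever function really uploads one file to Watson. The default of 30 workers just mirrors the number of simultaneous requests mentioned above.

```python
from concurrent.futures import ThreadPoolExecutor


def transcribe_all(paths, transcribe_one, max_workers=30):
    """Run transcribe_one(path) for every path in parallel threads.

    Returns a dict mapping each path to its response, in input order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the inputs
        results = list(pool.map(transcribe_one, paths))
    return dict(zip(paths, results))


def fake_transcribe(path):
    """Stand-in for a real upload; returns an empty Watson-shaped response."""
    return {"source": path, "results": []}


paths = ["00000-00300.wav", "00300-00600.wav", "00600-00900.wav"]
responses = transcribe_all(paths, fake_transcribe)
print(list(responses))
```

Threads fit here because each request spends nearly all of its time waiting on the network, not on the CPU.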
Here are some sample results in the projects/ folder:
- The Republican Presidential Debate, South Carolina, Feb. 13, 2016
- Donald Trump's "Live Free or Die" commercial
- President Obama's Weekly Video Address, Oct. 31, 2015
Requirements
IBM Watson
The transcription power comes from IBM Watson's Speech-to-Text REST API. After cutting up a video into 5-minute segments, I then upload all of the audio files in parallel to Watson, which can complete the entire batch in not much more than 5 minutes.
Getting started with IBM Bluemix
You have to sign up for an IBM Bluemix account, which is free and doesn't require a credit card for the first month.
After signing up for Bluemix, you can find the console page for the speech-to-text API here, where you can get user credentials. This repo contains a sample file: credsfile_watson.SAMPLE.json
The pricing is pretty generous, in terms of testing things out: 1,000 minutes free each month. Every additional minute is $0.02 -- i.e. transcribing an hour's worth of audio will cost $1.20.
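To make the arithmetic concrete, here's a tiny sketch of the pricing as described above. The function name is made up, and it assumes the free 1,000 minutes are simply subtracted from the month's total before billing. Note that the $1.20-per-hour figure only applies once the free minutes are used up.

```python
FREE_MINUTES = 1000
CENTS_PER_EXTRA_MINUTE = 2  # $0.02 per minute beyond the free tier


def monthly_cost_dollars(minutes_used):
    """Estimated monthly bill in dollars for the given minutes of audio."""
    billable = max(0, minutes_used - FREE_MINUTES)
    return billable * CENTS_PER_EXTRA_MINUTE / 100


# A 90-minute debate inside the free tier costs nothing
print(monthly_cost_dollars(90))
# One extra hour on top of the 1,000 free minutes
print(monthly_cost_dollars(1060))
```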
Quickie Watson Testy!
Before you get into the Python stuff, you should see if you are properly initialized with Watson by making contact with it from the command-line (i.e. bash, i.e. uh not sure if it will work on Windows like this):
If you don't have a WAV file at hand, you can install the youtube-dl command-line tool:
$ pip install youtube-dl
And then download Trump's Live Free or Die commercial. The following command downloads a movie file, bb4TxjvQlh0.mkv, and extracts a WAV file named bb4TxjvQlh0.wav:
youtube-dl "https://www.youtube.com/watch?v=bb4TxjvQlh0" \
--keep-video \
--extract-audio \
--audio-format wav \
--audio-quality 16K \
--id
In the next step, I assume you have a file named bb4TxjvQlh0.wav, but you are free to use any WAV audio file.
(Note: the whole movie-file thing is totally ancillary...Watson doesn't care if the audio file comes from a movie or you recording into your microphone or whatever. But people like to transcribe videos, which is why I include the step.)
This next step is what contacts Watson's API. Replace USERNAME and PASSWORD with whatever credentials you got from the IBM Bluemix Developer Panel.
The --data-binary flag wants a file name (prepended with @).
When the audio file is uploaded and Watson returns a response, it will be saved to transcript.json:
curl -X POST \
-u USERNAME:PASSWORD \
-o transcript.json \
--header "Content-Type: audio/wav" \
--header "Transfer-Encoding: chunked" \
--data-binary "@bb4TxjvQlh0.wav" \
"https://stream.watsonplatform.net/speech-to-text/api/v1/recognize?continuous=true&timestamps=true&word_confidence=true&profanity_filter=false"
If this doesn't work for you, then either your Internet is down, Watson is down, or you don't have the proper user/password credentials.
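If you'd rather make the same call from Python, here's a rough equivalent of the curl command using only the standard library (the repo's own scripts use Requests, so treat this as a sketch, not the project's code). The function names are made up. It reuses the URL and query parameters from the curl example and sends the file in one piece, so it does not set the chunked transfer header.

```python
import base64
import urllib.request
from urllib.parse import urlencode

API_URL = "https://stream.watsonplatform.net/speech-to-text/api/v1/recognize"


def build_request(audio_bytes, username, password):
    """Build (but do not send) the POST request for one WAV file."""
    params = urlencode(
        {
            "continuous": "true",
            "timestamps": "true",
            "word_confidence": "true",
            "profanity_filter": "false",
        }
    )
    # HTTP Basic auth, the same thing curl's -u flag does
    creds = "{}:{}".format(username, password).encode("utf-8")
    token = base64.b64encode(creds).decode("ascii")
    return urllib.request.Request(
        API_URL + "?" + params,
        data=audio_bytes,
        headers={"Content-Type": "audio/wav", "Authorization": "Basic " + token},
        method="POST",
    )


def transcribe_file(wav_path, username, password, out_path="transcript.json"):
    """Upload a WAV file and save the raw JSON response to out_path."""
    with open(wav_path, "rb") as f:
        req = build_request(f.read(), username, password)
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as out:
        out.write(resp.read())


# Building the request does not touch the network
req = build_request(b"RIFF", "USERNAME", "PASSWORD")
print(req.get_method(), req.full_url)
```

Calling `transcribe_file("bb4TxjvQlh0.wav", "USERNAME", "PASSWORD")` would then do the actual upload.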
Python stuff
This project uses:
- Anaconda 3-2.4.0
- Python 3.5.1
- Requests
- moviepy - currently, just being used as a very nice wrapper around ffmpeg, to do audio-video conversion and extraction. But has a lot of potential for laughter and games via programmatic editing.
- moviepy will install ffmpeg if you don't already have it installed
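For reference, the audio extraction that moviepy handles here could also be done by calling ffmpeg directly. The sketch below builds an ffmpeg command for 16-bit, 16 kHz WAV output. The helper names are hypothetical and this is not how the repo's scripts do it (they go through moviepy).

```python
import subprocess


def ffmpeg_extract_command(video_path, wav_path, sample_rate=16000):
    """Build an ffmpeg command that extracts 16-bit PCM WAV audio."""
    return [
        "ffmpeg",
        "-i", video_path,          # input video
        "-vn",                     # drop the video stream
        "-acodec", "pcm_s16le",    # 16-bit little-endian PCM
        "-ar", str(sample_rate),   # audio sample rate in Hz
        wav_path,
    ]


def extract_audio(video_path, wav_path):
    """Run ffmpeg; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(ffmpeg_extract_command(video_path, wav_path), check=True)


cmd = ffmpeg_extract_command("bb4TxjvQlh0.mkv", "bb4TxjvQlh0.wav")
print(" ".join(cmd))
```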
Demonstrations
Republican Debate in South Carolina, Feb. 13, 2016
Check out the projects/republican-debate-sc-2016-02-13 folder in this repo to see the raw JSON response files and their corresponding .WAV audio, as extracted from the Feb. 13, 2016 Republican Presidential Candidate debate in South Carolina:
Donald Trump "Live Free or Die" commercial (39 seconds)
The commercial can be seen here on YouTube:
The project directory generated: projects/trump-nh/
Because the video is so short, the directory includes the video file, the extracted audio, as well as the segmented audio and raw Watson JSON responses. For this example, I made the segments 10 seconds long.
To compile the transcript text:
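import json
from glob import glob

# Sort the filenames so the segments print in chronological order
filenames = sorted(glob("./projects/trump-nh/transcripts/*.json"))
for fn in filenames:
    with open(fn, 'r') as t:
        data = json.load(t)
    # Each result lists alternatives; the first one is taken as the best guess
    for x in data['results']:
        best_alt = x['alternatives'][0]
        print(best_alt['transcript'])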
The result:
this great slogan of the Hampshire live free or die means so much
so many people all over the world they use that expression it means liberty it means freedom it means free enterprise
mean safe
the insecurity it means borders it means strong strong military where nobody's going to mess with us it means taking care of our vets
what a great slogan congradulations New Hampshire
wonderful job dnmt I and
Note that the last 3 tokens, dnmt I and, are a result of the Watson API getting confused by the dramatic music that closes the commercial. Luckily, the JSON response includes a confidence level for each word, along with timestamp data.
It actually is spot on for Trump's full closing sentence (not sure why "congradulations" is used...). The confidence levels for dnmt I and were comparatively very low. I think dnmt is some kind of code word used by the API to indicate something, not that Watson thought that dnmt was actually said (see the full JSON response here):
{
"word_confidence": [
[
"what",
0.9999999999999674
],
[
"a",
0.9999999999999672
],
[
"great",
0.999999999999967
],
[
"slogan",
0.9964234383591973
],
[
"congradulations",
0.7798716606178608
],
[
"New",
0.9999999999999933
],
[
"Hampshire",
0.9845177369977128
]
]
}
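Here's a minimal sketch of how those confidence values could be used to flag shaky words. The data is the `word_confidence` list shown above. The `low_confidence_words` helper and the 0.8 cutoff are arbitrary choices of mine for illustration, not something the Watson API prescribes.

```python
def low_confidence_words(word_confidence, threshold=0.8):
    """Return the words whose confidence falls below the threshold."""
    return [word for word, conf in word_confidence if conf < threshold]


# The word_confidence list from the JSON above
word_confidence = [
    ["what", 0.9999999999999674],
    ["a", 0.9999999999999672],
    ["great", 0.999999999999967],
    ["slogan", 0.9964234383591973],
    ["congradulations", 0.7798716606178608],
    ["New", 0.9999999999999933],
    ["Hampshire", 0.9845177369977128],
]

flagged = low_confidence_words(word_confidence)
print(flagged)
```

With this cutoff only "congradulations" gets flagged, which matches the one word in that sentence that looks off.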
President Obama weekly address for October 31, 2015 (3 minutes)
Here's a quick demonstration of Watson's accuracy given a weekly video address from President Obama (~3 minutes):
- Video landing page at Whitehouse.gov
- Video file: 103115_WeeklyAddress.mp4
- Audio file: 00000-00190.wav
- Watson JSON response: 00000-00190.json
- The produced file folder: projects/obama-weekly-address-2015-10-31/
(because President Obama's video address is just about 3 minutes long, only one audio file is extracted, and only one call to Watson's API is made)
Right now there's just a bunch of sloppy scripts that need to be refactored. There's a script named init.py that you can run from the command-line that will read an existing video file, create a project folder, cut up the audio, and do the transcriptions. It assumes that you have a file named credsfile_watson.json relative to init.py.
Some code for the command line, to download the file, then to run init.py:
curl -o "/tmp/obama-weekly-address-2015-10-31.mp4" \
https://www.whitehouse.gov/WeeklyAddress/2015/103115-QREDSC/103115_WeeklyAddress.mp4
python init.py /tmp/obama-weekly-address-2015-10-31.mp4
The output produced by init.py:
[MoviePy] Writing audio in /Users/dtown/watson-word-watcher/projects/obama-weekly-address-2015-10-31/full-audio.wav
[MoviePy] Done.
[MoviePy] Writing audio in /Users/dtown/watson-word-watcher/projects/obama-weekly-address-2015-10-31/audio-segments/00000-00190.wav
[MoviePy] Done.
Transcribe
The biggest bottleneck is transcribing the audio. The transcribe.py script does all the transcription in one big go:
python transcribe.py projects/obama-weekly-address-2015-10-31
Sending to Watson API:
/Users/dtown/watson-word-watcher/projects/obama-weekly-address-2015-10-31/audio-segments/00000-00190.wav
Transcribed:
/Users/dtown/watson-word-watcher/projects/obama-weekly-address-2015-10-31/transcripts/00000-00190.json
And then run these scripts for a quickie processing of the JSON transcript:
python compile.py projects/obama-weekly-address-2015-10-31
python rawtext.py projects/obama-weekly-address-2015-10-31
python analyze.py projects/obama-weekly-address-2015-10-31
The output:
hi everybody today there are two point two million people behind bars in America and millions more on parole or probation
every year we spend eighty billion
in taxpayer dollars
keep people incarcerated
many are nonviolent offender serving unnecessarily long sentences
I believe we can disrupt the pipeline from underfunded schools overcrowded jails
I believe we can address the disparities in the application of criminal justice from arrest rates to sentencing to incarceration
and I believe we can help those who have served their time and earned a second chance
get the support they need to become productive members of society
that's why over the course of this year I've been talking to folks around the country about reforming our criminal justice system
to make it smarter fairer and more effective
in February I sat down in the oval office with police officers from across the country
in the spring
I met with police officers and young people in Camden New Jersey where they're using community policing and data to drive down crime
over the summer I visited a prison in Oklahoma to talk with inmates and correction officers about rehabilitating prisoners
preventing more people from ending up there in the first place
two weeks ago I visit West Virginia to meet with families battling prescription drug heroin abuse
as well as people who are working on new solutions for treatment and rehabilitation
last week I traveled to Chicago to thank police chiefs from across the country for all that their officers do to protect Americans
to make sure they get the resources they need to get the job done
and to call for common sense gun safety reforms that would make officers and their communities safe
we know that having millions of people in the criminal justice system without any ability to find a job after release is unsustainable
it's bad for communities and it's bad for our economy
so on Monday I'll travel to Newark New Jersey to highlight efforts to help Americans
paid their debt to society re integrate back into their communities
everyone has a role to play for businesses that are hiring ex offenders
to philanthropies they're supporting education and training programs
and I'll keep working with people in both parties to get criminal justice reform bills to my desk
including a bipartisan bill that would reduce mandatory minimums for nonviolent drug offenders and reward prisoners
shorter sentences if they complete programs that make them less likely
commit a repeat offense
there's a reason good people across the country are coming together to reform our criminal justice system
because it's not about politics
it's about whether we as a nation live up to our founding ideals of liberty and justice for all
and working together we can make sure that we do
thanks everybody have a great weekend and have a safe and happy Halloween