A proof of concept using IBM's Speech-to-Text API to do quick-and-dirty transcriptions

Using IBM Watson's Speech-to-Text API to do Multi-Threaded Transcription of Really Long and Talky Videos, Such as Presidential Debates

A demonstration of how to use Python and IBM Watson's Speech-to-Text API to do some decently accurate transcription of real-world video and audio, at amazingly fast speeds.

Note: I'm just spit-balling code here, not making a user-friendly package. I'm focused on making an automated workflow to create fun supercuts of "The Wire"...and will polish the scripts and implementation later. These notes and scripts (and data files) are merely for your reference.

tl;dr

IBM Watson offers a REST-based Speech to Text API that allows free usage for the first 1,000 minutes each month (and $0.02 for each additional minute):

Watson Speech to Text can be used anywhere there is a need to bridge the gap between the spoken word and its written form. This easy-to-use service uses machine intelligence to combine information about grammar and language structure with knowledge of the composition of an audio signal to generate an accurate transcription. It uses IBM's speech recognition capabilities to convert speech in multiple languages into text. The transcription of incoming audio is continuously sent back to the client with minimal delay, and it is corrected as more speech is heard.

In my preliminary tests, it's not quite as good as Google Translate in terms of pure accuracy, but it's more than good enough for finding key words, whether they be relatively common verbs like "fight", "death", "kill" or proper nouns, such as Obama and countries of the world.

But it doesn't do too badly on very common (and aurally-ambiguous) short words such as pronouns and articles. Because Watson provides a confidence level for each word, it's possible to write scripts to programmatically filter out ambiguous words.
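
Here's a minimal sketch of that kind of filtering, assuming a raw Watson response saved as transcript.json, the standard shape of a Watson response (results, then alternatives, then word_confidence pairs, as in the JSON excerpt later in this README), and an arbitrary 0.9 cutoff:

import json

CONFIDENCE_THRESHOLD = 0.9  # arbitrary cutoff for this sketch

with open("transcript.json") as f:  # a raw Watson response
    data = json.load(f)

for result in data["results"]:
    best = result["alternatives"][0]
    # word_confidence is a list of [word, confidence] pairs
    # (present when the request sets word_confidence=true)
    for word, confidence in best.get("word_confidence", []):
        if confidence >= CONFIDENCE_THRESHOLD:
            print(word, round(confidence, 3))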

Here's a YouTube playlist of some automated supercuts I've created from the U.S. presidential primary debates. My favorite is probably this supercut of Senator Sanders and Secretary Clinton saying fighting words.

Here's the JSON returned from Watson, which includes word-by-word timestamps and confidence levels. Here's a simplified version of it, in which the JSON is just a flat list of words.

IBM Watson's API is robust enough to accept many concurrent requests. In the sample scripts I've included in this repo, I was able to break up a 90-minute debate into 5-minute segments and send them up to Watson simultaneously...resulting in a 6-to-7-minute processing time for the entire 90 minutes.

Some non-presidential examples:

Quick *nix check!

Before you look at the scary Python framework I've built for myself, you should first see if you can work with movie/audio files and connect to Watson using nothing but Unix tools: ffmpeg and good ol' curl. Check out this brief walkthrough.

Supercut fun

You probably want to see the final product. I'm too lazy to document all the code and haven't organized it yet, but here's one result: making supercuts by grepping the Watson Speech to Text data for certain words. For example, to find all "fighting words", e.g. war, wars, warriors, fight, bomb, kill, threat, terror, death, murder, torture:

 python supercut.py republican-debate-sc-2016-02-13 '\bwar(?:riors?|s)?\b|fight|bomb|kill|threat|terror|death|murder|tortur'
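
The supercut.py internals aren't documented here, but the core matching step boils down to something like this sketch, which assumes a compiled Watson response saved as transcript.json (the actual cutting of video clips with moviepy is left out):

import json
import re

# the same "fighting words" pattern from the command above
PATTERN = re.compile(r"\bwar(?:riors?|s)?\b|fight|bomb|kill|threat|terror|death|murder|tortur")

with open("transcript.json") as f:
    data = json.load(f)

clips = []
for result in data["results"]:
    best = result["alternatives"][0]
    # timestamps is a list of [word, start_seconds, end_seconds]
    # (present when the request sets timestamps=true)
    for word, start, end in best.get("timestamps", []):
        if PATTERN.search(word):
            clips.append((word, start, end))

for word, start, end in clips:
    print("{:8.2f} - {:8.2f}  {}".format(start, end, word))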

Here's a playlist of sample supercuts of presidential people:

  • Republican Debate, South Carolina, 2016-02-13
  • Democratic Debate, Wisconsin, 2016-02-11
  • Obama weekly address


The technical details

How it works

After you've downloaded a video file to disk, the assorted scripts and commands in this repo will:

  1. Convert the file to mp4 if necessary
  2. Create a project subfolder to store the video file and all derived audio and transcript files
  3. Extract the audio from the video as 16-bit, 16kHz WAV files
  4. Split the audio into segments (300 seconds each, by default) -- steps 3 and 4 are sketched below
  5. Send each of those segments to Watson's API to be analyzed and transcribed
  6. Save the raw responses from Watson's API for each audio file
  7. Compile all of the resulting responses into one data file, as if you had sent the entire audio file to be analyzed in a single go
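
Here's a minimal sketch of steps 3 and 4 using moviepy (the paths are made up for illustration; the repo's actual scripts do more bookkeeping):

from moviepy.editor import AudioFileClip

SEGMENT_SECONDS = 300  # the 5-minute default

# Step 3: extract the full audio track as 16-bit, 16kHz WAV,
# which is the format Watson's API expects
audio = AudioFileClip("projects/myproject/video.mp4")
audio.write_audiofile("projects/myproject/full-audio.wav",
                      fps=16000, nbytes=2, codec="pcm_s16le")

# Step 4: cut the track into fixed-length segments, named by
# their zero-padded start/end seconds, e.g. 00000-00300.wav
t = 0
while t < audio.duration:
    end = min(t + SEGMENT_SECONDS, audio.duration)
    fname = "projects/myproject/audio-segments/{:05d}-{:05d}.wav".format(int(t), int(end))
    audio.subclip(t, end).write_audiofile(fname, fps=16000, nbytes=2, codec="pcm_s16le")
    t = end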

The advantage of splitting up the audio is that it allows the transcription to be done in parallel. An hour-long audio track would probably take an hour to get a response back (if your internet connection doesn't fail), whereas 60 parallel requests to analyze 1 minute each will take roughly...1 minute to complete.

I haven't tested the upper bound on concurrent requests to Watson's API, though I was able to send around 30 five-minute requests all at once without getting any errors.
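
Here's a minimal sketch of that fan-out using requests and a thread pool. This is not the repo's actual transcribe.py; the credentials and paths are placeholders, and the URL parameters are the same ones used in the curl example later in this README:

import glob
import json
from concurrent.futures import ThreadPoolExecutor

import requests

WATSON_URL = ("https://stream.watsonplatform.net/speech-to-text/api/v1/recognize"
              "?continuous=true&timestamps=true&word_confidence=true")

def transcribe(wav_path):
    # one blocking POST per segment; Watson holds the connection
    # open until the transcription is finished
    with open(wav_path, "rb") as f:
        resp = requests.post(WATSON_URL,
                             auth=("USERNAME", "PASSWORD"),
                             headers={"Content-Type": "audio/wav"},
                             data=f)
    return wav_path, resp.json()

segments = sorted(glob.glob("projects/myproject/audio-segments/*.wav"))
# around 30 requests at once worked for me without errors
with ThreadPoolExecutor(max_workers=30) as pool:
    for wav_path, result in pool.map(transcribe, segments):
        out = wav_path.replace("audio-segments", "transcripts").replace(".wav", ".json")
        with open(out, "w") as f:
            json.dump(result, f, indent=2)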

Here are some sample results in the projects/ folder.

Requirements

IBM Watson

The transcription power comes from IBM Watson's Speech-to-Text REST API. After cutting up a video into 5-minute segments, I then upload all of the audio files in parallel to Watson, which can complete the entire batch in just about 5 minutes.

Getting started with IBM Bluemix

You have to sign up for an IBM Bluemix account, which is free and doesn't require a credit card for the first month.

After signing up for Bluemix, you can find the console page for the speech-to-text API here, where you can get user credentials. This repo contains a sample file: credsfile_watson.SAMPLE.json

The pricing is pretty generous, in terms of testing things out: 1,000 minutes free each month. Every additional minute is $0.02 -- i.e. transcribing an hour's worth of audio will cost $1.20.

Quickie Watson Testy!

Before you get into the Python stuff, you should see if you're properly set up with Watson by making contact with it from the command line (i.e. bash; not sure if this will work on Windows as written).

If you don't have a WAV file at hand, you can install the youtube-dl command-line tool:

$ pip install youtube-dl

And then download Trump's Live Free or Die commercial. The following command downloads a movie file, bb4TxjvQlh0.mkv, and extracts a WAV file named bb4TxjvQlh0.wav:

youtube-dl "https://www.youtube.com/watch?v=bb4TxjvQlh0" \
  --keep-video \
  --extract-audio \
  --audio-format wav \
  --audio-quality 16K \
  --id

In the next step, I assume you have a file named bb4TxjvQlh0.wav, but you are free to use any WAV audio file.

(Note: the whole movie-file thing is totally ancillary...Watson doesn't care whether the audio file comes from a movie or from you recording into your microphone or whatever. But people like to transcribe videos, which is why I include the step.)

This next step is what contacts Watson's API. Replace USERNAME and PASSWORD with whatever credentials you got from the IBM Bluemix Developer Panel.

The --data-binary flag wants a file name (prepended with @).

When the audio file is uploaded and Watson returns a response, the response will be saved to transcript.json:

curl -X POST \
     -u USERNAME:PASSWORD     \
     -o transcript.json        \
     --header "Content-Type: audio/wav"    \
     --header "Transfer-Encoding: chunked" \
     --data-binary "@bb4TxjvQlh0.wav"        \
     "https://stream.watsonplatform.net/speech-to-text/api/v1/recognize?continuous=true&timestamps=true&word_confidence=true&profanity_filter=false"

If this doesn't work for you, then either your Internet is down, Watson is down, or you don't have the proper user/password credentials.

Python stuff

This project uses:

  • Anaconda 3-2.4.0
  • Python 3.5.1
  • Requests
  • moviepy - currently just being used as a very nice wrapper around ffmpeg, to do audio-video conversion and extraction, but it has a lot of potential for laughter and games via programmatic editing.
    • moviepy will install ffmpeg if you don't already have it installed

Demonstrations

Republican Debate in South Carolina, Feb. 13, 2016

Check out the projects/republican-debate-sc-2016-02-13 folder in this repo to see the raw JSON response files and their corresponding .WAV audio, as extracted from the Feb. 13, 2016 Republican Presidential Candidate debate in South Carolina:

debate video on youtube

Donald Trump "Live Free or Die" commercial (39 seconds)

The commercial can be seen here on YouTube:

trumpvideo

The project directory generated: projects/trump-nh/

Because the video is so short, the directory includes the video file, the extracted audio, as well as the segmented audio and raw Watson JSON responses. For this example, I made the segments 10 seconds long.

To compile the transcript text:

import json
from glob import glob

# process the per-segment transcripts in chronological order
# (the filenames are zero-padded start-end timestamps, so a
# lexicographic sort works; glob alone doesn't guarantee order)
filenames = sorted(glob("./projects/trump-nh/transcripts/*.json"))

for fn in filenames:
    with open(fn) as t:
        data = json.load(t)
        for x in data['results']:
            # each result's first alternative is Watson's best guess
            best_alt = x['alternatives'][0]
            print(best_alt['transcript'])

The result:

this great slogan of the Hampshire live free or die means so much

so many people all over the world they use that expression it means liberty it means freedom it means free enterprise

mean safe

the insecurity it means borders it means strong strong military where nobody's going to mess with us it means taking care of our vets

what a great slogan congradulations New Hampshire

wonderful job dnmt I and

Note that the last 3 tokens, dnmt I and, are a result of the Watson API getting confused by the dramatic music that closes the commercial. Luckily, the JSON response includes, along with the timestamp data for each word, a confidence level as well.

It actually is spot-on for Trump's full closing sentence (not sure why "congradulations" is used...). The confidence levels for dnmt I and were comparatively very low; I think dnmt is some kind of code word used by the API to indicate something, not that Watson thought dnmt was actually said (see the full JSON response here):

{
    "word_confidence": [
        [
            "what",
            0.9999999999999674
        ],
        [
            "a",
            0.9999999999999672
        ],
        [
            "great",
            0.999999999999967
        ],
        [
            "slogan",
            0.9964234383591973
        ],
        [
            "congradulations",
            0.7798716606178608
        ],
        [
            "New",
            0.9999999999999933
        ],
        [
            "Hampshire",
            0.9845177369977128
        ]
    ]
}

President Obama weekly address for October 31, 2015 (3 minutes)

Here's a quick demonstration of Watson's accuracy given a weekly video address from President Obama (~3 minutes):

(Because President Obama's video address is just about 3 minutes long, only one audio file is extracted, and only one call to Watson's API is made.)

Right now there's just a bunch of sloppy scripts that need to be refactored. There's a script named init.py that you can run from the command-line that will read an existing video file, create a project folder, cut up the audio, and do the transcriptions. It assumes that you have a file named credsfile_watson.json relative to init.py.

Some code for the command line, to download the file and then run init.py:

curl -o "/tmp/obama-weekly-address-2015-10-31.mp4" \
  https://www.whitehouse.gov/WeeklyAddress/2015/103115-QREDSC/103115_WeeklyAddress.mp4

python init.py /tmp/obama-weekly-address-2015-10-31.mp4

The output produced by init.py:

[MoviePy] Writing audio in /Users/dtown/watson-word-watcher/projects/obama-weekly-address-2015-10-31/full-audio.wav
[MoviePy] Done.                                                                                            
[MoviePy] Writing audio in /Users/dtown/watson-word-watcher/projects/obama-weekly-address-2015-10-31/audio-segments/00000-00190.wav
[MoviePy] Done.  

Transcribe

The biggest bottleneck is transcribing the audio. The transcribe.py script does all the transcription in one big go:

python transcribe.py projects/obama-weekly-address-2015-10-31
Sending to Watson API:
   /Users/dtown/watson-word-watcher/projects/obama-weekly-address-2015-10-31/audio-segments/00000-00190.wav
Transcribed:
   /Users/dtown/watson-word-watcher/projects/obama-weekly-address-2015-10-31/transcripts/00000-00190.json

And then run these scripts for a quickie processing of the JSON transcript:

python compile.py projects/obama-weekly-address-2015-10-31
python rawtext.py projects/obama-weekly-address-2015-10-31
python analyze.py projects/obama-weekly-address-2015-10-31

The output:

hi everybody today there are two point two million people behind bars in America and millions more on parole or probation

every year we spend eighty billion

in taxpayer dollars

keep people incarcerated

many are nonviolent offender serving unnecessarily long sentences

I believe we can disrupt the pipeline from underfunded schools overcrowded jails

I believe we can address the disparities in the application of criminal justice from arrest rates to sentencing to incarceration

and I believe we can help those who have served their time and earned a second chance

get the support they need to become productive members of society

that's why over the course of this year I've been talking to folks around the country about reforming our criminal justice system

to make it smarter fairer and more effective

in February I sat down in the oval office with police officers from across the country

in the spring

I met with police officers and young people in Camden New Jersey where they're using community policing and data to drive down crime

over the summer I visited a prison in Oklahoma to talk with inmates and correction officers about rehabilitating prisoners

preventing more people from ending up there in the first place

two weeks ago I visit West Virginia to meet with families battling prescription drug heroin abuse

as well as people who are working on new solutions for treatment and rehabilitation

last week I traveled to Chicago to thank police chiefs from across the country for all that their officers do to protect Americans

to make sure they get the resources they need to get the job done

and to call for common sense gun safety reforms that would make officers and their communities safe

we know that having millions of people in the criminal justice system without any ability to find a job after release is unsustainable

it's bad for communities and it's bad for our economy

so on Monday I'll travel to Newark New Jersey to highlight efforts to help Americans

paid their debt to society re integrate back into their communities

everyone has a role to play for businesses that are hiring ex offenders

to philanthropies they're supporting education and training programs

and I'll keep working with people in both parties to get criminal justice reform bills to my desk

including a bipartisan bill that would reduce mandatory minimums for nonviolent drug offenders and reward prisoners

shorter sentences if they complete programs that make them less likely

commit a repeat offense

there's a reason good people across the country are coming together to reform our criminal justice system

because it's not about politics

it's about whether we as a nation live up to our founding ideals of liberty and justice for all

and working together we can make sure that we do

thanks everybody have a great weekend and have a safe and happy Halloween

You can compare it to the transcript here.
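
If you'd rather quantify the comparison than eyeball it, here's a minimal sketch using the standard library's difflib, assuming you've saved the official transcript as official.txt and the rawtext.py output as watson.txt (both filenames are hypothetical):

import difflib

with open("official.txt") as f:
    official = f.read().lower().split()
with open("watson.txt") as f:
    watson = f.read().lower().split()

# a rough word-level similarity ratio between the two transcripts
matcher = difflib.SequenceMatcher(None, official, watson)
print("Word-level similarity: {:.1%}".format(matcher.ratio()))

# show the stretches where the two transcripts disagree
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag != "equal":
        print("{}: {!r} -> {!r}".format(
            tag, " ".join(official[i1:i2]), " ".join(watson[j1:j2])))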
