voice2json is a collection of command-line tools for offline speech/intent recognition on Linux. It is free, open source (MIT), and supports 18 human languages.
From the command-line:
$ voice2json -p en transcribe-wav \
< turn-on-the-light.wav | \
voice2json -p en recognize-intent | \
jq .
produces a JSON event like:
{
  "text": "turn on the light",
  "intent": {
    "name": "LightState"
  },
  "slots": {
    "state": "on"
  }
}
when trained with this template:
[LightState]
states = (on | off)
turn (<states>){state} [the] light
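To make the link between the template and the JSON event concrete, here is a minimal Python sketch that enumerates the sentences this one template accepts, along with the slot value each sentence carries. It hard-codes the structure of the rule above instead of parsing the template language, and the names `STATES` and `expand_light_state` are illustrative only. This is not how voice2json itself compiles templates.

```python
from itertools import product

# Alternatives from the rule `states = (on | off)`.
STATES = ["on", "off"]


def expand_light_state():
    """Yield (sentence, slots) pairs for `turn (<states>){state} [the] light`."""
    # `[the]` is optional, so it contributes either the word or nothing.
    for state, article in product(STATES, ["the", ""]):
        words = ["turn", state, article, "light"]
        sentence = " ".join(w for w in words if w)
        # `{state}` tags the matched alternative as the `state` slot.
        yield sentence, {"state": state}


for sentence, slots in expand_light_state():
    print(sentence, slots)
```

The template covers four sentences (two states, with or without "the"), and each one maps to a single `state` slot value, which is what appears under `slots` in the event.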
voice2json is optimized for:
- Sets of voice commands that are described well by a grammar
- Commands with uncommon words or pronunciations
- Commands or intents that can vary at runtime
It can be used to:
- Add voice commands to existing applications or Unix-style workflows
- Provide basic voice assistant functionality completely offline on modest hardware
- Bootstrap more sophisticated speech/intent recognition systems
Supported speech to text systems include:
- CMU's pocketsphinx
- Dan Povey's Kaldi
- Mozilla's DeepSpeech 0.9
- Kyoto University's Julius
Supported Languages
- Catalan (ca)
- Czech (cs)
- German (de)
- Greek (el)
- English (en)
- Spanish (es)
- French (fr)
- Hindi (hi)
- Italian (it)
- Korean (ko)
- Kazakh (kz)
- Dutch (nl)
  - nl_kaldi-cgn (default)
  - nl_kaldi-rhasspy
  - nl_pocketsphinx-cmu
- Polish (pl)
  - pl_deepspeech-jaco (default)
  - pl_julius-github
- Portuguese (pt)
- Russian (ru)
  - ru_kaldi-rhasspy (default)
  - ru_pocketsphinx-cmu
- Swedish (sv)
  - sv_kaldi-montreal
  - sv_kaldi-rhasspy (default)
- Vietnamese (vi)
- Mandarin (zh)
Unique Features
voice2json is more than just a wrapper around open source speech to text systems!
- Training produces both a speech and intent recognizer. By describing your voice commands with voice2json's templating language, you get more than just transcriptions for free.
- Re-training is fast enough to be done at runtime (usually < 5s), even up to millions of possible voice commands. This means you can change referenced slot values or add/remove intents on the fly.
- All of the available commands are designed to work well in Unix pipelines, typically consuming/emitting plaintext or newline-delimited JSON. Audio input/output is file-based, so you can receive audio from any source.
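Because the tools emit newline-delimited JSON, a downstream consumer can be a short script at the end of a pipeline. The sketch below assumes events shaped like the `LightState` example at the top of this page (`text`, `intent.name`, `slots`). The function names `handle_event` and `process_lines` are hypothetical and not part of voice2json.

```python
import json


def handle_event(event):
    """Map one recognized-intent event to an action string."""
    name = event.get("intent", {}).get("name", "")
    slots = event.get("slots", {})
    if name == "LightState":
        return "light -> " + slots.get("state", "unknown")
    return "unhandled: " + (name or "no intent")


def process_lines(lines):
    """Lazily parse newline-delimited JSON, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if line:
            yield handle_event(json.loads(line))


# Demo with one in-memory line; in a pipeline, pass sys.stdin instead.
sample = [
    '{"text": "turn on the light", '
    '"intent": {"name": "LightState"}, "slots": {"state": "on"}}\n'
]
for action in process_lines(sample):
    print(action)
```

Since `process_lines` is a generator, each event is handled as soon as its line arrives, so the same function should suit both one-shot WAV input and a long-running stream when called as `process_lines(sys.stdin)`.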
Commands
- download-profile - Download missing files for a profile
- train-profile - Generate speech/intent artifacts
- transcribe-wav - Transcribe WAV file to text
  - Add --open for unrestricted speech to text
- transcribe-stream - Transcribe live audio stream to text
  - Add --open for unrestricted speech to text
- recognize-intent - Recognize intent from JSON or text
- wait-wake - Listen to live audio stream for wake word
- record-command - Record voice command from live audio stream
- pronounce-word - Look up or guess how a word is pronounced
- generate-examples - Generate random intents
- record-examples - Generate and record speech examples
- test-examples - Test recorded speech examples
- show-documentation - Run HTTP server locally with documentation
- print-profile - Print profile settings
- print-downloads - Print profile file download information
- print-files - Print user profile files for backup
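The commands can also be driven from another program. The following sketch chains transcribe-wav and recognize-intent from Python, mirroring the shell pipeline at the top of this page. It assumes `voice2json` is on the `PATH` and that the `en` profile has been downloaded and trained. The helper name `recognize_wav` and its `run` parameter (injectable so the function can be exercised without voice2json installed) are illustrative, not part of the project.

```python
import json
import subprocess


def recognize_wav(wav_bytes, profile="en", run=subprocess.run):
    """Pipe WAV audio through transcribe-wav, then recognize-intent."""
    base = ["voice2json", "-p", profile]
    # Step 1: WAV bytes on stdin, JSON transcription on stdout.
    transcribed = run(
        base + ["transcribe-wav"],
        input=wav_bytes,
        stdout=subprocess.PIPE,
        check=True,
    )
    # Step 2: feed that JSON straight into the intent recognizer.
    recognized = run(
        base + ["recognize-intent"],
        input=transcribed.stdout,
        stdout=subprocess.PIPE,
        check=True,
    )
    # Assumes a single JSON object comes back for a single WAV input.
    return json.loads(recognized.stdout)
```

A caller would read a file such as `turn-on-the-light.wav` in binary mode and pass its contents to `recognize_wav`, then inspect `event["intent"]["name"]` and `event["slots"]` on the returned dictionary.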