autotimecode
Video to aligned timecode (SRT), transcription, and translation in 4 clicks.
Minimally intrusive to your current workflow. Granular API exposure. Modularized.
Run it!
Make sure you have Docker Compose installed: refer to https://docs.docker.com/compose/install/ for instructions. Of course you also need Docker itself - refer to https://docs.docker.com/install/ for instructions.
Configure CELERY_BROKER_URL and MONGO_URL as environment variables, and run
docker-compose build && docker-compose up
Wait until "DeepSegment is loaded!" and "DeepCorrect is loaded!" show up in the docker-compose output. This may take longer on CPU machines.
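If you script against the API, you may want to block until the services are actually serving before sending requests. Below is a minimal sketch, assuming the API is exposed at http://localhost:5000 - that host/port is an assumption, so adjust it to whatever your docker-compose.yml publishes.

```python
# Hypothetical readiness check: the base URL/port below is an assumption,
# not taken from the project's docker-compose.yml.
import time
import requests

BASE = "http://localhost:5000"

def wait_for_api(timeout=600, interval=5):
    """Poll the API until it answers; model loading can take a while on CPU."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            requests.get(BASE, timeout=5)
            return True
        except requests.ConnectionError:
            time.sleep(interval)
    return False

if __name__ == "__main__":
    print("API ready" if wait_for_api() else "API did not come up in time")
```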
API documentation is located at https://github.com/cnbeining/autotimecode/blob/master/autotimecode_api/README.MD .
Recommended Subtitle Workflow
Note that this workflow is based on ACICFG's recommendations: adjust it to fit your needs.
- Get a rough version of the timecode from the video: covered by this project's /vad/ endpoint.
- Transcribe the video (with the help of STT), then roughly edit the SRT to include any time ranges missing from the first step. Model building is NOT the target of this project - check the /stt/ endpoint for a voice recognition helper.
- From the transcribed SRT, generate an SRT with accurate timecode: the /fa/ endpoint.
- Continue with translation (perhaps with the help of Machine Translation): check the /nmt/ endpoint. A sketch of the full pipeline follows below.
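The following Python sketch chains these endpoints with requests. The base URL, form-field names, and synchronous text responses are illustrative assumptions only, not the documented API contract - see the API documentation linked above for the actual schema.

```python
# Illustrative sketch only: the base URL, form-field names and synchronous
# text responses are assumptions, not the documented API contract.
import requests

BASE = "http://localhost:5000"   # assumed port; check docker-compose.yml
VIDEO = "episode.mp4"            # placeholder input file

def post(endpoint, srt=None):
    """POST the video (and, optionally, an SRT) to one autotimecode endpoint."""
    with open(VIDEO, "rb") as f:
        data = {"srt": srt} if srt is not None else None
        resp = requests.post(f"{BASE}{endpoint}", files={"file": f}, data=data)
    resp.raise_for_status()
    return resp.text

rough_srt = post("/vad/")                    # 1. rough timecode from speech activity
transcribed_srt = post("/stt/", rough_srt)   # 2. STT helper; hand-edit the result
aligned_srt = post("/fa/", transcribed_srt)  # 3. forced alignment -> accurate timecode
translated_srt = post("/nmt/", aligned_srt)  # 4. machine-translation starting point

with open("episode.aligned.srt", "w", encoding="utf-8") as out:
    out.write(translated_srt)
```

Since the stack runs on Celery, the real endpoints may well be asynchronous (submit a task, then poll for the result); treat the snippet above as pseudocode for the order of operations rather than the exact request shapes.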
Background
This project solves 4 problems:
- Given a video, generate timecode indicating when human speech occurs;
- Given a video and timecode, transcribe the video automatically;
- Given rough timecode, generate accurate timecode aligned with the video;
- Given a transcription, generate a translation.
FAQ
Where is Speech to Text (STT)?
An STT helper is available at the /stt/ endpoint.
STT model training is out of the scope of this project, as this project focuses on timecode generation and alignment.
Why include Kaldi and ffmpeg twice in different images?
- The goal is for every component of this project to be reusable on its own.
- The two Kaldi builds are not the same version - which is also why PyKaldi was passed over.
Docker Compose is taking a minute to come up!
TensorFlow Serving does not really mix with custom Keras layers, so the models are loaded inside the service at startup (hence the DeepSegment/DeepCorrect loading messages above).
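Concretely, a custom Keras layer has to be passed in explicitly whenever the model is deserialized. A minimal, hypothetical sketch (the layer and file names below are invented, not this project's actual models):

```python
# Hypothetical illustration: "ScaledReLU" is an invented custom layer; the real
# DeepSegment/DeepCorrect models have their own custom pieces.
import tensorflow as tf
from tensorflow.keras import layers, models

class ScaledReLU(layers.Layer):
    """Toy custom layer: a class Keras cannot deserialize without help."""
    def call(self, inputs):
        return tf.nn.relu(inputs) * 2.0

model = models.Sequential([layers.Dense(8, input_shape=(4,)), ScaledReLU()])
model.save("toy.h5")

# Reloading needs the custom class passed explicitly; per the FAQ above, models
# like this are loaded inside the service at startup rather than handed to
# TensorFlow Serving, which is why the first start takes a while.
reloaded = models.load_model("toy.h5", custom_objects={"ScaledReLU": ScaledReLU})
```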
How can I finetune your models?
Stay tuned.
Where is Japanese/Chinese/xxxese/xxxlish support?
The authors are working hard to make it happen. Again, stay tuned!
TODO
- Multiple language support
- Add Google Drive support
- Add ASS download support
Authors
- David Zhuang, https://www.cnbeining.com/ , https://github.com/cnbeining . Coded this thing and productionized the ML models involved.
- Yuan-Hang Zhang, https://www.sailorzhang.com/ , https://github.com/sailordiary . Designed the ML algorithms.
The authors are members of, and acknowledge the help from, ACICFG.
License
GPL 3.0. Please contact the authors if you need a different license.
Please retrieve copies of the licenses from the respective repo links.
Gentle is written by @lowerquality, MIT license, https://github.com/lowerquality/gentle .
Kaldi is located at https://kaldi-asr.org/ , Apache 2.0 license.
ffsend.py was originally written by Robert Xiao ([email protected]), https://github.com/nneonneo/ffsend , and is licensed under the Mozilla Public License 2.0. If you have concerns, remove this file and disable Firefox Send.
The ffsend binary is provided by Tim Visée, https://github.com/timvisee/ffsend , GPL 3.0. If you have concerns, please remove this file.
VAD engine is based on work of Hebbar, R., Somandepalli, K., & Narayanan, S. (2019). Robust Speech Activity Detection in Movie Audio: Data Resources and Experimental Evaluation. ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). doi: 10.1109/icassp.2019.8682532 . Original code can be retrieved at https://github.com/usc-sail/mica-speech-activity-detection .
txt2txt, deepcorrect and deepsegment were written by Bedapudi Praneeth, https://github.com/bedapudi6788 , GPL 3.0.
STT and NMT technologies are provided by Google.
Some STT code is originally from https://github.com/agermanidis/autosub , MIT license.