This is not a very specific programming question, but I am looking for a way to free my girlfriend (a video editor) from the 50% of her work that is repetitive and could probably be automated with code. Thanks a lot in advance for any innovative advice.
She works on kinetic typography. Behind the scenes it comes down to two very repetitive steps: 1. position a word in the layout, and 2. line up the word (a red track at the bottom) with the voice track. The whole video is basically these two steps combined again and again.
Right now she slows the voice down dramatically so that she can manually match the start time of each word to the point where that word is spoken in the voice track.
What I want:
Is there any mature tool (with a Python/R interface) that can do voice recognition? That is, given a voice file (mp3/wmv), it would generate a text file with the content of that voice file.
Would it also be possible to match each word with the time point where it appears in the voice track? In that case, the output of the Python script would be something like:
recognition starting
so 100ms
I 110ms
have 120ms
been 135ms
...
Something similar to the caption feature on YouTube, but on a single-word basis.
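To make the desired output concrete, here is a minimal sketch of only the final formatting step. It assumes that some recognizer (which I have not found yet) can hand back a list of (word, start time in ms) pairs; the function name `format_word_times` and the sample values are made up for illustration and do not come from any real library.

```python
def format_word_times(words):
    """Render (word, start_ms) pairs as 'word 100ms' lines, ordered by start time."""
    # Sort by start time in case the recognizer returns words out of order.
    ordered = sorted(words, key=lambda pair: pair[1])
    return "\n".join(f"{word} {start_ms}ms" for word, start_ms in ordered)


# Hypothetical recognizer output: made-up values matching the example above.
sample = [("so", 100), ("I", 110), ("have", 120), ("been", 135)]
print(format_word_times(sample))
```

The part I am missing is whatever produces the `sample` list from the voice file; once that exists, she could read the start times straight from the output instead of slowing the audio down.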
I know there won't be a perfect solution, but I would appreciate any advice or suggestions so that at least part of this boring work can be solved pragmatically.