audio - Google Speech Recognition API: timestamp for each word?

Question

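It is possible to get a transcript of an audio file (WAV, MP3, etc.) from Google's Speech Recognition API by sending a request to http://www.google.com/speech-api/v2/recognize?...

Example: I said "one two three four five" in a WAV file. The Google API gave me this: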

{
  u'alternative':
  [
    {u'transcript': u'12345'},
    {u'transcript': u'1 2 3 4 5'},
    {u'transcript': u'one two three four five'}
  ],
  u'final': True
}

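Question: is it possible to get the time (in seconds) at which each word was spoken?

With my example:

['one', 0.23, 0.80], ['two', 1.03, 1.45], ['three', 1.79, 2.35], etc.

That is, the word "one" was spoken between 00:00:00.23 and 00:00:00.80, and the word "two" between 00:00:01.03 and 00:00:01.45.

PS: I am looking for an API that supports languages other than English, especially French.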

score 15 · Accepted Answer

I believe the other answer is now out of date. This is now possible with the Google Cloud Speech API: https://cloud.google.com/speech/docs/async-time-offsets

score 13 · Accepted Answer

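Edit 2020: it is now possible, see the other answers.

It is not possible with the Google API.

If you want word timestamps, you can use other APIs, for example:

Vosk-API - a free offline speech recognition API (disclosure: I am the primary author of Vosk).

SpeechMatics SaaS speech recognition API

IBM's speech recognition API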
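As a rough illustration of what word timestamps from such an API look like: Vosk's Python bindings return recognition results as JSON strings, and, as far as I know, with word output enabled each entry under "result" carries "word", "start" and "end" fields in seconds. The sketch below only parses such a string into the format the question asks for. The sample JSON and the helper name words_with_times are made up for illustration, not real recognizer output.

```python
import json

# Hypothetical sample in the shape Vosk is assumed to emit with word output enabled.
SAMPLE_RESULT = '''
{
  "result": [
    {"conf": 1.0, "start": 0.23, "end": 0.80, "word": "one"},
    {"conf": 1.0, "start": 1.03, "end": 1.45, "word": "two"},
    {"conf": 1.0, "start": 1.79, "end": 2.35, "word": "three"}
  ],
  "text": "one two three"
}
'''


def words_with_times(result_json):
    """Return [word, start_seconds, end_seconds] for each recognized word."""
    data = json.loads(result_json)
    # "result" may be missing when nothing was recognized, so default to an empty list.
    return [[w["word"], w["start"], w["end"]] for w in data.get("result", [])]


print(words_with_times(SAMPLE_RESULT))
```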

score 9 · Accepted Answer

Yes, it is very much possible. All you need to do is:

In the config, set enable_word_time_offsets=True

config = types.RecognitionConfig(
        ....
        enable_word_time_offsets=True)
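If you call the REST interface instead of the Python client, the same option is spelled enableWordTimeOffsets inside the request's config object. Below is a minimal sketch of building such a request body. The function name build_request_body and the bucket path are hypothetical, and fr-FR is chosen only because the question asks about French. The encoding and sample-rate fields are left out here; depending on the audio format they may need to be supplied.

```python
import json


def build_request_body(gcs_uri, language_code="fr-FR"):
    """Build a JSON-serializable recognition request with word offsets enabled."""
    return {
        "config": {
            "languageCode": language_code,
            # REST spelling of enable_word_time_offsets
            "enableWordTimeOffsets": True,
        },
        "audio": {"uri": gcs_uri},
    }


# Hypothetical bucket and file name, for illustration only.
body = build_request_body("gs://my-bucket/numbers.wav")
print(json.dumps(body, indent=2))
```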

Then, for each word in an alternative, you can print its start time and end time, as in the following code:

# Iterate over each result in the recognition response held in `result`.
for speech_result in result.results:
    alternative = speech_result.alternatives[0]  # first (most likely) alternative
    print('Transcript: {}'.format(alternative.transcript))
    print('Confidence: {}'.format(alternative.confidence))

    # Each word carries start and end offsets with `seconds` and `nanos` fields.
    for word_info in alternative.words:
        word = word_info.word
        start_time = word_info.start_time
        end_time = word_info.end_time
        print('Word: {}, start_time: {}, end_time: {}'.format(
            word,
            start_time.seconds + start_time.nanos * 1e-9,
            end_time.seconds + end_time.nanos * 1e-9))

This gives you output in the following format:

Transcript:  Do you want me to give you a call back?
Confidence: 0.949534416199
Word: Do, start_time: 1466.0, end_time: 1466.6
Word: you, start_time: 1466.6, end_time: 1466.7
Word: want, start_time: 1466.7, end_time: 1466.8
Word: me, start_time: 1466.8, end_time: 1466.9
Word: to, start_time: 1466.9, end_time: 1467.1
Word: give, start_time: 1467.1, end_time: 1467.2
Word: you, start_time: 1467.2, end_time: 1467.3
Word: a, start_time: 1467.3, end_time: 1467.4
Word: call, start_time: 1467.4, end_time: 1467.6
Word: back?, start_time: 1467.6, end_time: 1467.7

Source: https://cloud.google.com/speech-to-text/docs/async-time-offsets
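To get from these word_info objects to the ['one', 0.23, 0.80] format the question asks for, something like the sketch below should work. It is not taken from Google's documentation. It assumes each word exposes word, start_time and end_time, and that an offset is either a duration-like object with seconds and nanos (as in the loop above) or a datetime.timedelta, which I believe newer versions of the client library return. The SimpleNamespace objects are stand-ins for illustration, not real API responses.

```python
from datetime import timedelta
from types import SimpleNamespace


def to_seconds(offset):
    """Convert a word offset to float seconds."""
    if isinstance(offset, timedelta):
        return offset.total_seconds()
    # Otherwise assume a duration-like object with `seconds` and `nanos` fields.
    return offset.seconds + offset.nanos * 1e-9


def word_timings(words):
    """Return [word, start_seconds, end_seconds] entries, rounded to 2 decimals."""
    return [
        [w.word, round(to_seconds(w.start_time), 2), round(to_seconds(w.end_time), 2)]
        for w in words
    ]


# Stand-in word objects covering both assumed offset representations.
words = [
    SimpleNamespace(word="one",
                    start_time=timedelta(seconds=0.23),
                    end_time=timedelta(seconds=0.80)),
    SimpleNamespace(word="two",
                    start_time=SimpleNamespace(seconds=1, nanos=30_000_000),
                    end_time=SimpleNamespace(seconds=1, nanos=450_000_000)),
]

print(word_timings(words))
```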
