python - Vosk in Python: getting the position of transcribed text in an audio file

Question

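Using a script very similar to test_ffmpeg.py from the Vosk repository, I am exploring what text information I can get out of an audio file.

Here is the complete script I am using.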

#!/usr/bin/env python3

from vosk import Model, KaldiRecognizer, SetLogLevel
import sys
import os
import wave
import subprocess
import json

SetLogLevel(0)

if not os.path.exists("model"):
    print("Please download the model from https://alphacephei.com/vosk/models and unpack as 'model' in the current folder.")
    sys.exit(1)

sample_rate=16000
model = Model("model")
rec = KaldiRecognizer(model, sample_rate)

process = subprocess.Popen(['ffmpeg', '-loglevel', 'quiet', '-i',
                            sys.argv[1],
                            '-ar', str(sample_rate) , '-ac', '1', '-f', 's16le', '-'],
                            stdout=subprocess.PIPE)

# Write one paragraph per finalized utterance to <input file>.txt
with open(sys.argv[1] + ".txt", "w+") as file:
    while True:
        # 4000 bytes of 16-bit mono audio per read
        data = process.stdout.read(4000)
        if len(data) == 0:
            break
        if rec.AcceptWaveform(data):
            file.write(json.loads(rec.Result())['text'] + "\n\n")
            #print(rec.Result())
        #else:
            #print(rec.PartialResult())
    # Flush whatever the recognizer is still holding at end of input
    file.write(json.loads(rec.FinalResult())['text'])

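This example works well. However, the only thing I can find returned by rec.PartialResult() and rec.Result() is a JSON string holding a dictionary with the result. Is there a way to ask the KaldiRecognizer at what time in the audio file each individual word was found?

As I type this, I am already thinking about elaborating on the results: detecting how the partial result changes relative to the current sample would give me what I want. I am posting this here anyway in case it is already implemented.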

score 0 · Accepted Answer

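After some testing, it is clear that the output of ffmpeg is stable enough relative to the defined sample rate (16000): with 16-bit mono samples, a read of 4000 bytes is 2000 samples, which is one eighth of a second. I created a counter in the while loop and multiply it by a constant derived from the sample rate. If you change the arguments passed to ffmpeg, this may throw the timing off.

I used some very stone-age string comparison to print only when the partial result changes, and to print only the newly added characters.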
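The one-eighth-of-a-second figure follows from the ffmpeg arguments in the question (16-bit samples from s16le, one channel from -ac 1). A minimal sketch of that arithmetic, where chunk_seconds is a hypothetical helper name and not part of Vosk:

```python
def chunk_seconds(num_bytes, sample_rate, bytes_per_sample=2, channels=1):
    """Duration in seconds of a block of raw PCM audio."""
    # Each sample frame occupies bytes_per_sample * channels bytes.
    frames = num_bytes / (bytes_per_sample * channels)
    return frames / sample_rate


print(chunk_seconds(4000, 16000))  # 0.125
```

If the ffmpeg options change (for example stereo output or a different sample format), the defaults here would need to change with them.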

counter = 0
countinc = 2000/sample_rate  # seconds per read: 4000 bytes = 2000 16-bit mono samples
lastPR = ""
thisPR = ""
while True:
    data = process.stdout.read(4000)
    if len(data) == 0:
        break
    counter += 1
    rec.AcceptWaveform(data)
    thisPR = json.loads(rec.PartialResult())['partial']
    if lastPR != thisPR:
        # counter*countinc is the time in seconds at the end of the chunk just read
        print(counter*countinc, thisPR[len(lastPR):len(thisPR)])
        lastPR = thisPR
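
The slice in the loop above assumes each partial result only ever appends to the previous one. If the partial text is reset or revised between reads (an assumption here, for example once an utterance is finalized), comparing word by word is a bit more forgiving. A rough sketch; added_words is a hypothetical helper, not part of Vosk:

```python
def added_words(last, this):
    """Return the words of `this` that follow its common leading words with `last`."""
    last_words = last.split()
    this_words = this.split()
    common = 0
    # Count how many leading words the two partial results share.
    for a, b in zip(last_words, this_words):
        if a != b:
            break
        common += 1
    return this_words[common:]


print(added_words("one two", "one two three"))  # ['three']
```

Inside the loop, the print could then show " ".join(added_words(lastPR, thisPR)) instead of the raw slice, and skip the line when the list is empty.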
