
I am building an application that transcribes a live audio stream with Google Speech-to-Text and speaker diarization enabled (for background, see my previous question: 123). Ideally, the output should look like this:

00:00, speaker 1: 'Hello Peter, how old are you?'
00:08, speaker 2: 'Hello Mary, I am 20 years old.'
00:14, speaker 1: 'Where do you live?'
00:19, speaker 2: 'I live in New York.'

While my current Google STT setup transcribes the incoming audio reasonably well, speaker diarization does not work the way I expected. Google sends the full transcript with every response, but each time the speaker tags (speaker 1 and speaker 2) change for text that was already recognized. I have implemented Google's sample Python script:

    # google speech client is configured and instantiated before this

    response = client.recognize(config=config, audio=audio)

    result = response.results[-1]

    words_info = result.alternatives[0].words

    for word_info in words_info:
        print(
            u"word: '{}', speaker_tag: {}".format(word_info.word, word_info.speaker_tag)
        )

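For context, the client, config and audio referenced in the comment above are created roughly like this. This is a sketch rather than my exact setup: the sample rate, the local WAV file as input and the two-speaker count are placeholders.

    from google.cloud import speech

    client = speech.SpeechClient()

    # audio source: a local LINEAR16 WAV file is assumed here for illustration
    with open("conversation.wav", "rb") as f:
        audio = speech.RecognitionAudio(content=f.read())

    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=44100,
        language_code="en-US",
        enable_speaker_diarization=True,
        diarization_speaker_count=2,
    )
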
Here is sample output from the first response:

word: 'hey', speaker_tag: 1
word: 'Peter', speaker_tag: 1
word: 'hello', speaker_tag: 2
word: 'Mary', speaker_tag: 2

But the next response gives:

word: 'hey', speaker_tag: 1
word: 'Peter', speaker_tag: 1
word: 'hello', speaker_tag: 1
word: 'Mary', speaker_tag: 1
word: 'how', speaker_tag: 2
word: 'are', speaker_tag: 2
word: 'you', speaker_tag: 2
word: 'doing', speaker_tag: 2

Is the model continuously updated as new audio arrives as input? If so, what is a good way to build a transcription service for a single audio stream with multiple speakers?

I am not expecting a silver bullet, but I hope someone can point me in the right direction.


1 Answer


You should add `enable_word_time_offsets=True` to your config:

    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=44100,
        language_code="en-US",
        enable_speaker_diarization=True,
        enable_word_time_offsets=True,
        diarization_speaker_count=2,
    )

    for word_info in words_info:
        print(
            u"word: '{}', speaker_tag: '{}', start_time: '{}', end_time: '{}'".format(
                word_info.word, word_info.speaker_tag, word_info.start_time, word_info.end_time
            )
        )
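
With the word-level start times available, you can then group consecutive words that share a speaker tag and print them in the format asked for in the question. A rough sketch, assuming a recent google-cloud-speech client where start_time is returned as a datetime.timedelta (older versions expose .seconds/.nanos instead):

    # group consecutive words with the same speaker tag into one utterance
    utterances = []
    current_tag = None
    for word_info in words_info:
        if word_info.speaker_tag != current_tag:
            current_tag = word_info.speaker_tag
            start = int(word_info.start_time.total_seconds())
            utterances.append(("{:02d}:{:02d}".format(start // 60, start % 60), current_tag, []))
        utterances[-1][2].append(word_info.word)

    # print one line per utterance, e.g. "00:00, speaker 1: 'hey Peter'"
    for timestamp, tag, words in utterances:
        print("{}, speaker {}: '{}'".format(timestamp, tag, " ".join(words)))
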
Answered 2021-07-26T12:31:24.233