c# - How to find the audio format of the voice selected in SpeechSynthesizer

Question

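In a C# text-to-speech application I use the SpeechSynthesizer class, which has an event named SpeakProgress that fires for every spoken word. For some voices, however, the e.AudioPosition parameter is out of sync with the output audio stream, and the output wave file plays faster than that position indicates (see this related question).

Anyway, I am trying to find exact information about the bit rate and the other properties of the selected voice. In my experience, if I can initialize the wave file with this information, the synchronization problem goes away. However, if I cannot find this information in SupportedAudioFormats, I know of no other way to get it. For example, the "Microsoft David Desktop" voice lists no supported formats in its VoiceInfo, yet it appears to support PCM at 16000 Hz, 16-bit.

How can I find the audio format of the voice selected in SpeechSynthesizer?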

 var formats = CurVoice.VoiceInfo.SupportedAudioFormats;

 if (formats.Count > 0)
 {
     var format = formats[0];
     reader.SetOutputToWaveFile(CurAudioFile, format);
 }
 else
 {
     var format = // How can I find it if the voice does not report one?
     reader.SetOutputToWaveFile(CurAudioFile, format);
 }

score 3 · Accepted Answer

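Update: this answer was edited after investigation. Originally I suggested from memory that SupportedAudioFormats probably just comes from (possibly misconfigured) registry data. Investigation showed that, for me, this is indeed the case on Windows 7, and it was corroborated on Windows 8.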

Issues with SupportedAudioFormats

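System.Speech wraps the venerable COM speech API (SAPI), and some voices are registered as 32-bit versus 64-bit, or may be misconfigured (in the registry of a 64-bit machine, HKLM/Software/Microsoft/Speech/Voices versus HKLM/Software/Wow6432Node/Microsoft/Speech/Voices).

I pointed ILSpy at System.Speech and its VoiceInfo class, and I am quite convinced that SupportedAudioFormats comes solely from registry data. It is therefore possible to get zero results back from SupportedAudioFormats if your TTS engine is not properly registered for your application's platform target (x86, Any CPU, or 64-bit), or if the vendor simply does not provide this information in the registry.

A voice may still support different, additional, or fewer formats, because that depends on the speech engine (code) rather than the registry (data). So it can be a shot in the dark. The standard Windows voices are usually more consistent in this respect than third-party voices, but they still do not necessarily provide a useful SupportedAudioFormats.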

Finding this information the hard way

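I found that it is still possible to get the current format of the current voice, but this relies on reflection to access the internals of the System.Speech SAPI wrapper.

Consequently, this is very fragile code, and I do not recommend using it in production.

Note: the code below requires that you have called Speak() once for setup; without a Speak() call, more calls would be needed to force the setup. However, I can call Speak("") to say nothing, and that works fine.

Implementation: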

using System.Reflection;
using System.Runtime.InteropServices;
using System.Speech.Synthesis;

[StructLayout(LayoutKind.Sequential)]
struct WAVEFORMATEX
{
    public ushort wFormatTag;
    public ushort nChannels;
    public uint nSamplesPerSec;
    public uint nAvgBytesPerSec;
    public ushort nBlockAlign;
    public ushort wBitsPerSample;
    public ushort cbSize;
}

WAVEFORMATEX GetCurrentWaveFormat(SpeechSynthesizer synthesizer)
{
    var voiceSynthesis = synthesizer.GetType()
                                    .GetProperty("VoiceSynthesizer", BindingFlags.Instance | BindingFlags.NonPublic)
                                    .GetValue(synthesizer, null);

    var ttsVoice = voiceSynthesis.GetType()
                                 .GetMethod("CurrentVoice", BindingFlags.Instance | BindingFlags.NonPublic)
                                 .Invoke(voiceSynthesis, new object[] { false });

    var waveFormat = (byte[])ttsVoice.GetType()
                                     .GetField("_waveFormat", BindingFlags.Instance | BindingFlags.NonPublic)
                                     .GetValue(ttsVoice);

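    // Pin the managed array so its address stays valid while it is marshalled,
    // and release the handle even if marshalling throws.
    var pin = GCHandle.Alloc(waveFormat, GCHandleType.Pinned);
    try
    {
        return (WAVEFORMATEX)Marshal.PtrToStructure(pin.AddrOfPinnedObject(), typeof(WAVEFORMATEX));
    }
    finally
    {
        pin.Free();
    }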
}

Usage:

SpeechSynthesizer s = new SpeechSynthesizer();
s.Speak("Hello");
var format = GetCurrentWaveFormat(s);
Debug.WriteLine($"{s.Voice.SupportedAudioFormats.Count} formats are claimed as supported.");
Debug.WriteLine($"Actual format: {format.nChannels} channel {format.nSamplesPerSec} Hz {format.wBitsPerSample} audio");

To test it, I renamed Microsoft Anna's AudioFormats registry key under HKLM/Software/Wow6432Node/Microsoft/Speech/Voices/Tokens/MS-Anna-1033-20-Dsk/Attributes, so that SpeechSynthesizer.Voice.SupportedAudioFormats had no elements when queried. Here is the output in that situation:

0 formats are claimed as supported.
Actual format: 1 channel 16000 Hz 16 audio
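
For reference, the byte array read from the private _waveFormat field is expected to follow the Win32 WAVEFORMATEX layout, so it could also be decoded without pinning. The following is only a sketch under assumptions: WaveFormatHeader and WaveFormatBytes are made-up names for this example, the sample bytes are hand-built to match the 16 kHz, 16-bit, mono output above rather than captured from a real voice, and BitConverter reads in machine byte order, which is little-endian on Windows.

```csharp
using System;

// Hypothetical holder for the fixed WAVEFORMATEX fields (names are illustrative).
public sealed class WaveFormatHeader
{
    public ushort FormatTag;
    public ushort Channels;
    public uint SamplesPerSec;
    public uint AvgBytesPerSec;
    public ushort BlockAlign;
    public ushort BitsPerSample;
    public ushort ExtraSize;
}

public static class WaveFormatBytes
{
    // The fixed part of WAVEFORMATEX is 18 bytes: 2 + 2 + 4 + 4 + 2 + 2 + 2.
    public const int HeaderSize = 18;

    // Decodes the fixed fields from a buffer laid out like WAVEFORMATEX.
    public static WaveFormatHeader Parse(byte[] data)
    {
        if (data == null || data.Length < HeaderSize)
            throw new ArgumentException("Expected at least 18 bytes of WAVEFORMATEX data.", nameof(data));

        return new WaveFormatHeader
        {
            FormatTag = BitConverter.ToUInt16(data, 0),
            Channels = BitConverter.ToUInt16(data, 2),
            SamplesPerSec = BitConverter.ToUInt32(data, 4),
            AvgBytesPerSec = BitConverter.ToUInt32(data, 8),
            BlockAlign = BitConverter.ToUInt16(data, 12),
            BitsPerSample = BitConverter.ToUInt16(data, 14),
            ExtraSize = BitConverter.ToUInt16(data, 16)
        };
    }

    // Hand-built sample (an assumption, not real engine output): PCM, mono, 16 kHz, 16-bit.
    public static byte[] Sample16kHzMono()
    {
        return new byte[]
        {
            0x01, 0x00,             // wFormatTag = 1 (WAVE_FORMAT_PCM)
            0x01, 0x00,             // nChannels = 1
            0x80, 0x3E, 0x00, 0x00, // nSamplesPerSec = 16000
            0x00, 0x7D, 0x00, 0x00, // nAvgBytesPerSec = 32000
            0x02, 0x00,             // nBlockAlign = 2
            0x10, 0x00,             // wBitsPerSample = 16
            0x00, 0x00              // cbSize = 0
        };
    }
}
```

Calling WaveFormatBytes.Parse(WaveFormatBytes.Sample16kHzMono()) yields the same values as the output above: 1 channel, 16000 samples per second, 16 bits per sample. The length check also guards against a buffer shorter than the header, which the pinned-pointer approach does not.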

score 0 · Accepted Answer

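You cannot get this information from code. You can only listen to all the formats (from poor ones like 8 kHz to high-quality ones like 48 kHz) and observe where it stops getting better, which I think is what you did.

Internally, the speech engine "asks" the voice for its original audio format only once. I believe this value is used only inside the speech engine, and the engine does not expose it in any way.

For further information:

Suppose you are a voice company. You have recorded your computer voice at 16 kHz, 16-bit, mono.

The user can have your voice speak at 48 kHz, 32-bit, stereo. The speech engine performs this conversion. The speech engine does not care whether it really sounds better; it just does the format conversion.

Suppose the user wants your voice to say something. He asks for the file to be saved as 48 kHz, 16-bit, stereo.

SAPI / System.Speech calls your voice with this method: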

STDMETHODIMP SpeechEngine::GetOutputFormat(const GUID * pTargetFormatId, const WAVEFORMATEX * pTargetWaveFormatEx,
GUID * pDesiredFormatId, WAVEFORMATEX ** ppCoMemDesiredWaveFormatEx)
{
    HRESULT hr = S_OK;

    // Here we return the format of the audio data that we pass to the speech engine.
    // The SAPI engine converts our format (16 kHz, 16 bit, mono) to the format the user requested.

    // Tell the speech engine which format the returned data has, so that it knows whether
    // to upsample or downsample your voice data to match the format the user requested.
    enum SPSTREAMFORMAT sample_rate_at_which_this_voice_was_recorded = SPSF_16kHz16BitMono;

    hr = SpConvertStreamFormatEnum(sample_rate_at_which_this_voice_was_recorded, pDesiredFormatId, ppCoMemDesiredWaveFormatEx);

    return hr;
}

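This is the only place where you have to "reveal" the recorded format of your voice.

All the "available formats" rather tell you which conversions your sound card / Windows can perform.

I hope I explained that clearly. As a voice vendor, you do not support any formats. You just tell the speech engine what format your audio data is in, so that it can do the further conversions.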
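
To put numbers on the scenario above, the byte rate of uncompressed PCM follows from the sample rate, bit depth, and channel count. This is only a small sketch: PcmMath and its method names are hypothetical helpers invented for this example, not part of SAPI or System.Speech.

```csharp
using System;

public static class PcmMath
{
    // Bytes per sample frame (one sample for every channel).
    public static int BlockAlign(int bitsPerSample, int channels)
    {
        return channels * bitsPerSample / 8;
    }

    // Bytes per second of uncompressed PCM audio.
    public static int AvgBytesPerSec(int samplesPerSec, int bitsPerSample, int channels)
    {
        return samplesPerSec * BlockAlign(bitsPerSample, channels);
    }

    // Playback length in seconds for a given number of audio bytes.
    public static double DurationSeconds(long byteCount, int samplesPerSec, int bitsPerSample, int channels)
    {
        return (double)byteCount / AvgBytesPerSec(samplesPerSec, bitsPerSample, channels);
    }
}
```

With these helpers, AvgBytesPerSec(16000, 16, 1) is 32000 for the recorded voice and AvgBytesPerSec(48000, 16, 2) is 192000 for the requested file, so the converted output is six times larger for the same utterance even though it carries no extra detail.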
