c# - Finding pronunciation correctness

Question

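I need to use the Microsoft speech SDK (System.Speech.Recognition) to judge the "quality" of a user's pronunciation. I am using the MS Speech Engine - US, so what I really need is to find out how close the speaker's voice is to a "North American" accent.

One approach is to check how close the user's speech is to the American English phonetic pronunciation. As mentioned on MSDN, the speech SDK seems to do this internally, so I need to get that information out of it. Since we can also set pronunciations for the engine ourselves, I believe this is possible.

However, I am not sure what I have to do. So, how can I find out the user's pronunciation quality, that is, how close it is to the North American English phonetic pronunciation? The user will only speak predefined sentences, such as "Hello World. I am here".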

Update

Using the following code, I got some kind of "phonemes" (as described on MSDN):

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Speech.Recognition;
using System.Speech.Synthesis;
using System.Windows.Forms;
using System.IO;

namespace US_Speech_Recognizer
{
    public class RecognizeSpeech
    {
        private SpeechRecognitionEngine sEngine; //Speech recognition engine
        private SpeechSynthesizer sSpeak; //Speech synthesizer
        string text3 = "";

        public RecognizeSpeech()
        {
            //Make the recognizer ready
            sEngine = new SpeechRecognitionEngine(new System.Globalization.CultureInfo("en-US"));


            //Load grammar
            Choices sentences = new Choices();
            sentences.Add(new string[] { "I am hungry" });

            GrammarBuilder gBuilder = new GrammarBuilder(sentences);

            Grammar g = new Grammar(gBuilder);

            sEngine.LoadGrammar(g);

            //Add a handler
            sEngine.SpeechRecognized +=new EventHandler<SpeechRecognizedEventArgs>(sEngine_SpeechRecognized);


            sSpeak = new SpeechSynthesizer();
            sSpeak.Rate = -2;



            //Computer speaks the words to get the phones
            Stream stream = new MemoryStream();
            sSpeak.SetOutputToWaveStream(stream);


            sSpeak.Speak("I was hungry");
            stream.Position = 0;
            sSpeak.SetOutputToNull();


            //Configure the recognizer to stream
            sEngine.SetInputToWaveStream(stream);

            sEngine.RecognizeAsync(RecognizeMode.Single);


        }


        //Start the speech recognition task
        private void sEngine_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
        {
            string text = "";

            if (e.Result.Text == "I am hungry")
            {
                foreach (RecognizedWordUnit wordUnit in e.Result.Words)
                {
                    text = text + wordUnit.Pronunciation + "\n";
                }

                MessageBox.Show(e.Result.Text + "\n" + text);
            }


        }
    }
}

This is the snippet directly related to the phonemes (taken from the code above):

   //Start the speech recognition task
    private void sEngine_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
    {
        string text = "";

        if (e.Result.Text == "I am hungry")
        {
            foreach (RecognizedWordUnit wordUnit in e.Result.Words)
            {
                text = text + wordUnit.Pronunciation + "\n";
            }

            MessageBox.Show(e.Result.Text + "\n" + text);
        }


    }

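Below is my output. The first line simply shows the recognized sentence; the phonemes I got are shown from the second line onward.

[screenshot of the output message box]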

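So, according to MSDN, these are "phonemes". Are they actually phonemes? I ask because I have never seen anything like them before.

The code above is based on this link: http://msdn.microsoft.com/en-us/library/microsoft.speech.recognition.srgsgrammar.srgstoken.pronunciation(v=office.14).aspx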

score 4 · Accepted Answer

OK, here is how I would approach the problem.

First, load the dictation engine with the Pronunciation topic; it will return the phonemes the user spoke (in the Recognition event).
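A minimal sketch of this first step from managed code, assuming the installed en-US desktop recognizer accepts the `grammar:dictation#pronunciation` topic for `DictationGrammar` and that audio comes from the default microphone. The class name `PhonemeListener` is hypothetical. Whether the phonemes show up in `Result.Text`, in each word's `Pronunciation`, or in both may depend on the engine, so the sketch prints both.

```csharp
using System;
using System.Globalization;
using System.Speech.Recognition;

// Hypothetical helper: listens for one utterance and prints the phonemes reported.
public static class PhonemeListener
{
    public static void Main()
    {
        using (var engine = new SpeechRecognitionEngine(new CultureInfo("en-US")))
        {
            // Dictation grammar restricted to the pronunciation topic.
            engine.LoadGrammar(new DictationGrammar("grammar:dictation#pronunciation"));
            engine.SetInputToDefaultAudioDevice();

            // Synchronous recognition; returns null if nothing was recognized
            // before the initial-silence timeout elapsed.
            RecognitionResult result = engine.Recognize(TimeSpan.FromSeconds(10));
            if (result == null)
            {
                Console.WriteLine("Nothing recognized.");
                return;
            }

            Console.WriteLine("Text: " + result.Text);
            foreach (RecognizedWordUnit word in result.Words)
            {
                Console.WriteLine("Pronunciation: " + word.Pronunciation);
            }
            Console.WriteLine("Confidence: " + result.Confidence);
        }
    }
}
```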

Second, get the reference phonemes for the word using the ISpEnginePronunciation::GetPronunciations method (as I outlined here).

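Once you have both sets of phonemes, you can compare them. Essentially, the phonemes are separated by spaces, and each phoneme is represented by a short label (described in the American English Phoneme Representation specification).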

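Given that, you should be able to compute a score by comparing the phonemes with any of a number of approximate string matching schemes (e.g. Levenshtein distance).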
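As a rough illustration of the comparison, here is a sketch that runs Levenshtein distance over whole phoneme labels (not characters) and normalizes the result to a 0..1 similarity. The class `PhonemeScore`, its method names, and the normalization by the longer sequence are assumptions made for this example; they are not part of the speech SDK.

```csharp
using System;

// Hypothetical helper: scores how close two space-separated phoneme strings are.
public static class PhonemeScore
{
    // Classic Levenshtein edit distance, computed over phoneme labels
    // rather than over individual characters.
    public static int Distance(string[] a, string[] b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        return d[a.Length, b.Length];
    }

    // Normalizes the distance to a similarity in [0, 1]; 1 means identical.
    public static double Similarity(string spoken, string reference)
    {
        string[] s = Split(spoken);
        string[] r = Split(reference);
        int longest = Math.Max(s.Length, r.Length);
        if (longest == 0) return 1.0;
        return 1.0 - (double)Distance(s, r) / longest;
    }

    private static string[] Split(string phonemes)
    {
        return phonemes.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

For example, `PhonemeScore.Similarity("h eh l ow", "h ax l ow")` differs by one substitution out of four labels and yields 0.75 (the labels here are only illustrative input).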

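You may find the problem simpler if you compare phone IDs rather than strings; ISpPhoneConverter::PhoneToId can convert a phoneme string into an array of phone IDs, one ID per phoneme. That gives you a pair of null-terminated integer arrays, which may be better suited to your comparison algorithm.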

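You can use the engine confidence to penalize matches, since a low engine confidence indicates that the incoming audio does not closely match the engine's idea of what the phoneme should sound like.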
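One simple way to apply such a penalty, sketched under the assumption that you already have a phoneme similarity in the range 0..1 and a confidence in the same range (for instance `RecognitionResult.Confidence` in System.Speech). The class `ConfidencePenalty` is hypothetical, and multiplying the two values is just one possible choice, not something the SDK prescribes.

```csharp
using System;

// Hypothetical helper: combines a phoneme similarity with the engine's confidence.
public static class ConfidencePenalty
{
    // similarity: phoneme match in [0, 1].
    // confidence: engine confidence in [0, 1].
    // Both inputs are clamped so out-of-range values cannot inflate the score.
    public static double Apply(double similarity, double confidence)
    {
        double s = Math.Max(0.0, Math.Min(1.0, similarity));
        double c = Math.Max(0.0, Math.Min(1.0, confidence));
        return s * c;
    }
}
```

With this scheme a perfect phoneme match that the engine was only half sure about scores 0.5, the same as a half-matching utterance recognized with full confidence; a weighted combination may suit you better if that is too harsh.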
