c# - 如何在 C# 中实现 Viterbi 算法来拆分连词？

Question

简而言之 - 我想将这里问题的第一个答案从 Python 转换为 C#。我目前拆分连词的解决方案是指数级的，我想要一个线性解决方案。我假设我的输入文本中没有间距和一致的大小写。

背景

我希望将诸如“wickedweather”之类的连体字符串转换为单独的单词，例如使用 C# 的“wicked weather”。我创建了一个可行的解决方案，一个使用指数时间的递归函数，这对于我的目的来说根本不够有效（处理至少超过 100 个连接词）。这里是我到目前为止阅读的问题，我认为这可能会有所帮助，但我无法将他们的回答从 Python 转换为 C#。

我当前的递归解决方案

这适用于只想在 C# 中拆分几个单词（< 50）并且并不真正关心效率的人。

我当前的解决方案计算出所有可能的单词组合，找到最可能的输出和显示。我目前将最可能的输出定义为使用最长单个单词的输出 - 我更愿意使用不同的方法。这是我当前的解决方案，使用递归算法。

static public string find_words(string instring)
    {
        if (words.Contains(instring)) //where words is my dictionary of words
        {
            return instring;
        }
        if (solutions.ContainsKey(instring.ToString()))
        {
            return solutions[instring];
        }

        string bestSolution = "";
        string solution = "";

        for (int i = 1; i < instring.Length; i++)
        {
            string partOne = find_words(instring.Substring(0, i));
            string partTwo = find_words(instring.Substring(i, instring.Length - i));
            if (partOne == "" || partTwo == "")
            {
                continue;
            }
            solution = partOne + " " + partTwo;
            //if my current solution is smaller than my best solution so far (smaller solution means I have used the space to separate words fewer times, meaning the words are larger)
            if (bestSolution == "" || solution.Length < bestSolution.Length) 
            {
                bestSolution = solution;
            }
        }
        solutions[instring] = bestSolution;
        return bestSolution;
    }

该算法依赖于条目文本中没有空格或其他符号（这里不是真正的问题，我不关心拆分标点符号）。在字符串中添加的随机附加字母可能会导致错误，除非我将字母表中的每个字母都存储为字典中的“单词”。这意味着“wickedweatherdykjs”将使用上述算法返回“wicked weather dykj s”，而我更喜欢“wicked weather dykjs”的输出。

我更新的指数解决方案：

    static List<string> words = File.ReadLines("E:\\words.txt").ToList(); 
    static Dictionary<char, HashSet<string>> compiledWords = buildDictionary(words);

    private void btnAutoSpacing_Click(object sender, EventArgs e)
    {
        string text = txtText.Text;
        text = RemoveSpacingandNewLines(text); //get rid of anything that breaks the algorithm
        if (text.Length > 150)
        {
            //possibly split the text up into more manageable chunks?
            //considering using textSplit() for this.
        }
        else 
        {
            txtText.Text = find_words(text);
        }
    }

    static IEnumerable<string> textSplit(string str, int chunkSize)
    {
        return Enumerable.Range(0, str.Length / chunkSize)
            .Select(i => str.Substring(i * chunkSize, chunkSize));
    }

    private static Dictionary<char, HashSet<string>> buildDictionary(IEnumerable<string> words)
    {
        var dictionary = new Dictionary<char, HashSet<string>>();

        foreach (var word in words)
        {
            var key = word[0];

            if (!dictionary.ContainsKey(key))
            {
                dictionary[key] = new HashSet<string>();
            }

            dictionary[key].Add(word);
        }

        return dictionary;
    }

    static public string find_words(string instring)
    {
        string bestSolution = "";
        string solution = "";

        if (compiledWords[instring[0]].Contains(instring))
        {
            return instring;
        }

        if (solutions.ContainsKey(instring.ToString()))
        {
            return solutions[instring];
        }

        for (int i = 1; i < instring.Length; i++)
        {
            string partOne = find_words(instring.Substring(0, i));
            string partTwo = find_words(instring.Substring(i, instring.Length - i));
            if (partOne == "" || partTwo == "")
            {
                continue;
            }
            solution = partOne + " " + partTwo;
            if (bestSolution == "" || solution.Length < bestSolution.Length)
            {
                bestSolution = solution;
            }
        }
        solutions[instring] = bestSolution;
        return bestSolution;
    }

我想如何使用维特比算法

我想创建一个算法，它可以为联合字符串找出最可能的解决方案，其中概率是根据我提供算法的文本文件中单词的位置计算的。假设该文件首先以英语中最常用的单词开头，下一行是第二常用的单词，依此类推，直到我的字典中最不常用的单词。大致是这样的

这
是
和
...
律师

这是我想使用的此类文本文件的一个小示例的链接。这是我想使用的更大的文本文件

这个文件定位背后的逻辑如下……

可以合理地假设它们遵循 Zipf 定律，即单词列表中排名为 n 的单词的概率大约为 1/(n log N)，其中 N 是字典中的单词数。

Generic Human，在他出色的 Python 解决方案中，比我能更好地解释这一点。我想将他对问题的解决方案从 Python 转换为 C#，但是经过数小时的尝试后，我无法产生一个可行的解决方案。我也对这样的想法持开放态度，即维特比算法的相对频率可能不是拆分单词的最佳方法，还有其他关于使用 C# 创建解决方案的建议吗？

score 2 · Accepted Answer

维特比算法无法为您提供帮助，但我会就您当前的方法给出两分钱。从您的代码中，它并不完全清楚是什么words。如果您不选择好的数据结构，这可能是一个真正的瓶颈。作为一种直觉，我最初会选择Dictionary<char, HashSet<string>>关键是每个单词的第一个字母：

private static Dictionary<char, HashSet<string>> buildDictionary(IEnumerable<string> words)
{
    var dictionary = new Dictionary<char, HashSet<string>>();

    foreach (var word in words)
    {
        var key = word[0];

        if (!dictionary.ContainsKey(key))
        {
            dictionary[key] = new HashSet<string>();
        }

        dictionary[key].Add(word);
    }

    return dictionary;
}

而且我还考虑将其序列化到磁盘以避免每次都建立它。

不确定您可以像这样做出多少改进（没有您当前实施的完整信息），但对其进行基准测试，看看您是否得到任何改进。

注意：我假设所有单词的大小写一致。

score 2 · Accepted Answer

书面文本是高度上下文相关的，您可能希望使用马尔可夫链来模拟句子结构以估计联合概率。不幸的是，句子结构打破了维特比假设——但仍然有希望，维特比算法是分支定界优化的一个例子，也就是“剪枝动态规划”（我在我的论文中展示的东西），因此即使成本 -不满足拼接假设，您仍然可以制定成本界限并修剪您的候选解决方案群体。但是，让我们暂时将马尔可夫链放在一边……假设概率是独立的并且每个概率都遵循 Zipf 定律，那么您需要知道的是，维特比算法适用于累积附加成本。

对于独立事件，联合概率是个体概率的乘积，因此负对数概率是成本的不错选择。

因此，您的单步成本将是-log(P)或log(1/P)哪个是log(index * log(N))哪个log(index) + log(log(N))，后一项是常数。

c# - 如何在 C# 中实现 Viterbi 算法来拆分连词？

背景

我当前的递归解决方案

我更新的指数解决方案：

我想如何使用维特比算法

2 回答 2

Related

Reference