c# - 来自字符串的英语词典单词匹配

Question

我试图解决一个问题，即从字典文件中识别英语单词到给定字符串的最佳匹配。

例如（“lines”是字典单词列表）：

string testStr = "cakeday";

for (int x= 0; x<= testStr.Length; x++)
{
  string test = testStr.Substring(x);

   if (test.Length > 0)
   {
      string test2 = testStr.Remove(counter);
      int count = (from w in lines where w.Equals(test) || w.Equals(test2) select w).Count();
      Console.WriteLine("Test: {0} / {1} : {2}", test, test2, count);
    }
}

给出输出：

Test: cakeday /   : 0
Test: akeday / c  : 1
Test: keday / ca  : 0
Test: eday / cak  : 0
Test: day / cake  : 2
Test: ay / caked  : 1
Test: y / cakeda  : 1

显然“day / cake”最适合该字符串，但是如果我要在字符串中引入第三个单词，例如“cakedaynow”，它就不会那么好用。

我知道这个例子是原始的，它更像是一个概念证明，并且想知道是否有人对这种类型的字符串分析有任何经验？

谢谢！

score 2 · Accepted Answer

你会想要研究适合你想要做的事情的算法类。从Wikipedia 上的近似字符串匹配开始。

此外，这是一个用 C# 编写的 Levenshtein Edit Distance 实现，可以帮助您入门：

using System;

namespace StringMatching
{
    /// <summary>
    /// A class to extend the string type with a method to get Levenshtein Edit Distance.
    /// </summary>
    public static class LevenshteinDistanceStringExtension
    {
        /// <summary>
        /// Get the Levenshtein Edit Distance.
        /// </summary>
        /// <param name="strA">The current string.</param>
        /// <param name="strB">The string to determine the distance from.</param>
        /// <returns>The Levenshtein Edit Distance.</returns>
        public static int GetLevenshteinDistance(this string strA, string strB)
        {
            if (string.IsNullOrEmpty(strA) && string.IsNullOrEmpty(strB))
                return 0;

            if (string.IsNullOrEmpty(strA))
                return strB.Length;

            if (string.IsNullOrEmpty(strB))
                return strA.Length;

            int[,] deltas; // matrix
            int lengthA;
            int lengthB;
            int indexA;
            int indexB;
            char charA;
            char charB;
            int cost; // cost

            // Step 1
            lengthA = strA.Length;
            lengthB = strB.Length;

            deltas = new int[lengthA + 1, lengthB + 1];

            // Step 2
            for (indexA = 0; indexA <= lengthA; indexA++)
            {
                deltas[indexA, 0] = indexA;
            }

            for (indexB = 0; indexB <= lengthB; indexB++)
            {
                deltas[0, indexB] = indexB;
            }

            // Step 3
            for (indexA = 1; indexA <= lengthA; indexA++)
            {
                charA = strA[indexA - 1];

                // Step 4
                for (indexB = 1; indexB <= lengthB; indexB++)
                {
                    charB = strB[indexB - 1];

                    // Step 5
                    if (charA == charB)
                    {
                        cost = 0;
                    }
                    else
                    {
                        cost = 1;
                    }

                    // Step 6
                    deltas[indexA, indexB] = Math.Min(deltas[indexA - 1, indexB] + 1, Math.Min(deltas[indexA, indexB - 1] + 1, deltas[indexA - 1, indexB - 1] + cost));
                }
            }

            // Step 7
            return deltas[lengthA, lengthB];
        }
    }
}

score 1 · Accepted Answer

为什么不：

检查搜索词中的所有字符串，从当前搜索位置提取到字符串的所有可能长度，并提取所有发现的词。例如：

var list = new List<string>{"the", "me", "cat", "at", "theme"};
const string testStr = "themecat";
var words = new List<string>();
var len = testStr.Length;
for (int x = 0; x < len; x++)
{
    for(int i = (len - 1); i > x; i--)
    {
        string test = testStr.Substring(x, i - x + 1);
        if (list.Contains(test) && !words.Contains(test))
        {
            words.Add(test);
        }
    }
}

words.ForEach(n=> Console.WriteLine("{0}, ",n));//spit out current values

输出：

主题，我，猫，在

编辑

Live Scenario 1： 例如，假设您想始终在杂乱的句子中选择最长的单词，您可以从前向前阅读，从而减少阅读的文本量，直到读完为止。使用字典更容易，通过存储发现单词的索引，我们可以快速检查我们是否存储了一个包含我们之前正在评估的另一个单词的单词。

例子：

var list = new List<string>{"the", "me", "cat", "at", "theme", "crying", "them"};
const string testStr = "themecatcryingthem";
var words = new Dictionary<int, string>();
var len = testStr.Length;
for (int x = 0; x < len; x++)
{
    int n = len > 28 ? 28 : len;//assuming 28 is the maximum length of an english word
    for(int i = (n - 1); i > x; i--)
    {
        string test = testStr.Substring(x, i - x + 1);
        if (list.Contains(test))
        {
            if (!words.ContainsValue(test))
            {
                bool found = false;//to check if there's a shorter item starting from same index
                var key = testStr.IndexOf(test, x, len - x);
                foreach (var w in words)
                {
                    if (w.Value.Contains(test) && w.Key != key && key == (w.Key + w.Value.Length - test.Length))
                    {
                        found = true;
                    }
                }
                if (!found && !words.ContainsKey(key)) words.Add(key, test);
            }
        }
    }
}

words.Values.ToList().ForEach(n=> Console.WriteLine("{0}, ",n));//spit out current values

输出：

主题，猫，哭，他们

c# - 来自字符串的英语词典单词匹配

2 回答 2

编辑

Related

Reference