c# - 如何比较 2 个字符串数组并找到所有连续匹配项并保存索引？

Question

例如，如果我有以下 2 个数组：

string[] userSelect = new string[] {"the", "quick", "brown", "dog", "jumps", "over"};
string[] original = new string[] {"the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"};

我正在尝试将 userSelect 数组与原始数组进行比较，并根据索引获取所有连续匹配。userSelect 数组将始终由原始数组中的字符串组成。所以输出将如下所示：

int[] match0 = new int[] {0, 1, 2}; // indices for "the quick brown"
int[] match2 = new int[] {4, 5}; // indices for "jumps over"
int[] match1 = new int[] {3}; // index for "dog"

userSelect 数组长度永远不会超过原始数组长度，但是它可以更短并且单词可以是任何顺序。我该怎么做呢？

score 2 · Accepted Answer

这是我想出的

var matches = 
    (from l in userSelect.Select((s, i) => new { s, i })
     join r in original.Select((s, i) => new { s, i }) 
     on l.s equals r.s 
     group l by r.i - l.i into g
     from m in g.Select((l, j) => new { l.i, j = l.i - j, k = g.Key })
     group m by new { m.j, m.k } into h
     select h.Select(t => t.i).ToArray())
    .ToArray();

这将输出

matches[0] // { 0, 1, 2 } the quick brown
matches[1] // { 4, 5 } jumps over
matches[2] // { 0 } the 
matches[3] // { 3 } dog

使用输入{"the", "quick", "brown", "the", "lazy", "dog"}产生：

matches[0] // { 0, 1, 2 } the quick brown
matches[1] // { 0 } the 
matches[2] // { 3 } the
matches[3] // { 3, 4, 5 } the lazy dog

请注意，调用ToArray是可选的。如果您实际上不需要数组中的结果，则可以将其省略并节省一点处理时间。

要过滤掉完全包含在其他较大序列中的任何序列，您可以运行此代码（注意orderby修改后的查询中的）：

var matches = 
    (from l in userSelect.Select((s, i) => new { s, i })
     join r in original.Select((s, i) => new { s, i }) 
     on l.s equals r.s 
     group l by r.i - l.i into g
     from m in g.Select((l, j) => new { l.i, j = l.i - j, k = g.Key })
     group m by new { m.j, m.k } into h
     orderby h.Count() descending
     select h.Select(t => t.i).ToArray());

int take = 0;
var filtered = matches.Where(m => !matches.Take(take++)
                                          .Any(n => m.All(i => n.Contains(i))))
    .ToArray();

score 2 · Accepted Answer

如果单词不能重复，这会更容易。. .

Dictionary<string, List<int>>一般的想法是从原始单词列表中创建一个。这将告诉您在哪些位置使用了哪些单词。您的样本字典将是：

key="the", value={0, 6}
key="quick", value={1}
key="brown", value={2}
... etc

现在，当您获得用户的输入时，您按顺序逐步执行，在字典中查找单词以获取位置列表。

所以你查了一个词，它就在字典里。您保存从字典返回的位置。查找下一个单词。您需要处理三个条件：

字典里没有这个词。保存您之前的连续分组并转到下一个单词，您可能会在其中开始一个新组。
该词在字典中，但没有返回的位置与预期位置匹配（预期位置比最后一个单词中保存的位置多一个）。保存上一个连续的组并转到下一个单词，您可能会开始一个新组。
该词在字典中，并且返回的位置之一与预期位置匹配。保存这些位置并转到下一个单词。

我希望你能明白。

score 1 · Accepted Answer

这并不完全符合您的要求，但它是一种非常干净且简单的方法，可以获取包含所有常见字符串的新数组（即取两个数组的交集）。

var results = array1.Intersect(array2, StringComparer.OrdinalIgnoreCase);

执行后，resutls数组将包含出现在array1和中的每个字符串（忽略大小写） array2。

如果您需要一些理论知识，则 intersect 方法基于您在 lambda 演算中对集合执行的交集运算。C# 中的集合实现了所有常见的集合操作，因此值得对它们有所了解。这是维基文章的链接；http://en.wikipedia.org/wiki/Intersection_(set_theory)

score 1 · Accepted Answer

这不是很优雅但很有效。当涉及到索引时，Linq 使其通常比简单的循环更复杂且效率更低。

string[] userSelect = new string[] { "the", "quick", "brown", "dog", "jumps", "over" };
string[] original = new string[] { "the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog" };
var consecutiveGroups = new Dictionary<int, IList<string>>();
IList<Tuple<int, string>> uniques = new List<Tuple<int, string>>();

int maxIndex = Math.Min(userSelect.Length, original.Length);
if (maxIndex > 0)
{
    int minIndex = 0;
    int lastMatch = int.MinValue;
    for (int i = 0; i < maxIndex; i++)
    {
        var us = userSelect[i];
        var o = original[i];
        if (us == o)
        {
            if (lastMatch == i - 1)
                consecutiveGroups[minIndex].Add(us);
            else
            {
                minIndex = i;
                consecutiveGroups.Add(minIndex, new List<string>() { us });
            }
            lastMatch = i;
        }
        else
            uniques.Add(Tuple.Create(i, us));
    }
}

输出连续组的索引 + 唯一值的索引：

var consecutiveGroupsIndices = consecutiveGroups
    .OrderByDescending(kv => kv.Value.Count)
    .Select(kv => Enumerable.Range(kv.Key, kv.Value.Count).ToArray()
    .ToArray());
foreach(var consIndexGroup in consecutiveGroupsIndices)
    Console.WriteLine(string.Join(",", consIndexGroup));
Console.WriteLine(string.Join(",", uniques.Select(u => u.Item1)));

score 0 · Accepted Answer

使用 LINQ 增加乐趣

经过几次尝试后，我想出了一个理论上可以是单线的纯 LINQ 解决方案。我确实试图让它变得高效，但当然功能解决方案会导致重复计算，因为你不能保持状态。

我们从一些预处理开始，以节省以后的重复计算。是的，我知道我对 index 所做的做法是有问题的，但如果你小心的话，它会起作用并且很快就会到达那里：

var index = 0;
var lookup = original.ToLookup(s => s, s => index++);

怪物

var occurrences = userSelect
  .Where(lookup.Contains)
  .SelectMany((s, i) => lookup[s]
    .Select(j => new {
      User = userSelect.Skip(i),
      Original = original.Skip(j),
      Skipped = i
    })
    .Select(t => t.User.Zip(t.Original, (u, v) => Tuple.Create(u, v, t.Skipped))
                       .TakeWhile(tuple => tuple.Item1 == tuple.Item2)
    )
    .Select(u => new { 
      Word = s, 
      Start = u.Select(v => v.Item3).Min(), 
      Length = u.Count()
    })
  )
  .GroupBy(v => v.Start + v.Length)
  .Select(g => g.OrderBy(u => u.Start).First())
  .GroupBy(v => v.Word)
  .Select(g => g.OrderByDescending(u => u.Length).First())
  .Select(w => Enumerable.Range(w.Start, w.Length).ToArray())
  .ToList();

打印这个

foreach (var occurrence in occurrences) {
  Console.WriteLine(
    "Maximal match starting with '{0}': [{1}]",
    userSelect[occurrence[0]],
    string.Join(", ", occurrence)
  );
}

给

Maximal match starting with 'the': [0, 1, 2]
Maximal match starting with 'dog': [3]
Maximal match starting with 'jumps': [4, 5]

很明显，您不想在生产中使用此代码，到目前为止，另一种（程序）解决方案会更好。但是，此解决方案的区别在于，除了lookup. 当然也可以写成函数式：

var lookup = original.Select((s, i) => Tuple.Create)
                     .ToLookup(t => t.Item1, t => t.Item2);

这个怎么运作

热身，它创建了一个类似字典的结构，将每个单词关联original到它出现在同一集合中的索引。这将在以后用于从 in 中的每个单词创建尽可能多的匹配序列userSelect（例如，“the”将导致它成为两个匹配序列，因为它在中出现两次original）。

然后：

.Where(lookup.Contains)

这很容易，它会从考虑中删除所有userSelect未出现在中的单词original。

 // For each place where the word s appears in original...
.SelectMany((s, i) => lookup[s]
  // Define the two subsequences of userSelect and original to work on.
  // We are trying to find the number of identical elements until first mismatch.
  .Select(j => new { User = userSelect.Skip(i), Original = original.Skip(j), Skipped = j })

  // Use .Zip to find this subsequence
  .Select(t => t.User.Zip(t.Original, (u, v) => Tuple.Create(u, v, t.Skipped)).TakeWhile(tuple => tuple.Item1 == tuple.Item2))

  // Note the index in original where the subsequence started and its length
  .Select(u => new { Word = s, Start = u.Select(v => v.Item3).Min(), Length = u.Count() })
)

在这一点上，我们已经将每个匹配的单词投影到一个具有和属性userSelect的匿名对象中。但是，匹配长度为 N 的序列也将导致长度为 N-1、N-2、... 1 的较小匹配序列。StartLength

这里的关键是要意识到对于这些集合中的所有子序列Start + Length都是相同的；此外，来自不同集合的子序列将具有不同的总和Start + Length。因此，让我们利用缩减结果：

// Obvious from the above
.GroupBy(v => v.Start + v.Length)

// We want to keep the longest subsequence. Since Start + Length is constant for
// all, it follows the one with the largest Length has the smallest Start:
.Select(g => g.OrderBy(u => u.Start).First())

这仍然会给我们留下与每个单词userSelect出现的次数一样多的匹配项original。所以让我们也把它减少到最长的匹配：

.GroupBy(v => v.Word)
.Select(g => g.OrderByDescending(u => u.Length).First())

我们现在有一个像{ Word = "the", Start = 0, Length = 3 }. 让我们将其转换为索引数组userSelect：

.Select(w => Enumerable.Range(w.Start, w.Length).ToArray())

最后将所有这些阵列放在同一个集合中，任务完成！

c# - 如何比较 2 个字符串数组并找到所有连续匹配项并保存索引？

5 回答 5

使用 LINQ 增加乐趣

怪物

这个怎么运作

Related

Reference