C# - Algorithm for finding matching strings

Question

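I have 5 very long strings (the number of strings may change). These strings have no fixed format. I will supply a number that indicates the length of the substring, and I want to find the matching substrings of that length. For example, the strings are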

      1.     abcabcabc
      2.     abcasdfklop

Substring length: 3

Given these values, the output would look like this:

Match #1:

 Matched string:                 "abc"

 Matches in first string:        3

 Matching positions:             0,3,6

 Matches in second string:       1

 Matching positions:             0

Match #2:

 Matched string:                 "bca"

 Matches in first string:        2

 Matching positions:             1,4

 Matches in second string:       1

 Matching positions:             1

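I managed to do it with 4 foreach statements, but that seems too inefficient to me, especially if the input is very large. Are there any suggestions or shortcuts for handling this more efficiently in C#? It doesn't need to be real code; pseudocode would help too. Thanks in advance.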

score 3 · Accepted Answer

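You can do this with a suffix array. (Suffix trees would work too, but they need more space, more time, and more care in the implementation.)

Concatenate your two strings, separated by a character that occurs in neither of them. Then build a suffix array. After that you can read off your answer.

A standard suffix array gives you a lexicographically sorted array of pointers to the suffixes of the string, together with a "longest common prefix length" (LCP) array that tells you how long the longest common prefix of two lexicographically consecutive suffixes is.

Using the LCP array to get the information you want is fairly straightforward: find all maximal subarrays of the LCP array in which every LCP length is at least the query length. Then, for each such subarray that has matches in both the first string and the second string, report the corresponding prefix and report that it occurs K+1 times, where K is the length of the maximal subarray.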
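The suffix-array steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a tuned implementation: the suffix array is built by naively sorting suffix start indices (far slower than a proper construction algorithm on large inputs), the separator '\u0001' is assumed to occur in neither input, the query length is assumed to be at least 1, and the names `SuffixArrayMatcher` and `FindMatches` are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SuffixArrayMatcher
{
    // For every substring of length len that occurs in both a and b,
    // returns its start positions in a and in b (each list ascending).
    public static Dictionary<string, (List<int> InA, List<int> InB)> FindMatches(
        string a, string b, int len)
    {
        const char separator = '\u0001'; // assumption: occurs in neither input
        string s = a + separator + b;
        int n = s.Length;

        // Naive suffix array: suffix start indices sorted by ordinal comparison.
        int[] sa = Enumerable.Range(0, n).ToArray();
        Array.Sort(sa, (x, y) => string.CompareOrdinal(s, x, s, y, n));

        // lcp[i] = common prefix length of suffixes sa[i-1] and sa[i], capped at len.
        int[] lcp = new int[n];
        for (int i = 1; i < n; i++)
        {
            int x = sa[i - 1], y = sa[i], k = 0;
            while (k < len && x + k < n && y + k < n && s[x + k] == s[y + k]) k++;
            lcp[i] = k;
        }

        var result = new Dictionary<string, (List<int> InA, List<int> InB)>();
        int start = 0;
        for (int i = 1; i <= n; i++)
        {
            // The run [start, i) continues while the LCP stays >= len.
            if (i < n && lcp[i] >= len) continue;

            var inA = new List<int>();
            var inB = new List<int>();
            for (int j = start; j < i; j++)
            {
                int p = sa[j];
                if (p + len <= a.Length) inA.Add(p);              // fully inside a
                else if (p > a.Length && p + len <= n) inB.Add(p - a.Length - 1); // inside b
            }
            if (inA.Count > 0 && inB.Count > 0)
            {
                inA.Sort();
                inB.Sort();
                result[s.Substring(sa[start], len)] = (inA, inB);
            }
            start = i;
        }
        return result;
    }
}
```

Called with the strings from the question and length 3, this returns two entries: "abc" with positions 0,3,6 and 0, and "bca" with positions 1,4 and 1.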

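Another approach, which is easier to code, is to hash all substrings of the appropriate length. You can do this easily with any rolling hash function. For each hash, store a dynamic array of pointers into the strings; after hashing all the strings, go through every hash that occurred and look for matches. You need to handle false positives somehow. One (probabilistic) way is to use multiple hash functions until the false-positive probability is acceptably small. Another way, probably acceptable only when there are few matches, is to compare the strings directly.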
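A rough sketch of the hashing approach might look like this. It assumes a simple polynomial rolling hash with wrap-around 64-bit arithmetic and base 257 (both illustrative choices), a query length of at least 1, and it handles false positives with the second option mentioned above: substrings are compared directly, but only inside hash buckets that contain positions from every input. The names `RollingHashMatcher` and `FindMatches` are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class RollingHashMatcher
{
    const ulong Base = 257; // illustrative; arithmetic wraps modulo 2^64

    // Maps each substring of length len that occurs in every input
    // to its start positions, one list per input (ascending).
    public static Dictionary<string, List<int>[]> FindMatches(string[] inputs, int len)
    {
        // Base^(len-1), used to remove the outgoing character when rolling.
        ulong top = 1;
        for (int i = 1; i < len; i++) top = unchecked(top * Base);

        // hash -> every (input index, position) that produced it
        var buckets = new Dictionary<ulong, List<(int Src, int Pos)>>();
        for (int src = 0; src < inputs.Length; src++)
        {
            string s = inputs[src];
            if (s.Length < len) continue;
            ulong h = 0;
            for (int i = 0; i < len; i++) h = unchecked(h * Base + s[i]);
            for (int pos = 0; ; pos++)
            {
                if (!buckets.TryGetValue(h, out var list))
                    buckets[h] = list = new List<(int Src, int Pos)>();
                list.Add((src, pos));
                if (pos + len >= s.Length) break;
                // Roll the window: drop s[pos], append s[pos + len].
                h = unchecked((h - s[pos] * top) * Base + s[pos + len]);
            }
        }

        var result = new Dictionary<string, List<int>[]>();
        foreach (var bucket in buckets.Values)
        {
            // Cheap filter: skip hashes that did not occur in every input.
            var seen = new bool[inputs.Length];
            int distinct = 0;
            foreach (var e in bucket)
                if (!seen[e.Src]) { seen[e.Src] = true; distinct++; }
            if (distinct < inputs.Length) continue;

            // Direct comparison: group by the actual text to rule out collisions.
            foreach (var (src, pos) in bucket)
            {
                string key = inputs[src].Substring(pos, len);
                if (!result.TryGetValue(key, out var perInput))
                {
                    perInput = new List<int>[inputs.Length];
                    for (int k = 0; k < perInput.Length; k++) perInput[k] = new List<int>();
                    result[key] = perInput;
                }
                perInput[src].Add(pos);
            }
        }
        // A collision could leave a text that is missing from some input; drop it.
        foreach (var key in new List<string>(result.Keys))
            if (Array.Exists(result[key], l => l.Count == 0)) result.Remove(key);
        return result;
    }
}
```

Because it takes an array of inputs, the same sketch covers the case where the number of strings changes; substrings are only materialized for candidate buckets, so most positions never allocate a string.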

score 2 · Accepted Answer

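If you managed to do this in 4 non-nested foreach statements, you should be fine and probably don't need to optimize.

Here is what I would try. Create a structure that looks like this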

class SubString
{
    public string str;    // the substring text
    public int position;  // start index of the substring in its source string
}

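Split both strings into all possible substrings and store them in an array per string. This has O(n^2) complexity.

Now sort these arrays by string length (O(n*log(n)) complexity) and walk through the two of them to identify matches.

You will need an additional structure to hold the results, and this may need more tweaking, but you get the idea.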
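One way to read this answer for the fixed-length case in the question is sketched below. Two things here are assumptions rather than part of the answer: only substrings of the requested length are generated (so each array holds about n entries instead of n^2), and the arrays are sorted by content rather than by length so that equal substrings end up adjacent and a single merge pass finds the matches. `SortMergeMatcher` and `FindMatches` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class SortMergeMatcher
{
    public sealed class SubString
    {
        public string Str;
        public int Position;
    }

    // All substrings of length len, sorted by content and then by position.
    static List<SubString> Split(string s, int len)
    {
        var list = new List<SubString>();
        for (int i = 0; i + len <= s.Length; i++)
            list.Add(new SubString { Str = s.Substring(i, len), Position = i });
        list.Sort((x, y) =>
        {
            int c = string.CompareOrdinal(x.Str, y.Str);
            return c != 0 ? c : x.Position.CompareTo(y.Position);
        });
        return list;
    }

    // Returns (substring, positions in a, positions in b) for each common substring,
    // ordered by substring.
    public static List<(string Str, List<int> InA, List<int> InB)> FindMatches(
        string a, string b, int len)
    {
        var xs = Split(a, len);
        var ys = Split(b, len);
        var result = new List<(string Str, List<int> InA, List<int> InB)>();
        int i = 0, j = 0;
        while (i < xs.Count && j < ys.Count)
        {
            int c = string.CompareOrdinal(xs[i].Str, ys[j].Str);
            if (c < 0) { i++; continue; }
            if (c > 0) { j++; continue; }

            // Same text on both sides: collect the whole run from each array.
            string key = xs[i].Str;
            var inA = new List<int>();
            var inB = new List<int>();
            while (i < xs.Count && xs[i].Str == key) inA.Add(xs[i++].Position);
            while (j < ys.Count && ys[j].Str == key) inB.Add(ys[j++].Position);
            result.Add((key, inA, inB));
        }
        return result;
    }
}
```

The result list plays the role of the "additional structure" the answer mentions; sorting dominates the cost at O(n log n) string comparisons per input.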

score 1 · Accepted Answer

You can solve this with a variant of a suffix tree. http://en.wikipedia.org/wiki/Longest_common_substring_problem Also see: Algorithm: Find all common substrings between two strings where order is preserved

score 0 · Accepted Answer

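With very large strings, memory can become an issue. The code below finds the longest common substring and overwrites the variable that held any shorter common substring, but it can easily be changed to push the index and length onto a list and then return the results as an array of strings.

This is C++ code by Ashutosh Singh from https://iq.opengenus.org/longest-common-substring-using-rolling-hash/, refactored into C#. It finds the substring in O(N * log(N)^2) time and O(N) space.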

using System;
using System.Collections.Generic;
public class RollingHash
{
    private class RollingHashPowers
    {
        // _mod = prime modulus of polynomial hashing
        // any prime number over a billion should suffice
        internal const int _mod = (int)1e9 + 123;
        // _hashBase = base (point of hashing)
        // this should be a prime number larger than the number of characters used
        // in my use case I am only interested in ASCII (256) characters
        // for strings in languages using non-latin characters, this should be much larger
        internal const long _hashBase = 257;
        // _pow1 = powers of base modulo mod
        internal readonly List<int> _pow1 = new List<int> { 1 };
        // _pow2 = powers of base modulo 2^64
        internal readonly List<long> _pow2 = new List<long> { 1L };

        internal void EnsureLength(int length)
        {
            if (_pow1.Capacity < length)
            {
                _pow1.Capacity = _pow2.Capacity = length;
            }
            for (int currentIndx = _pow1.Count - 1; currentIndx < length; ++currentIndx)
            {
                _pow1.Add((int)(_pow1[currentIndx] * _hashBase % _mod));
                _pow2.Add(_pow2[currentIndx] * _hashBase);
            }
        }
    }

    private class RollingHashedString
    {
        readonly RollingHashPowers _pows;
        readonly int[] _pref1; // Hash on prefix modulo mod
        readonly long[] _pref2; // Hash on prefix modulo 2^64

        // Constructor from string:
        internal RollingHashedString(RollingHashPowers pows, string s, bool caseInsensitive = false)
        {
            _pows = pows;
            _pref1 = new int[s.Length + 1];
            _pref2 = new long[s.Length + 1];

            const long capAVal = 'A';
            const long capZVal = 'Z';
            const long aADif = 'a' - 'A';

            // Fill arrays with polynomial hashes on prefix.
            // Plain string indexing is used, so no unsafe context is required.
            for (int i = 0; i < s.Length; ++i)
            {
                long v = s[i];
                if (caseInsensitive && capAVal <= v && v <= capZVal)
                {
                    // Fold ASCII upper case to lower case before hashing.
                    v += aADif;
                }
                _pref1[i + 1] = (int)((_pref1[i] + v * _pows._pow1[i]) % RollingHashPowers._mod);
                _pref2[i + 1] = _pref2[i] + v * _pows._pow2[i];
            }
        }

        // Polynomial hash of the substring [pos, pos+len)
        // If mxPow != 0, the value is multiplied by the power of the base needed
        // to align it with base ^ mxPow, so hashes taken at different positions are comparable.
        internal Tuple<int, long> Apply(int pos, int len, int mxPow = 0)
        {
            int hash1 = _pref1[pos + len] - _pref1[pos];
            long hash2 = _pref2[pos + len] - _pref2[pos];
            if (hash1 < 0)
            {
                hash1 += RollingHashPowers._mod;
            }
            if (mxPow != 0)
            {
                hash1 = (int)((long)hash1 * _pows._pow1[mxPow - (pos + len - 1)] % RollingHashPowers._mod);
                hash2 *= _pows._pow2[mxPow - (pos + len - 1)];
            }
            return Tuple.Create(hash1, hash2);
        }
    }

    private readonly RollingHashPowers _rhp;
    public RollingHash(int longestLength = 0)
    {
        _rhp = new RollingHashPowers();
        if (longestLength > 0)
        {
            _rhp.EnsureLength(longestLength);
        }
    }

    public string FindCommonSubstring(string a, string b, bool caseInsensitive = false)
    {
        // Calculate the maximum needed power of the base:
        int mxPow = Math.Max(a.Length, b.Length);
        _rhp.EnsureLength(mxPow);
        // Create hashing objects from strings:
        RollingHashedString hash_a = new RollingHashedString(_rhp, a, caseInsensitive);
        RollingHashedString hash_b = new RollingHashedString(_rhp, b, caseInsensitive);

        // Binary search on the length of the common substring:
        int pos = -1;
        int low = 0;
        int minLen = Math.Min(a.Length, b.Length);
        int high = minLen + 1;
        var tupleCompare = Comparer<Tuple<int, long>>.Default;
        while (high - low > 1)
        {
            int mid = (low + high) / 2;
            List<Tuple<int, long>> hashes = new List<Tuple<int, long>>(a.Length - mid + 1);
            for (int i = 0; i + mid <= a.Length; ++i)
            {
                hashes.Add(hash_a.Apply(i, mid, mxPow));
            }
            hashes.Sort(tupleCompare);
            int p = -1;
            for (int i = 0; i + mid <= b.Length; ++i)
            {
                if (hashes.BinarySearch(hash_b.Apply(i, mid, mxPow), tupleCompare) >= 0)
                {
                    p = i;
                    break;
                }
            }
            if (p >= 0)
            {
                low = mid;
                pos = p;
            }
            else
            {
                high = mid;
            }
        }
        // Output answer:
        return pos >= 0
            ? b.Substring(pos, low)
            : string.Empty;
    }
}
