4

I'm trying to find a substring from a string text that is an anagram to a string pattern.

My Question: Could the Rabin-Karp algorithm be adjusted to this purpose? Or are there better algorithms?

I have tried out a brute-force algorithm, which did not work in my case because the text and the pattern can each be up to one million characters.

Update: I've heard there is a worst-case O(n2) algorithm that uses O(1) space. Does anyone know what this algorithm is?

Update 2: For reference, here is pseudocode for the Rabin-Karp algorithm:

function RabinKarp(string s[1..n], string sub[1..m])
    hsub := hash(sub[1..m]);  hs := hash(s[1..m])
    for i from 1 to n-m+1
       if hs = hsub
          if s[i..i+m-1] = sub
              return i
       hs := hash(s[i+1..i+m])
    return not found

This uses a rolling hash function to allow calculating the new hash in O(1), so the overall search is O(nm) in the worst-case, but with a good hash function is O(m + n) in the best case. Is there a rolling hash function that would produce few collisions when searching for anagrams of the string?

4

4 回答 4

9

计算不依赖于模式中字母顺序的模式哈希(例如,使用每个字母的字符代码之和)。然后以“滚动”方式对文本应用相同的散列函数,就像在 Rabin-Karp 中一样。如果哈希匹配,您需要针对文本中的当前窗口执行模式的完整测试,因为哈希也可能与其他值发生冲突。


通过将字母表中的每个符号与质数相关联,然后将这些质数的乘积计算为哈希码,您将减少冲突。

但是,如果您想计算这样的运行乘积,有一些数学技巧可以帮助您:每次跨窗口时,将运行的哈希码乘以符号代码的乘法倒数,即离开窗口,然后乘以进入窗口的符号的代码。

例如,假设您将字母“a”-“z”的哈希值计算为无符号的 64 位值。使用这样的表:

符号 | 代码 | 代码-1
--------+------+---------
   一个 | 3 | 12297829382473034411
   乙 | 5 | 14757395258967641293
   c | 7 | 7905747460161236407
   d | 11 | 3353953467947191203
   电子| 13 | 5675921253449092805
  ...
   z | 103 | 15760325033848937303

n的乘法逆元是乘以n时产生 1 的数,对某个数取模。这里的模数是 2 64,因为您使用的是 64 位数字。所以,5 * 14757395258967641293应该是1,例如。这行得通,因为您只是在GF (2 64 )中相乘。

计算第一个素数的列表很容易,您的平台应该有一个库来有效地计算这些数字的乘法倒数。

从数字 3 开始编码,因为 2 是整数大小的互质数(在您正在处理的任何处理器上都是 2 的幂),并且不能反转。

于 2013-02-03T01:27:42.763 回答
8

One option would be to maintain a sliding window holding a histogram of the letters contained within the window. If that histogram ever ends up equal to the character histogram for the string whose anagram should be found, then you know that what you are looking at is a match and can output it. If not, you know that what you have cannot possibly be a match.

More concretely, create an associative array A mapping from characters to their frequencies. If you want to search for an anagram of string P, then read the first |P| characters from the text string T into A and build the histogram appropriately. You can slide the window one step forward and update A in O(1) associative array operations by decrementing the frequency associated with the first character in the window, then incrementing the frequency associated with the new character that has slid into the window.

If the histograms of the current window and the pattern window are very different, then you should be able to compare them rather quickly. Specifically, let's say that your alphabet is Σ. In the worst case, comparing two histograms would take time O(|Σ|), since you'd have to check each character/frequency pair in the histogram A with the reference histogram. In the best case, though, you'd immediately find a character that causes a mismatch between A and the reference histogram, so you would not need to look at many characters overall.

In theory the worst-case runtime for this approach is O(|T||Σ| + |P|), since you have to do O(n) work to build the initial histogram, then have to do worst-case Σ work per character in T. However, I'd expect that this is probably a lot faster in practice.

Hope this helps!

于 2013-02-03T00:08:49.207 回答
3
  1. 创建一个包含 26 个整数(设置为零)的数组 letter_counts 和一个变量 missing_count 来保存丢失字母的计数。

  2. 对于子字符串中的每个字母,将 letter_counts 的相关 int 递减 1,并将 missing_count 递增 1(因此 missing_count 最终将等于子字符串的大小)。

  3. 假设子串的大小为 k。查看字符串的前 k 个字母。将 letter_counts 的关联 int 增加 1。如果在增加后,值 <= 0,则将 missing_count 减 1。

  4. 现在,我们像这样沿着字符串“前滚”。一个。删除最接近窗口开头的字母,减少 letter_counts 的关联成员。如果在递减之后,我们有一个 int < 0,那么将 missing_count 增加 1。将字符串的第一个字母添加到窗口之外。增加 letter_counts 的关联成员。如果在递增之后我们有一个 int <= 0,那么将 missing_count 减 1。

如果在任何时候missing_count == 0,我们的窗口中有一个搜索字符串的字谜。

我们维护的不变量是 missing_count 保存我们的子字符串中不在我们窗口中的字母数。当它为零时,我们窗口中的字母与我们的子字符串中的字母完全匹配。

这是 Theta(n) - 线性时间,因为我们只查看每个字母一次。

- - 编辑 - -

letter_counts 只需要存储子串的不同字母,并且只需要保存与子串大小(有符号)一样大的整数。因此,内存使用量在子字符串的大小上是线性的,但在字符串的大小上是恒定的。

于 2013-02-12T15:51:13.660 回答
0

提出这个建议可能对我来说很愚蠢,但一种替代方法可能是将两个字符串分解为数组,然后逐个字符地递归搜索它们。

为了避免重复的字符匹配,如果在text数组中找到一个字符,则删除其各自的数组索引,有效地缩短每次匹配的完成数组扫描时间,同时确保text包含 2x 'B' 的 a 获胜'不匹配 apattern与 3x 'B'。

为了提高性能,您可以在逐个字符计数之前扫描两个字符串,并列出每个字符串中存在哪些字母,然后比较这些列表以查看是否存在任何差异(例如尝试查找“apple”中的字母“z”),如果有则将该字符串标记为“Anagram not possible”。

于 2013-02-12T15:02:49.247 回答