java - Rabin-Karp not working for large primes (gives wrong output)

Question

I was working on this problem (the Rabin-Karp algorithm) and wrote this solution:

private static void searchPattern(String text, String pattern) {
    int txt_len = text.length(), pat_len = pattern.length();
    int hash_pat = 0, hash_txt = 0; // hash values for pattern and text's substrings
    final int mod = 100005;         // prime number to calculate modulo... larger modulo denominator reduces collisions in hash
    final int d = 256;              // to include all the ascii character codes
    int coeff = 1;                  // stores the multiplier (or coeffecient) for the first index of the sliding window

    /* 
     * HASHING PATTERN:
     * say text    = "abcd", then
     * hashed text = 256^3 *'a' + 256^2 *'b' + 256^1 *'c' + 256^0 *'d'
     */

    // The value of coeff would be "(d^(pat_len - 1)) % mod"
    for (int i = 0; i < pat_len - 1; i++)
        coeff = (coeff * d) % mod;

    // calculate hash of the first window and the pattern itself
    for (int i = 0; i < pat_len; i++) {
        hash_pat = (d * hash_pat + pattern.charAt(i)) % mod;
        hash_txt = (d * hash_txt + text.charAt(i)) % mod;
    }

    for (int i = 0; i < txt_len - pat_len; i++) {
        if (hash_txt == hash_pat) {
            // our chances of collisions are quite less (1/mod) so we dont need to recheck the substring
            System.out.println("Pattern found at index " + i);
        }
        hash_txt = (d * (hash_txt - text.charAt(i) * coeff) + text.charAt(i + pat_len)) % mod; // calculating next window (i+1 th index)

        // We might get negative value of t, converting it to positive
        if (hash_txt < 0)
            hash_txt = hash_txt + mod;
    }
    if (hash_txt == hash_pat) // checking for the last window
        System.out.println("Pattern found at index " + (txt_len - pat_len));
}

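Now, this code does not work at all if mod = 1000000007, whereas as soon as we take some other prime (large enough, like 1e5+7), the code magically starts working!

The line at which the code's logic fails is: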

hash_txt = (d * (hash_txt - text.charAt(i) * coeff) + text.charAt(i + pat_len)) % mod;

Can someone tell me why this is happening? Maybe this is a silly doubt, but I just don't understand it.

score 3 · Accepted Answer

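In Java, an int is a 32-bit integer. If a calculation with such numbers mathematically yields a result that needs more binary digits, the excess digits are silently discarded. This is called overflow.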
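As a small illustration of that silent wrap-around, the sketch below multiplies a value just under 1000000007 by d = 256, once in int arithmetic and once after widening to long. The class and method names are made up for this example; `Math.multiplyExact` is the standard library call that throws instead of wrapping.

```java
public class OverflowDemo {
    // Multiplies in 32-bit int arithmetic; bits above the 32nd are silently dropped.
    static int multiplyWrapped(int a, int b) {
        return a * b;
    }

    // Widens to 64 bits before multiplying, so the exact product is kept.
    static long multiplyWide(int a, int b) {
        return (long) a * b;
    }

    // Reports whether the int product of a and b would overflow.
    static boolean overflows(int a, int b) {
        try {
            Math.multiplyExact(a, b); // throws ArithmeticException on int overflow
            return false;
        } catch (ArithmeticException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        int hash = 1_000_000_006; // largest possible residue modulo 1000000007
        System.out.println("int product:  " + multiplyWrapped(hash, 256));
        System.out.println("long product: " + multiplyWide(hash, 256));
        System.out.println("overflows:    " + overflows(hash, 256));
    }
}
```

The two printed products differ, which is the kind of corruption that later makes hash_txt and hash_pat disagree even on a true match.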

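To avoid this, the Rabin-Karp algorithm reduces the result modulo some prime in every step, which keeps the numbers small enough that the next step cannot overflow. For this to work, the chosen prime must be small enough that

d * (hash + max(char) * coeff) + max(char) < max(int)

Since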

0 ≤ hash < p,
1 ≤ coeff < p,
max(char) = 2¹⁶,
max(int) = 2³¹,

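any prime smaller than 2⁷ = 128 will do. For larger primes, it depends on what their coeff ends up being, but even if we pick a prime with the smallest possible coeff = 1, the prime must not exceed 2²³, which is far smaller than the prime you are using.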
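The arithmetic behind those two bounds can be spelled out as follows. This is a rough worst-case sketch that uses the same rounded values as above (d = 2⁸ from the question's code, max(char) = 2¹⁶, max(int) = 2³¹) and bounds the magnitude of the intermediate value with a plus sign, as the inequality above does.

```latex
\begin{align*}
\text{General case } (\mathit{hash} < p,\ \mathit{coeff} < p):\quad
  & d\,(\mathit{hash} + 2^{16}\,\mathit{coeff}) + 2^{16}
    < 2^{8}\,(p + 2^{16} p) + 2^{16} \approx 2^{24}\,p, \\
  & 2^{24}\,p \lesssim 2^{31} \;\Longrightarrow\; p \lesssim 2^{7} = 128. \\[4pt]
\text{Best case } (\mathit{coeff} = 1):\quad
  & d\,(\mathit{hash} + 2^{16}) + 2^{16}
    < 2^{8}\,(p + 2^{16}) + 2^{16}, \\
  & 2^{8}\,(p + 2^{16}) + 2^{16} < 2^{31}
    \;\Longrightarrow\; p < 2^{23} - 2^{16} - 2^{8} < 2^{23}. \\[4pt]
\text{The prime in the question:}\quad
  & 1\,000\,000\,007 \approx 2^{29.9} \gg 2^{23}.
\end{align*}
```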

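In practice, Rabin-Karp is therefore implemented with an integer data type that is significantly larger than the character type, such as a long (64 bits). Then any prime < 2³⁹ will do.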
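A minimal sketch of what that change could look like: the same rolling-hash update as in the question, but with long intermediates and mod = 1000000007. The class name and the helper names hash, leadCoeff, and roll are invented for this example and do not come from the question.

```java
public class RollingHashLong {
    static final long D = 256;              // radix, as in the question
    static final long MOD = 1_000_000_007L; // the prime that failed with int arithmetic

    // Hash of s[from, from + len), computed from scratch.
    static long hash(String s, int from, int len) {
        long h = 0;
        for (int i = from; i < from + len; i++)
            h = (D * h + s.charAt(i)) % MOD;
        return h;
    }

    // D^(len - 1) mod MOD: the weight of the window's leading character.
    static long leadCoeff(int len) {
        long c = 1;
        for (int i = 0; i < len - 1; i++)
            c = (c * D) % MOD;
        return c;
    }

    // Slides the window one position to the right: drops `out`, appends `in`.
    // All intermediates stay below about 2^54 in magnitude, well inside a long.
    static long roll(long h, char out, char in, long coeff) {
        long next = (D * (h - out * coeff) + in) % MOD;
        return next < 0 ? next + MOD : next; // Java's % can yield a negative value
    }

    public static void main(String[] args) {
        String text = "the quick brown fox";
        int len = 5;
        long coeff = leadCoeff(len);
        long h = hash(text, 0, len);
        for (int i = 0; i + len < text.length(); i++) {
            h = roll(h, text.charAt(i), text.charAt(i + len), coeff);
            // The rolled hash should agree with a hash recomputed from scratch.
            System.out.println((i + 1) + ": " + (h == hash(text, i + 1, len)));
        }
    }
}
```

Running main compares each rolled hash against one recomputed from scratch, which is a quick way to check that no step has overflowed.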

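Even then, it is worth noting that your reasoning

our chances of collisions are quite less (1/mod) so we dont need to recheck the substring

is flawed, because the probability of a collision is not a matter of chance; it is determined by the strings being checked. Unless you know the probability distribution of your inputs, you cannot know what the probability of failure is. That is why Rabin-Karp rechecks the string to make sure.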
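To make the recheck concrete, here is a hedged sketch of a search that confirms every hash match with a character comparison. The class name, the method signatures, and the mod parameter are assumptions made for this example; the modulus is passed in only so that a deliberately tiny one (101) can expose a collision.

```java
import java.util.ArrayList;
import java.util.List;

public class VerifiedRabinKarp {
    static final long D = 256; // radix, as in the question

    // Hash of the first len characters of s.
    static long hash(String s, int len, long mod) {
        long h = 0;
        for (int i = 0; i < len; i++)
            h = (D * h + s.charAt(i)) % mod;
        return h;
    }

    // Returns every index at which pattern occurs in text.
    // Assumes mod < 2^39 so that no intermediate value overflows a long.
    static List<Integer> search(String text, String pattern, long mod) {
        List<Integer> hits = new ArrayList<>();
        int n = text.length(), m = pattern.length();
        if (m == 0 || m > n) return hits;

        long coeff = 1; // D^(m - 1) mod mod
        for (int i = 0; i < m - 1; i++)
            coeff = (coeff * D) % mod;

        long hp = hash(pattern, m, mod), ht = hash(text, m, mod);
        for (int i = 0; ; i++) {
            // Equal hashes only make a match possible; compare characters to confirm it.
            if (ht == hp && text.regionMatches(i, pattern, 0, m))
                hits.add(i);
            if (i + m >= n) break; // that was the last window
            ht = (D * (ht - text.charAt(i) * coeff) + text.charAt(i + m)) % mod;
            if (ht < 0) ht += mod;
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(search("abcabcabc", "abc", 1_000_000_007L));
        // 'a' (97) and '\u00C6' (198) are congruent modulo 101, so their hashes collide;
        // the character comparison rejects the false match.
        System.out.println(search("\u00C6", "a", 101));
    }
}
```

With mod = 101 the one-character strings "a" and "\u00C6" hash to the same value, so a version without the regionMatches check would report a match at index 0. An input chosen that way defeats any fixed modulus, which is the point made above.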
