algorithm - 为什么我们需要在 Rabin Karp 算法中每次哈希值相同时检查模式匹配

Question

我不明白为什么我们需要在每次哈希为模式和文本返回相同值时检查子字符串匹配的原因。对于字符串，返回的哈希值不是唯一的吗？

score 3 · Accepted Answer

Rabin Karp 算法中使用的散列函数是“滚动散列”，例如Rabin Fingerprint，之所以选择它是因为它的属性是可以根据先前的散列轻松计算散列，而不是因为它的抗碰撞性。

在 Rabin Karp 算法中，我们需要计算一个滑动子串的哈希值。例如，假设我们在此文本中搜索 24 个字符的字符串：

"this is the text we are comparing"

我们需要计算这些子字符串的哈希值：

"this is the text we are "
"his is the text we are c"
"is is the text we are co"
"s is the text we are com"
" is the text we are comp"
"is the text we are compa"
"s the text we are compar"
" the text we are compari"
"the text we are comparin"
"he text we are comparing"

所以我们选择一个“滚动哈希”函数，在计算第一个子字符串的哈希之后，我们可以使用第一个哈希、从子字符串中删除的字符和添加的字符来计算第二个子字符串的哈希对它：

"this is the text we are "  ->  hash1
"his is the text we are c"  ->  hash1 -t +c  ->  hash2

这样的“滚动散列”函数不一定是找到两个具有相同散列的字符串的可能性很小，就像在加密散列函数中一样。因此，哈希相同的事实并不能保证子字符串与搜索字符串相同；因此我们需要做一个完整的字符串比较来确定。

请注意，任何创建比输入短的散列的散列函数必然会发生冲突。而使用比输入字符串短得多的哈希值是 Rabin Karp 算法的重点；比较哈希值比比较长字符串更有效。

algorithm - 为什么我们需要在 Rabin Karp 算法中每次哈希值相同时检查模式匹配

1 回答 1

Related

Reference