Boyer-Moore 的算法利用了这样一个事实,即在较长的字符串中搜索“HELLO WORLD”时,如果要找到匹配项,您在给定位置找到的字母会限制在该位置周围可以找到的内容,排序一个海战游戏:如果你在离边界的四个单元格发现公海,你不需要测试剩下的四个单元格,以防有一个 5 单元格的航母躲在那里;不可能。
例如,如果您在第十一个位置找到一个“D”,它可能是 HELLO WORLD 的最后一个字母;但是如果你发现一个'Q','Q'不在HELLO WORLD中的任何地方,这意味着搜索的字符串不能在前11个字符中的任何地方,你可以完全避免在那里搜索。另一方面,“L”可能意味着 HELLO WORLD 存在,从位置 11-3(HELLO WORLD 的第三个字母是 L)、11-4 或 11-10 开始。
搜索时,您可以使用两个增量数组跟踪这些可能性。
所以当你找到一种模式时,你应该这样做,
if (j < 0)
{
// Found a pattern from position i+1 to i+1+patlen
// Add vector or whatever is needed; check we don't overflow it.
if (index_size+1 >= index_counter)
{
index[index_counter] = 0;
return index_size;
}
index[index_counter++] = i+1;
// Reinitialize j to restart search
j = patlen-1;
// Reinitialize i to start at i+1+patlen
i += patlen +1; // (not completely sure of that +1)
// Do not free delta2
// free(delta2);
// Continue loop without altering i again
continue;
}
i += max(delta1[string[i]], delta2[j]);
}
free(delta2);
index[index_counter] = 0;
return index_counter;
这应该返回一个以零结尾的索引列表,前提是您将 a 之类的东西传递size_t *indexes
给函数。
然后该函数将返回 0(未找到)、index_size(匹配太多)或 1 和 index_size-1 之间的匹配数。
例如,这允许添加额外的匹配项,而不必重复对已找到的 (index_size-1) 子字符串的整个搜索;你增加num_indexes
new_num,realloc
数组indexes
,然后将新数组在offset处传递给函数old_index_size-1
,new_num作为新大小,干草堆字符串从索引处匹配的偏移量old_index_size-1
加一开始(不是,正如我在之前的修订版中所写,加上针线的长度;见评论)。
这种方法也会报告重叠匹配,例如在香蕉中搜索ana会找到 b* ana *na 和 ban* ana *。
更新
我测试了上面的,它似乎工作。我通过添加这两个包含来修改 Wikipedia 代码,以防止 gcc 抱怨
#include <stdio.h>
#include <string.h>
然后我修改了if (j < 0)
它以简单地输出它找到的内容
if (j < 0) {
printf("Found %s at offset %d: %s\n", pat, i+1, string+i+1);
//free(delta2);
// return (string + i+1);
i += patlen + 1;
j = patlen - 1;
continue;
}
最后我用这个测试了
int main(void)
{
char *s = "This is a string in which I am going to look for a string I will string along";
char *p = "string";
boyer_moore(s, strlen(s), p, strlen(p));
return 0;
}
并得到了,正如预期的那样:
Found string at offset 10: string in which I am going to look for a string I will string along
Found string at offset 51: string I will string along
Found string at offset 65: string along
如果字符串包含两个重叠序列,则找到 BOTH:
char *s = "This is an andean andeandean andean trouble";
char *p = "andean";
Found andean at offset 11: andean andeandean andean trouble
Found andean at offset 18: andeandean andean trouble
Found andean at offset 22: andean andean trouble
Found andean at offset 29: andean trouble
为了避免重叠匹配,最快的方法是不存储重叠。它可以在函数中完成,但这意味着重新初始化第一个增量向量并更新字符串指针;我们还需要存储第二个i
索引,i2
以防止保存的索引变得非单调。这不值得。更好的:
if (j < 0) {
// We have found a patlen match at i+1
// Is it an overlap?
if (index && (indexes[index] + patlen < i+1))
{
// Yes, it is. So we don't store it.
// We could store the last of several overlaps
// It's not exactly trivial, though:
// searching 'anana' in 'Bananananana'
// finds FOUR matches, and the fourth is NOT overlapped
// with the first. So in case of overlap, if we want to keep
// the LAST of the bunch, we must save info somewhere else,
// say last_conflicting_overlap, and check twice.
// Then again, the third match (which is the last to overlap
// with the first) would overlap with the fourth.
// So the "return as many non overlapping matches as possible"
// is actually accomplished by doing NOTHING in this branch of the IF.
}
else
{
// Not an overlap, so store it.
indexes[++index] = i+1;
if (index == max_indexes) // Too many matches already found?
break; // Stop searching and return found so far
}
// Adapt i and j to keep searching
i += patlen + 1;
j = patlen - 1;
continue;
}