java - Can the Boyer-Moore algorithm be altered to search for "full words" only?

Question

I've written a Java function that implements the Boyer-Moore algorithm to search for a given substring in a char array. It returns a list of every index where the substring is found in the array. For example, if the char array being searched contained the phrase "The Walking Dead" and the substring given as a parameter was "king", a list of size one containing the value 7 would be returned.

I would like to change this function so that only indexes of substrings that are full words in the char array would be returned. So the previous example would return an empty list, but if the substring was changed to "The", "Walking" or "Dead", lists of size 1 would be returned with values 0, 4, and 12 respectively.

Is this sort of functionality possible to implement using the Boyer-Moore algorithm? Are there any other string searching algorithms that would be able to produce these results efficiently?

score 3 · Accepted Answer

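This may not be the kind of answer you're looking for, but you could change the parameters rather than the algorithm: add a space to the beginning and end of the search string, and to the beginning and end of the target string (in case the first or last word is a hit). You would also need special handling for punctuation and other non-word characters.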
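A minimal sketch of the padding idea, under some assumptions: `findAll` is a hypothetical stand-in for the asker's Boyer-Moore routine (here it just uses `String.indexOf`), and "special handling" for punctuation is taken to mean replacing every non-letter, non-digit character with a space. All class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class PaddedSearch {
    // Hypothetical stand-in for the asker's Boyer-Moore search: any routine
    // that returns every match index (including overlapping ones) would do.
    static List<Integer> findAll(String text, String pattern) {
        List<Integer> hits = new ArrayList<>();
        for (int i = text.indexOf(pattern); i >= 0; i = text.indexOf(pattern, i + 1)) {
            hits.add(i);
        }
        return hits;
    }

    // Assumption: punctuation and other non-word characters are treated as
    // separators by turning them into spaces. The length is unchanged, so
    // indexes still line up with the original text.
    static String normalize(String s) {
        char[] chars = s.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            if (!Character.isLetterOrDigit(chars[i])) {
                chars[i] = ' ';
            }
        }
        return new String(chars);
    }

    static List<Integer> findWholeWords(String text, String word) {
        String paddedText = " " + normalize(text) + " ";
        String paddedWord = " " + word + " ";
        // A hit at index i in the padded text points at the leading space;
        // the word starts at i + 1 there, which is index i in the original.
        return findAll(paddedText, paddedWord);
    }
}
```

With the text "The Walking Dead", `findWholeWords` returns an empty list for "king" and the indexes 0, 4 and 12 for "The", "Walking" and "Dead". Note that adjacent words share the space between them, so the underlying search has to report overlapping matches; a search word that itself contains punctuation would need extra care with this normalization.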

score 0 · Accepted Answer

Just use Java's Pattern class; it already implements Boyer-Moore internally. Then '\b' matches a word boundary. For example:

    Pattern pattern = Pattern.compile("\\b" + Pattern.quote(needle) + "\\b");
    Matcher m = pattern.matcher(haystack);
    while (m.find()) {
        System.out.println(m.start());
    }

score 0 · Accepted Answer

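Yes, you can adapt Boyer-Moore to do this. Some options:

After each "match", you could check whether the start and end positions of the match are at word boundaries.
You could change the search from "king" to 'word-boundary + "king" + word-boundary', where 'word-boundary' is a pseudo-character that your modified BM matches against any word-boundary character.
You could preprocess the input to replace all whitespace, punctuation, etc. with a special character that means "word boundary", and then do the search.

Which of these is likely to be better depends on how you implement them... and on whether you are going to search the same input text repeatedly.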
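The first option (checking each reported match against word boundaries) might look roughly like the sketch below. It assumes the existing search already produces a list of match indexes, that a "word character" is a letter, digit or underscore, and that the pattern itself begins and ends with word characters. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class WordBoundaryFilter {
    // Assumption: letters, digits and '_' count as word characters.
    static boolean isWordChar(char c) {
        return Character.isLetterOrDigit(c) || c == '_';
    }

    // True if the match occupying [start, start + length) has no word
    // character immediately before or after it.
    static boolean isWholeWord(char[] text, int start, int length) {
        int end = start + length;
        boolean leftOk = start == 0 || !isWordChar(text[start - 1]);
        boolean rightOk = end == text.length || !isWordChar(text[end]);
        return leftOk && rightOk;
    }

    // Keeps only those match indexes (as returned by the existing search)
    // that sit on word boundaries.
    static List<Integer> keepWholeWords(char[] text, List<Integer> matches, int patternLength) {
        List<Integer> result = new ArrayList<>();
        for (int start : matches) {
            if (isWholeWord(text, start, patternLength)) {
                result.add(start);
            }
        }
        return result;
    }
}
```

The check costs two character lookups per match, so it leaves the search itself untouched and does not require copying or preprocessing the input.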
