java - What are the most efficient ways to write regex with boundary matching in Java?

Question

I found that the word boundary works well for making sure that exactly that word is found in the text, and that we don't match parts of other words that merely contain it. However, I noticed it works poorly at the start and end of the string.

Ideally, I would expect a regex like this to also work at the start and end of the string, because that is also where a word starts or ends:

String regex1 = "\\b" + searchedWord + "\\b";

However, it turned out I had to change the regex as follows to make it work at the start and end of the string as well:

String regex2 = "(^|\\b)" + searchedWord + "($|\\b)";

I haven't found any side effects of the latter regex yet, but I would like to know whether there is a special boundary construct, or a more efficient way to write the boundary, that is less ugly and less counter-intuitive.

Does anybody know a better way? Feel free to improve my suggested regex too if you are aware of any problems with it.

score 0 · Accepted Answer

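If the first and last characters of searchedWord are word characters, there will be no side effects.

The "side effects" only appear when the character at one or both ends of searchedWord is a non-word character.

Now, \b can match in four positions: between the start of the string and a word character, between a non-word character and a word character, between a word character and a non-word character, and between a word character and the end of the string. If you need to make sure there is no word character before searchedWord, you can use the unambiguous negative lookbehind (?<!\w), and to make sure there is no word character after it, you can use the negative lookahead (?!\w).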
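To illustrate the difference, here is a minimal sketch comparing the \b pattern from the question with the lookaround version. The class and method names are hypothetical, and the Pattern.quote call is an addition of mine (not part of the original patterns) so that a term such as "C++" is treated literally.

```java
import java.util.regex.Pattern;

final class WordBoundaryDemo {

    // \b-based pattern from the question; Pattern.quote keeps regex metacharacters literal.
    static boolean findWithWordBoundary(String text, String searchedWord) {
        String regex = "\\b" + Pattern.quote(searchedWord) + "\\b";
        return Pattern.compile(regex).matcher(text).find();
    }

    // Lookaround-based pattern: no word character directly before or after the term.
    static boolean findWithLookarounds(String text, String searchedWord) {
        String regex = "(?<!\\w)" + Pattern.quote(searchedWord) + "(?!\\w)";
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // \b already matches at the start and end of the string when the term
        // begins and ends with word characters.
        System.out.println(findWithWordBoundary("cat", "cat"));          // true
        System.out.println(findWithWordBoundary("concatenate", "cat"));  // false

        // The trailing \b fails after '+', because '+' and ' ' are both non-word characters.
        System.out.println(findWithWordBoundary("C++ is fast", "C++"));  // false
        System.out.println(findWithLookarounds("C++ is fast", "C++"));   // true
    }
}
```

The lookarounds only assert what must not be adjacent to the term, so they behave the same whether the term starts or ends with a word character or not.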

Also keep in mind that \b, like \w, is not Unicode-aware by default. Add the Pattern.UNICODE_CHARACTER_CLASS flag or the inline (?U) modifier:

String regex1 = "(?U)(?<!\\w)" + searchedWord + "(?!\\w)";
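As a rough sketch of what (?U) changes, the example below searches for "caf" inside "café". The class and method names are hypothetical, and Pattern.quote is again my own addition. Without (?U), \w covers only ASCII word characters, so the trailing "é" is not seen as part of the word.

```java
import java.util.regex.Pattern;

final class UnicodeBoundaryDemo {

    // Builds the lookaround pattern, optionally prefixed with the inline (?U) flag
    // (the inline form of Pattern.UNICODE_CHARACTER_CLASS).
    static Pattern build(String searchedWord, boolean unicode) {
        String prefix = unicode ? "(?U)" : "";
        return Pattern.compile(prefix + "(?<!\\w)" + Pattern.quote(searchedWord) + "(?!\\w)");
    }

    static boolean found(String text, String searchedWord, boolean unicode) {
        return build(searchedWord, unicode).matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "café", written with an escape to avoid source-encoding issues

        // Without (?U), 'é' is not a \w character, so "caf" looks like a whole word.
        System.out.println(found(text, "caf", false)); // true

        // With (?U), 'é' counts as a word character and the partial match is rejected.
        System.out.println(found(text, "caf", true));  // false
    }
}
```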

Another common approach is to require whitespace (or the start/end of the string) around the term:

String regex1 = "(?U)(?<!\\S)" + searchedWord + "(?!\\S)";

However, this will not match when the term is immediately preceded or followed by punctuation.
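A small sketch of that trade-off, using hypothetical class and method names and Pattern.quote as an extra safeguard that is not in the original pattern:

```java
import java.util.regex.Pattern;

final class WhitespaceBoundaryDemo {

    // Requires whitespace, or the start/end of the string, on both sides of the term.
    static boolean found(String text, String searchedWord) {
        String regex = "(?U)(?<!\\S)" + Pattern.quote(searchedWord) + "(?!\\S)";
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(found("foo bar", "foo"));    // true: string start before, space after
        System.out.println(found("C++ rocks", "C++"));  // true: non-word ends are not a problem here
        System.out.println(found("foo, bar", "foo"));   // false: a comma follows the term
        System.out.println(found("say (foo)", "foo"));  // false: parentheses surround the term
    }
}
```

If terms next to punctuation should still count as matches, the (?<!\w) and (?!\w) lookarounds are likely the better fit.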
