regex - What is a word boundary in regex?

Question

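I am trying to use regular expressions to match space-separated numbers. I cannot find a precise definition of \b ("word boundary"). I had assumed that -12 would be an "integer word" (matched by \b\-?\d+\b), but this does not seem to work. I'd be grateful to know of ways to do this.

[I am using Java regexes in Java 1.6]

Example: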

Pattern pattern = Pattern.compile("\\s*\\b\\-?\\d+\\s*");
String plus = " 12 ";
System.out.println(""+pattern.matcher(plus).matches());

String minus = " -12 ";
System.out.println(""+pattern.matcher(minus).matches());

pattern = Pattern.compile("\\s*\\-?\\d+\\s*");
System.out.println(""+pattern.matcher(minus).matches());

This returns:

true
false
true

score 140 · Accepted Answer

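In most regex dialects, a word boundary is a position between \w and \W (a non-word character), or the beginning or end of the string if the string begins or ends (respectively) with a word character ([0-9A-Za-z_]).

So, in the string "-12", it matches before the 1 or after the 2. The dash is not a word character.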
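To make those positions concrete, here is a minimal sketch that lists every index at which \b matches. The class and method names are made up for illustration, and the inputs are plain ASCII.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryPositions {
    // Returns every index in the input at which the zero-width \b matches.
    static List<Integer> boundaries(String input) {
        List<Integer> positions = new ArrayList<>();
        Matcher m = Pattern.compile("\\b").matcher(input);
        while (m.find()) {
            positions.add(m.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        // Index 0 (before the dash) is not a boundary; index 1 (before the 1) is.
        System.out.println(boundaries("-12"));  // [1, 3]
        System.out.println(boundaries(" 12 ")); // [1, 3]
    }
}
```

This is why a pattern that starts with \b\-? cannot match the dash: there is no boundary in front of it.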

score 50 · Accepted Answer

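While learning regular expressions, I was really stuck on the metacharacter \b. I kept asking myself "what is it, what is it" and did not grasp its meaning. After a few experiments on the website, I noticed the pink vertical dashes at the beginning and at the end of every word. At that point its meaning became clear to me: it is exactly the word (\w) boundary.

My view is purely intuition-oriented. The logic behind it should be checked in the other answers.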

score 33 · Accepted Answer

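A word boundary can occur in one of three positions:

Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.

Word characters are alphanumeric (plus the underscore); a minus sign is not. Taken from the Regex Tutorial.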

score 20 · Accepted Answer

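I would like to explain Alan Moore's answer

A word boundary is a position that is either preceded by a word character and not followed by one, or followed by a word character and not preceded by one.

Suppose I have the string "This is a cat, and she's awesome", and I want to replace every occurrence of the letter 'a' only if that letter sits at the "boundary of a word",

In other words: the letter a inside 'cat' should not be replaced.

So I would run the regex (in Python) as

re.sub(r"\ba", "e", myString.strip())  # replace a with e

So the output will be

This is a cat, and she's awesome ->

This is e cat, end she's ewesome // result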
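Since the question is about Java, the same replacement can be sketched with String.replaceAll. The class and method names are illustrative only.

```java
class ReplaceAtBoundary {
    // Replaces an 'a' with 'e' only where a word boundary sits directly before it.
    static String replaceLeadingA(String text) {
        return text.trim().replaceAll("\\ba", "e");
    }

    public static void main(String[] args) {
        System.out.println(replaceLeadingA("This is a cat, and she's awesome"));
        // This is e cat, end she's ewesome
    }
}
```

The a in "cat" is preceded by the word character c, so there is no boundary before it and it is left alone.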

score 16 · Accepted Answer

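A word boundary is a position that is either preceded by a word character and not followed by one, or followed by a word character and not preceded by one.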
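That definition translates almost directly into code. The following is a hand-written sketch of the rule, not how the regex engine implements it; it assumes ASCII word characters ([A-Za-z0-9_]), and the helper names are hypothetical.

```java
class ManualBoundary {
    // ASCII word character: the same set as [A-Za-z0-9_].
    static boolean isWordChar(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                || (c >= '0' && c <= '9') || c == '_';
    }

    // True if pos (0..s.length()) has a word character on exactly one side.
    static boolean isBoundary(String s, int pos) {
        boolean before = pos > 0 && isWordChar(s.charAt(pos - 1));
        boolean after = pos < s.length() && isWordChar(s.charAt(pos));
        return before != after;
    }

    public static void main(String[] args) {
        String s = "-12";
        for (int i = 0; i <= s.length(); i++) {
            System.out.println(i + ": " + isBoundary(s, i));
        }
    }
}
```

The start and end of the string count as "no word character", which is why position 0 of "-12" is not a boundary while position 3 is.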

score 10 · Accepted Answer

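I talk about what \b-style regex boundaries actually are here.

The short story is that they are conditional. Their behavior depends on what they are next to.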

# same as using a \b before:
(?(?=\w) (?<!\w)  | (?<!\W) )

# same as using a \b after:
(?(?<=\w) (?!\w)  | (?!\W)  )

Sometimes that is not what you want. See my other answer for elaboration.

score 7 · Accepted Answer

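I ran into an even worse problem when searching text for words like .NET, C++, C#, and C. You would think that computer programmers would know better than to give a language a name that is hard to write regular expressions for.

Anyway, this is what I found out (summarized mostly from http://www.regular-expressions.info, which is a great site): in most regex flavors, the characters matched by the shorthand character class \w are the characters that word boundaries treat as word characters. Java is an exception. Java supports Unicode for \b but not for \w. (I'm sure there was a good reason for it at the time.)

\w stands for "word character". It always matches the ASCII characters [A-Za-z0-9_]. Notice the inclusion of the underscore and digits (but not the dash!). In most flavors that support Unicode, \w includes many characters from other scripts. There is a lot of inconsistency about which characters are actually included. Letters and digits from alphabetic scripts and ideographs are generally included. Connector punctuation other than the underscore, and numeric symbols that are not digits, may or may not be included. XML Schema and XPath even include all symbols in \w. But Java, JavaScript, and PCRE match only ASCII characters with \w.

Which is why Java-based regex searches for C++, C# or .NET (even when you remember to escape the period and the pluses) are tripped up by \b.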
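A short sketch of the failure, using the same sample sentence as the test code in this answer. The lookaround patterns are an alternative I am swapping in for illustration, not the delimiter-group approach this answer goes on to use, and the class and method names are invented.

```java
import java.util.regex.Pattern;

class LanguageNameSearch {
    // True if the regex matches anywhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "Programming in C, (C++) C#, Java, and .NET.";

        // \b needs a word character on exactly one side, so these fail:
        System.out.println(found("\\bC\\+\\+\\b", text)); // false: no boundary between + and )
        System.out.println(found("\\bC#\\b", text));      // false: no boundary between # and ,
        System.out.println(found("\\b\\.NET\\b", text));  // false: no boundary between space and .

        // Lookarounds that only require "no word character next to the name":
        System.out.println(found("(?<!\\w)C\\+\\+(?!\\w)", text)); // true
        System.out.println(found("(?<!\\w)C#(?!\\w)", text));      // true
        System.out.println(found("(?<!\\w)\\.NET(?!\\w)", text));  // true
    }
}
```

Unlike \b, the negative lookarounds do not care whether the first or last character of the name is itself a word character.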

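Note: I'm not sure what to do about mistakes in the text, such as someone not putting a space after the period at the end of a sentence. I allowed for it, but I'm not sure that is necessarily the right thing to do.

Anyway, in Java, if you are searching text for those weirdly named languages, you need to replace \b with explicit before-and-after whitespace and punctuation designators. For example: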

public static String grep(String regexp, String multiLineStringToSearch) {
    String result = "";
    String[] lines = multiLineStringToSearch.split("\\n");
    Pattern pattern = Pattern.compile(regexp);
    for (String line : lines) {
        Matcher matcher = pattern.matcher(line);
        if (matcher.find()) {
            result = result + "\n" + line;
        }
    }
    return result.trim();
}

Then in your test or main function:

    String beforeWord = "(\\s|\\.|\\,|\\!|\\?|\\(|\\)|\\'|\\\"|^)";   
    String afterWord =  "(\\s|\\.|\\,|\\!|\\?|\\(|\\)|\\'|\\\"|$)";
    String text = "Programming in C, (C++) C#, Java, and .NET.";
    System.out.println("text="+text);
    // Here is where Java word boundaries do not work correctly on "cutesy" computer language names.  
    System.out.println("Bad word boundary can't find because of Java: grep with word boundary for .NET="+ grep("\\b\\.NET\\b", text));
    System.out.println("Should find: grep exactly for .NET="+ grep(beforeWord+"\\.NET"+afterWord, text));
    System.out.println("Bad word boundary can't find because of Java: grep with word boundary for C#="+ grep("\\bC#\\b", text));
    System.out.println("Should find: grep exactly for C#="+ grep("C#"+afterWord, text));
    System.out.println("Bad word boundary can't find because of Java:grep with word boundary for C++="+ grep("\\bC\\+\\+\\b", text));
    System.out.println("Should find: grep exactly for C++="+ grep(beforeWord+"C\\+\\+"+afterWord, text));

    System.out.println("Should find: grep with word boundary for Java="+ grep("\\bJava\\b", text));
    System.out.println("Should find: grep for case-insensitive java="+ grep("(?i)\\bjava\\b", text));
    System.out.println("Should find: grep with word boundary for C="+ grep("\\bC\\b", text));  // Works Ok for this example, but see below
    // Because of the stupid too-short cutesy name, searches find stuff it shouldn't.
    text = "Worked on C&O (Chesapeake and Ohio) Canal when I was younger; more recently developed in Lisp.";
    System.out.println("text="+text);
    System.out.println("Bad word boundary because of C name: grep with word boundary for C="+ grep("\\bC\\b", text));
    System.out.println("Should be blank: grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));
    // Make sure the first and last cases work OK.

    text = "C is a language that should have been named differently.";
    System.out.println("text="+text);
    System.out.println("grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));

    text = "One language that should have been named differently is C";
    System.out.println("text="+text);
    System.out.println("grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));

    //Make sure we don't get false positives
    text = "The letter 'c' can be hard as in Cat, or soft as in Cindy. Computer languages should not require disambiguation (e.g. Ruby, Python vs. Fortran, Hadoop)";
    System.out.println("text="+text);
    System.out.println("Should be blank: grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));

PS My thanks to http://regexpal.com/, without whom the regex world would be very miserable!

score 4 · Accepted Answer

Check out the documentation on boundary conditions:

http://java.sun.com/docs/books/tutorial/essential/regex/bounds.html

Check out this sample:

public static void main(final String[] args)
    {
        String x = "I found the value -12 in my string.";
        System.err.println(Arrays.toString(x.split("\\b-?\\d+\\b")));
    }

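When you print it out, notice that the output is this:

[I found the value -,  in my string.]

This means that the "-" character is not picked up as being on the boundary of a word, because it is not considered a word character. Looks like @brianary kind of beat me to the punch, so he gets an up-vote.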

score 3 · Accepted Answer

Reference: Mastering Regular Expressions (Jeffrey E.F. Friedl) - O'Reilly

\b is equivalent to (?<!\w)(?=\w)|(?<=\w)(?!\w)
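A quick way to convince yourself of the equivalence is to compare the positions both patterns match. This sketch assumes ASCII input (Unicode handling of \b and \w has differed between Java versions), and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundEquivalence {
    // Collects the start index of every (zero-width) match of regex in input.
    static List<Integer> positions(String regex, String input) {
        List<Integer> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "I found the value -12 in my string.";
        List<Integer> viaBoundary = positions("\\b", input);
        List<Integer> viaLookaround = positions("(?<!\\w)(?=\\w)|(?<=\\w)(?!\\w)", input);
        // For ASCII input the two lists are expected to be identical.
        System.out.println(viaBoundary.equals(viaLookaround));
    }
}
```

The first alternative describes the start of a word, the second the end of a word.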

score 2 · Accepted Answer

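The word boundary \b is used where one character should be a word character and the other should be a non-word character. The regular expression for a negative number should be

--?\b\d+\b

See the working demo.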

score 1 · Accepted Answer

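I believe your problem is due to the fact that - is not a word character. The word boundary will therefore match after the -, and so will not capture it. Word boundaries match before the first and after the last word character in a string, as well as at any position that has a word character on one side and a non-word character on the other. Also note that a word boundary is a zero-width match.

One possible alternative is

(?:(?:^|\s)-?)\d+\b

This will match any number that starts with a whitespace character and an optional dash and ends at a word boundary. It will also match a number that starts at the beginning of the string.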
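A rough sketch of how that pattern could be used from Java to pull integers out of a space-separated string. The class and method names are hypothetical, and it assumes every matched number fits in an int.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SpaceSeparatedIntegers {
    private static final Pattern NUMBER = Pattern.compile("(?:(?:^|\\s)-?)\\d+\\b");

    // Extracts every integer that starts the string or follows whitespace.
    static List<Integer> extract(String input) {
        List<Integer> numbers = new ArrayList<>();
        Matcher m = NUMBER.matcher(input);
        while (m.find()) {
            // The match can include the leading whitespace, so trim before parsing.
            numbers.add(Integer.parseInt(m.group().trim()));
        }
        return numbers;
    }

    public static void main(String[] args) {
        // "abc34" is skipped because its digits are not preceded by whitespace.
        System.out.println(extract("12 -12 abc34 5")); // [12, -12, 5]
    }
}
```

Because the whitespace is consumed as part of the match, the sign stays attached to the number instead of being cut off by a leading \b.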

score 0 · Accepted Answer

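When you use \\b(\\w+)+\\b, it means an exact match with a word that contains only word characters ([a-zA-Z0-9_])

In your case, for example, putting \\b at the beginning of the regex will accept -12 (with a space), but again it won't accept -12 (without a space)

For reference to support my words: https://docs.oracle.com/javase/tutorial/essential/regex/bounds.html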

score -1 · Accepted Answer

I think it's the boundary (i.e. the character following) of the last match, or the beginning or end of the string.

Answered on 2009-08-24T20:55:23.930
