java - Not getting the * quantifier right in regex?

Question

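I am new to regex and I am working through the section on regex quantifiers. I have a question about the * quantifier. Here are the definitions I was given:

X* - zero or more occurrences of the letter X
.* - any character sequence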

Based on the definitions above, I wrote a small program:

public static void testQuantifier() {
    String testStr = "axbx";
    System.out.println(testStr.replaceAll("x*", "M"));
    //my expected output is MMMM but actual output is MaMMbMM
    /*
    Logic behind my expected output is:
    1. it encounters a which means 0 x is found. It should replace a with M.
    2. it encounters x which means 1 x is found. It should replace x with M.
    3. it encounters b which means 0 x is found. It should replace b with M.
    4. it encounters x which means 1 x is found. It should replace x with M.
    so output should be MMMM but why it is MaMMbMM?
    */

    System.out.println(testStr.replaceAll(".*", "M"));
    //my expected output is M but actual output is MM

    /*
    Logic behind my expected output is:
    It encounters axbx, which is any character sequence, it should 
    replace complete sequence with M.
    So output should be M but why it is MM?
    */
}

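Update:

With my revised understanding, I would expect the output to be MaMMbM, not MaMMbMM. So I don't understand why I get an extra M at the end.

My revised understanding of the first regex is: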

1. it encounters a which means 0 x is found. It should replace a with Ma.
2. it encounters x which means 1 x is found. It should replace x with M.
3. it encounters b which means 0 x is found. It should replace b with Mb.
4. it encounters x which means 1 x is found. It should replace x with M.
5. Lastly it encounters end of string at index 4. So it replaces 0x at end of String with M.

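(Although I find it strange that the index at the end of the string is considered.)

So the first part is clear now.

It would also be helpful if someone could clarify the second regex.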

score 6 · Accepted Answer

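This is where you are going wrong:

First it encounters a, which means 0 x is found. So it should replace a with M.

No. It means that 0 xs are found, and then an a is found. You haven't said that a should be replaced with M. You've said that any number of xs (including 0) should be replaced with M.

If you want every character to be replaced with M, you should just use .: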

System.out.println(testStr.replaceAll(".", "M"));

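(Personally I would have expected a result of MaMbM. I'm looking into why you get MaMMbMM; I suspect it's because there is a sequence of zero xs between the x and the b, but it still seems a bit odd to me.)

EDIT: It becomes clearer if you look at where your pattern matches. Here is code to show that: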

Pattern pattern = Pattern.compile("x*");
Matcher matcher = pattern.matcher("axbx");
while (matcher.find()) {
    System.out.println(matcher.start() + "-" + matcher.end());
}

Results (bear in mind that the end index is exclusive), with a bit of explanation:

0-0 (index 0 = 'a', doesn't match)
1-2 (index 1 = 'x', matches)
2-2 (index 2 = 'b', doesn't match)
3-4 (index 3 = 'x', matches)
4-4 (index 4 is the end of the string)

If you replace each of those matches with "M", you end up with the output you are actually getting.
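The same loop also helps with the second regex from the question. Here is a minimal sketch; the class name MatchSpans and the helper spans are made up for illustration. It suggests that .* first matches the whole string and then matches one more time, as an empty string at the very end, which is where the second M comes from:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchSpans {
    // Hypothetical helper: lists every match of regex in input
    // as "start-end", where end is exclusive.
    static List<String> spans(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            result.add(matcher.start() + "-" + matcher.end());
        }
        return result;
    }

    public static void main(String[] args) {
        // .* greedily consumes "axbx" (0-4), then matches the empty
        // string at the end of the input (4-4).
        System.out.println(spans(".*", "axbx"));           // [0-4, 4-4]
        System.out.println("axbx".replaceAll(".*", "M"));  // MM
    }
}
```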

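I think the fundamental problem is that if you have a pattern which can match the empty string (in its entirety), you could argue that the pattern occurs an infinite number of times between any two characters of the input. I would probably try to avoid such patterns wherever possible, and make sure that any match has to include at least one character.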
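As a rough illustration of that advice, one option is to swap * for +, so that every match has to consume at least one character. This is only a sketch (the class name NonEmptyQuantifier is invented here), and whether + is the right quantifier depends on what you actually want to replace:

```java
class NonEmptyQuantifier {
    public static void main(String[] args) {
        String testStr = "axbx";

        // x+ needs at least one x, so no empty matches are replaced.
        System.out.println(testStr.replaceAll("x+", "M")); // aMbM

        // .+ needs at least one character, so the whole string is one match.
        System.out.println(testStr.replaceAll(".+", "M")); // M

        // A single . replaces each character individually.
        System.out.println(testStr.replaceAll(".", "M"));  // MMMM
    }
}
```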

score 2 · Accepted Answer

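a and b are not replaced because they are not matched by your regex. What gets replaced are the xes and the empty strings before a non-matching letter or before the end of the string.

Let's see what happens:

We are at the start of the string. The regex engine tries to match an x but fails, because there is an a here.
The regex engine backtracks, because x* also allows zero repetitions of x. We have a match and replace it with M.
The regex engine advances past the a and successfully matches x. Replace with M.
The regex engine now tries to match x at the current position (after the previous match), which is right before b. It can't.
But it can backtrack again, matching zero xs here. Replace with M.
The regex engine advances past the b and successfully matches x. Replace with M.
The regex engine now tries to match x at the current position (after the previous match), which is at the end of the string. It can't.
But it can backtrack again, matching zero xs here. Replace with M.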

By the way, this is implementation-dependent. In Python (before version 3.7), for example, it is

>>> re.sub("x*", "M", "axbx")
'MaMbM'

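because there, an empty match of the pattern is replaced only when it is not adjacent to a previous match. Python 3.7 changed this: empty matches adjacent to a previous non-empty match are now replaced as well, so on current Python versions the same call returns 'MaMMbMM', just like Java.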
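If you want that rule in Java (replace an empty match only when it is not adjacent to the previous match), you can imitate it by driving the Matcher yourself. The following is a sketch, not a library feature: the class LegacyStyleSub and the method subSkippingAdjacentEmpty are hypothetical names, and the replacement is treated as a literal string (no $1 group references):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LegacyStyleSub {
    // Hypothetical helper: like replaceAll, but skips an empty match that
    // starts exactly where the previous match ended.
    static String subSkippingAdjacentEmpty(String regex, String input, String replacement) {
        Matcher m = Pattern.compile(regex).matcher(input);
        StringBuilder out = new StringBuilder();
        int copiedUpTo = 0; // input before this index is already in out
        int lastEnd = -1;   // end of the previous match, -1 if there was none
        while (m.find()) {
            boolean emptyAndAdjacent = m.start() == m.end() && m.start() == lastEnd;
            lastEnd = m.end();
            if (emptyAndAdjacent) {
                continue; // ignore the empty match right after a previous match
            }
            out.append(input, copiedUpTo, m.start()); // unmatched text before the match
            out.append(replacement);                  // literal replacement text
            copiedUpTo = m.end();
        }
        out.append(input, copiedUpTo, input.length()); // remaining tail
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(subSkippingAdjacentEmpty("x*", "axbx", "M")); // MaMbM
        System.out.println(subSkippingAdjacentEmpty(".*", "axbx", "M")); // M
    }
}
```

With this rule the second regex from the question gives the single M the asker expected, because the empty match at the end of the string sits directly after the match that consumed the whole input.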
