java - X? quantifier: why does a non-x give a "zero-length" match?

Question

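The quantifier x? means a single or no occurrence of x.

For convenience, I have posted the test harness I use to match a regex against a string.

I am confused by the result of matching the regex a? against the string ababaaaab.

The output of the program is: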

Enter your regex: a?

Enter your input string to seacrh: ababaaaab

I found the text "a" starting at index 0 and ending at index 1.
I found the text "" starting at index 1 and ending at index 1. 
I found the text "a" starting at index 2 and ending at index 3.
I found the text "" starting at index 3 and ending at index 3.
I found the text "a" starting at index 4 and ending at index 5.
I found the text "a" starting at index 5 and ending at index 6.
I found the text "a" starting at index 6 and ending at index 7.
I found the text "a" starting at index 7 and ending at index 8.
I found the text "" starting at index 8 and ending at index 8.
I found the text "" starting at index 9 and ending at index 9.

Enter your regex:

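I am confused by the matches reported at the b's.

“The regular expression a? is not specifically looking for the letter "b"; it's merely looking for the presence (or lack thereof) of the letter "a". If the quantifier allows for a match of "a" zero times, anything in the input string that's not an "a" will show up as a zero-length match.”

Reference

Question:

The first line is understandable, and I do understand that the presence of b, or of any non-a, is an absence of a (zero occurrences of a), so it should result in a match. But there is no a between index 1 and 2 (i.e. the occurrence of b). So why is the text "" matched between index 1 and 1 (in other words, why do we get a zero-length match here)? By my reasoning, the match should be between index 1 and 2.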

import java.io.InputStreamReader;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/*
 *  Enter your regex: foo
 *  Enter input string to search: foo
 *  I found the text foo starting at index 0 and ending at index 3.
 * */

public class RegexTestHarness {

    public static void main(String[] args){

        /*Console console = System.console();
        if (console == null) {
            System.err.println("No console.");
            System.exit(1);
        }*/

        // Create the Scanner once; re-creating it on every iteration can drop buffered input.
        Scanner scanner = new Scanner(new InputStreamReader(System.in));

        while (true) {

            /*Pattern pattern = 
            Pattern.compile(console.readLine("%nEnter your regex: ", null));*/

            System.out.print("\nEnter your regex: ");

            Pattern pattern = Pattern.compile(scanner.next());

            System.out.print("\nEnter your input string to seacrh: ");

            Matcher matcher = 
            pattern.matcher(scanner.next());

            boolean found = false;
            while (matcher.find()) {
                /*console.format("I found the text" +
                    " \"%s\" starting at " +
                    "index %d and ending at index %d.%n",
                    matcher.group(),
                    matcher.start(),
                    matcher.end());*/

                System.out.println("I found the text \"" + matcher.group() + "\" starting at index " + matcher.start() + " and ending at index " + matcher.end() + "."); 

                found = true;
            }
            if(!found){
                //console.format("No match found.%n", null);
                System.out.println("No match found."); 
            }
        }
    }
}

score 1 · Accepted Answer

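“But there is no a between index 1 and 2 (i.e. the occurrence of b). So why is the text "" matched between index 1 and 1 (in other words, why do we get a zero-length match here)”

The length of a match is the length of the part of the input string that matched the pattern.

Since there is no "a" there, only an empty string was matched.

Again, the pattern does not match "a sequence of non-a characters"; it matches a (possibly empty) sequence of "a"s of total length at most 1. In this case, the matched sequence is empty.

“But there is no a (i.e. the occurrence of b)”

The absence of a is not the occurrence of b. The absence of a occurs before the occurrence of b, and ends where the occurrence of b begins.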
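
To make this concrete, here is a minimal sketch. The class name EmptyMatchSlices and the helper describeMatches are made up for illustration and are not part of the question's harness. For every match it prints the slice of the input between start() and end(); for a zero-length match the two indices are equal, so the slice is the empty string sitting just in front of the b, not the b itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class EmptyMatchSlices {

    // Returns one line per match in the form: [start,end) -> "matched text"
    static List<String> describeMatches(String regex, String input) {
        List<String> lines = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            // The matched text is exactly the input slice between start() and end().
            String slice = input.substring(m.start(), m.end());
            lines.add("[" + m.start() + "," + m.end() + ") -> \"" + slice + "\"");
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : describeMatches("a?", "ababaaaab")) {
            System.out.println(line);
        }
    }
}
```

The second line printed is `[1,1) -> ""`: the match covers zero characters, so it starts and ends at position 1 and never includes the b at index 1.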

score 1 · Accepted Answer

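The reported positions are not character positions

The key thing to understand is that the regex engine does not give you the position of the character at which a match was found.

It gives you the position at which a successful match starts. That position is not a character. It is the gap between characters. For example,

Position 0 is the start of the string. This is where the \A or ^ assertion matches.
Position 1 is the position between the first and second characters.
Position 9 is the position after the last b at the end of ababaaaab. This is where the \Z or $ assertion matches.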
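
The positions listed above can be checked with a short sketch. The class BoundaryPositions and its helper startPositions are hypothetical names chosen for this example. It collects matcher.start() for three zero-width patterns: ^, $, and the empty regex, which matches at every position.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class BoundaryPositions {

    // Collects the start position of every match of regex in input.
    static List<Integer> startPositions(String regex, String input) {
        List<Integer> positions = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            positions.add(m.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        String input = "ababaaaab"; // 9 characters, so 10 positions: 0 through 9
        System.out.println("^  matches at " + startPositions("^", input));
        System.out.println("$  matches at " + startPositions("$", input));
        System.out.println("\"\" matches at " + startPositions("", input));
    }
}
```

A 9-character string has 10 positions, which is why the empty regex reports 10 matches and why the output in the question ends with a match at index 9.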

score 0 · Accepted Answer

a? is greedy. In other words, the regex engine will process it roughly as follows:

foreach index
    if next char is "a"
        return "a"
    else
        return ""    (zero occurrences of "a": an empty match at this index)
    end if
end foreach

If you apply this algorithm to your input string, you will get the same output as the one you posted.
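
As a rough check, the loop can be written out by hand and compared with what Matcher actually reports. This is only a sketch hard-coded for the regex a?, not how the real engine is implemented; the class GreedyOptionalSketch and the methods simulate and actual are names invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, hard-coded for the regex a? only.
class GreedyOptionalSketch {

    // Hand-written imitation of the greedy loop for a?.
    static List<String> simulate(String input) {
        List<String> found = new ArrayList<>();
        // Positions run from 0 to input.length() inclusive: the end of the string is a position too.
        for (int i = 0; i <= input.length(); i++) {
            boolean nextIsA = i < input.length() && input.charAt(i) == 'a';
            // Greedy: take the "a" if it is there, otherwise match nothing at this position.
            int end = nextIsA ? i + 1 : i;
            found.add(i + "-" + end + ":" + input.substring(i, end));
        }
        return found;
    }

    // What java.util.regex reports for the same regex, in the same format.
    static List<String> actual(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile("a?").matcher(input);
        while (m.find()) {
            found.add(m.start() + "-" + m.end() + ":" + m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "ababaaaab";
        System.out.println(simulate(input));
        System.out.println("same as Matcher: " + simulate(input).equals(actual(input)));
    }
}
```

Stepping one position per iteration is enough here because a match of a? is at most one character long; a general engine resumes at the end of the previous match and steps forward by one only after an empty match.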

You can try its non-greedy (or lazy) equivalent: a??. The regex engine will then process it as follows:

foreach index
    if "" matches here    (it always does)
        return ""
    else if next char is "a"    (never reached)
        return "a"
    end if
end foreach

So an empty string will be found at every index, and no a will be output at all.
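
A quick way to see the difference is to count the matches. The following is a minimal sketch; the class LazyOptionalDemo and its two counting helpers are hypothetical names used only for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class LazyOptionalDemo {

    // Total number of matches that find() reports.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Number of matches that consumed at least one character.
    static int countNonEmptyMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            if (m.end() > m.start()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String input = "ababaaaab";
        for (String regex : new String[] {"a?", "a??"}) {
            System.out.println(regex + ": " + countMatches(regex, input) + " matches, "
                    + countNonEmptyMatches(regex, input) + " non-empty");
        }
    }
}
```

Both patterns report 10 matches on ababaaaab, but the greedy a? consumes an a in 6 of them, while the lazy a?? consumes none.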
