java - 查找所有匹配的子字符串，而不仅仅是“最扩展”的子字符串

Question

编码

String s = "y z a a a b c c z";
Pattern p = Pattern.compile("(a )+(b )+(c *)c");
Matcher m = p.matcher(s);
while (m.find()) {
    System.out.println(m.group());
}

印刷

a a a b c c

哪个是对的。

但从逻辑上讲，子字符串

a a a b c
a a b c c
a a b c
a b c c
a b c

也匹配正则表达式。

那么，我怎样才能让代码也找到那些子字符串，即不仅是最扩展的子字符串，还有它的子字符串？

score 7 · Accepted Answer

您可以使用不情愿的限定词，例如*?and +?。这些匹配尽可能少，与标准相反*，+它们是贪婪的，即尽可能匹配。尽管如此，这只允许您找到特定的“子匹配”，而不是全部。使用前瞻控制非捕获组可以实现更多控制，也在文档中进行了描述。但是为了真正找到所有子匹配项，您可能必须自己做一些事情，即构建正则表达式对应的自动机并使用自定义代码对其进行导航。

score 2 · Accepted Answer

您将需要一个惰性量词。

请尝试以下方法：

Pattern p = Pattern.compile("(a )+(b )+((c )*?)c");

另请注意，我c再次对“”进行了分组，因为我认为这就是您想要的。否则你会发现任意多个空格，而不是“ c”。

score 0 · Accepted Answer

我在这里能想到的唯一方法是生成原始字符串的所有可能子字符串的列表，并将正则表达式与它们中的每一个进行匹配，保留匹配的那些项目。

score 0 · Accepted Answer

我不知道任何可以返回所有有效匹配的正则表达式引擎。

但是我们可以应用一些逻辑来生成所有候选字符串并将其呈现给正则表达式。

通过枚举给定输入的所有可能子字符串来构造候选。

var str = "y z a a a b c c z y z a a a b c c z";
var regex = new Regex("(a )+(b )+(c *)c");

var length = str.Length;

for (int start = 1; start <= length;start++){

    for (int groupLength = 1;  start + groupLength - 1 <= length ;groupLength++){

        var candidate = str.Substring(start-1,groupLength); //.Dump();

        //("\"" + candidate + "\"").Dump();

        var match = regex.Match(candidate);

        if (match.Value == candidate )
        {
            candidate.Dump();
        }

    }
}

这给

a a a b c c 
a a b c c 
a b c c

这似乎是正确的答案，但与您的结果相矛盾：

a a a b c => I state that this is not a match
a a b c c ok
a a b c => I state that this is not a match
a b c c ok
a b c => I state that this is not a match

例如，您提供的正则表达式

(a )+(b )+(c *)c

与结果中的第一个条目不匹配

a a a b c

如果您认为起始位置不重要，上述逻辑可以生成相同的匹配。例如，如果您只是再次重复给定的输入：

"y z a a a b c c z y z a a a b c c z"

它会给：

a a a b c c
a a b c c
a b c c
a a a b c c
a a b c c
a b c c

如果您认为位置不重要，您应该对此结果进行区分

如果认为可能匹配，则应添加输入为空字符串的琐碎情况。

仅供参考，这是正则表达式检查的所有候选人

"y"
"y "
"y z"
"y z "
"y z a"
"y z a "
"y z a a"
"y z a a "
"y z a a a"
"y z a a a "
"y z a a a b"
"y z a a a b "
"y z a a a b c"
"y z a a a b c "
"y z a a a b c c"
"y z a a a b c c "
"y z a a a b c c z"
" "
" z"
" z "
" z a"
" z a "
" z a a"
" z a a "
" z a a a"
" z a a a "
" z a a a b"
" z a a a b "
" z a a a b c"
" z a a a b c "
" z a a a b c c"
" z a a a b c c "
" z a a a b c c z"
"z"
"z "
"z a"
"z a "
"z a a"
"z a a "
"z a a a"
"z a a a "
"z a a a b"
"z a a a b "
"z a a a b c"
"z a a a b c "
"z a a a b c c"
"z a a a b c c "
"z a a a b c c z"
" "
" a"
" a "
" a a"
" a a "
" a a a"
" a a a "
" a a a b"
" a a a b "
" a a a b c"
" a a a b c "
" a a a b c c"
" a a a b c c "
" a a a b c c z"
"a"
"a "
"a a"
"a a "
"a a a"
"a a a "
"a a a b"
"a a a b "
"a a a b c"
"a a a b c "
"a a a b c c"
"a a a b c c "
"a a a b c c z"
" "
" a"
" a "
" a a"
" a a "
" a a b"
" a a b "
" a a b c"
" a a b c "
" a a b c c"
" a a b c c "
" a a b c c z"
"a"
"a "
"a a"
"a a "
"a a b"
"a a b "
"a a b c"
"a a b c "
"a a b c c"
"a a b c c "
"a a b c c z"
" "
" a"
" a "
" a b"
" a b "
" a b c"
" a b c "
" a b c c"
" a b c c "
" a b c c z"
"a"
"a "
"a b"
"a b "
"a b c"
"a b c "
"a b c c"
"a b c c "
"a b c c z"
" "
" b"
" b "
" b c"
" b c "
" b c c"
" b c c "
" b c c z"
"b"
"b "
"b c"
"b c "
"b c c"
"b c c "
"b c c z"
" "
" c"
" c "
" c c"
" c c "
" c c z"
"c"
"c "
"c c"
"c c "
"c c z"
" "
" c"
" c "
" c z"
"c"
"c "
"c z"
" "
" z"
"z"

也很高兴知道 2 种主要类型的正则表达式（NFA 和 DFA）如何工作

来自http://msdn.microsoft.com/en-us/library/e347654k.aspx

.NET（我认为也是 JAVA）是 NFA 正则表达式引擎（与 DFA 相对），当它处理特定的语言元素时，引擎使用贪婪匹配；也就是说，它尽可能多地匹配输入字符串。但它也会在成功匹配子表达式后保存其状态。如果匹配最终失败，引擎可以返回到保存状态，以便尝试其他匹配。这种放弃成功的子表达式匹配以使正则表达式中的后续语言元素也可以匹配的过程称为回溯。

score -1 · Accepted Answer

鉴于这些非常具体的限制（即这不是一般情况下的解决方案），这将起作用：

import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class test {

    public static void main(String[] args) {

        String s = "y z a a a b c c z";

        Pattern p = Pattern.compile("(a )+(b )+(c ?)+");
        Set<String> set = recurse(s, p, 0);
    }

    public static Set<String> recurse(String s, Pattern p, int depth) {
        int temp = depth;
        while(temp>0) {
            System.out.print("  ");
            temp--;
        }
        System.out.println("-> " +s);

        Matcher matcher = p.matcher(s);
        Set<String> set = new TreeSet<String>();

        if(matcher.find()) {
            String found = matcher.group().trim();
            set.add(found);
            set.addAll(recurse(found.substring(1), p, depth+1));
            set.addAll(recurse(found.substring(0, found.length()-1), p, depth+1));
        }

        while(depth>0) {
            System.out.print("  ");
            depth--;
        }
        System.out.println("<- " +s);
        return set;
    }
}

我有理由确定您可以对其进行调整以适用于其他情况，但是递归到匹配的字符串意味着重叠匹配（如@ahenderson 指出的那样）将不起作用。

java - 查找所有匹配的子字符串，而不仅仅是“最扩展”的子字符串

5 回答 5

Related

Reference