java - Finding out which group matched in a Java regex without a linear search?

Question

I have some huge, programmatically assembled regexes, like this:

(A)|(B)|(C)|...

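Each sub-pattern is in its own capturing group. When I get a match, how do I figure out which group matched without linearly testing each group(i) to see whether it returns a non-null string?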

score 4 · Accepted Answer

If your regex is programmatically generated, why not programmatically generate n separate regexes and test each of them in turn? Unless they share a common prefix and the Java regex engine is clever, all alternatives get tested anyway.

Update: I just looked through the Sun Java source, in particular, java.util.regex.Pattern$Branch.match(), and that does also simply do a linear search over all alternatives, trying each in turn. The other places where Branch is used do not suggest any kind of optimization of common prefixes.

score 1 · Accepted Answer

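You can use non-capturing groups. Instead of:

(A)|(B)|(C)|...

use:

((?:A)|(?:B)|(?:C))

The non-capturing groups (?:) are not included in the group count, but the result of whichever branch matched is captured in the outer () group.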
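To make the group-count difference concrete, here is a minimal sketch. The class name NonCapturingDemo, the helper groupCount, and the sub-patterns a+, b+, c+ are placeholders chosen for illustration, not part of the original answer. Note that with the wrapped form, group(1) holds the matched text but no longer tells you which alternative produced it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NonCapturingDemo {
    // Number of capturing groups declared by a regex (hypothetical helper).
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        // Placeholder sub-patterns standing in for A, B, C.
        String capturing = "(a+)|(b+)|(c+)";
        String wrapped = "((?:a+)|(?:b+)|(?:c+))";

        System.out.println(groupCount(capturing)); // 3
        System.out.println(groupCount(wrapped));   // 1

        Matcher m = Pattern.compile(wrapped).matcher("xxbbb");
        if (m.find()) {
            // The outer group holds the text matched by whichever branch succeeded.
            System.out.println(m.group(1)); // bbb
        }
    }
}
```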

score 0 · Accepted Answer

I don't think you can get around the linear search, but you can make it cheaper by using start(int) instead of group(int).

static int getMatchedGroupIndex(Matcher m)
{
  // Returns the number of the first group that took part in the match, or -1.
  for (int i = 1, n = m.groupCount(); i <= n; i++)
  {
    if (m.start(i) != -1)  // start(i) is -1 when group i did not match
    {
      return i;
    }
  }
  return -1;
}

This way, instead of generating a substring for each group, you only query an int value representing its start index.
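As a usage sketch of the start(int) approach, the following self-contained class runs the lookup over every match in an input. The class name GroupIndexDemo, the method name matchedGroupIndex, and the three sub-patterns are assumptions made for this example. The search is still linear in the number of groups.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupIndexDemo {
    // Returns the number of the first capturing group that participated in
    // the current match, or -1 if none did. No substrings are created.
    static int matchedGroupIndex(Matcher m) {
        for (int i = 1, n = m.groupCount(); i <= n; i++) {
            if (m.start(i) != -1) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Hypothetical pattern standing in for the generated (A)|(B)|(C)|...
        Pattern p = Pattern.compile("(\\d+)|([a-z]+)|(\\s+)");
        Matcher m = p.matcher("abc 123");
        while (m.find()) {
            System.out.println(matchedGroupIndex(m) + " -> \"" + m.group() + "\"");
        }
    }
}
```

For the input "abc 123" the three matches fall into groups 2, 3 and 1, in that order.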

score 0 · Accepted Answer

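Judging from the various comments, the simple answer seems to be "no", and using separate regexes is a better idea. To improve on that approach, you might need to work out common pattern prefixes as you generate them, or use your own regex (or other) pattern-matching engine. But before you go to all that effort, make sure this is a significant bottleneck in your system. In other words, benchmark it and see whether the performance is acceptable for realistic input data, and if it is not, profile to find out where the real bottlenecks are.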
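A minimal sketch of the separate-regex idea, assuming each sub-pattern should match the whole input: the patterns are compiled once and the index of the first one that matches is returned. The class name SeparatePatternsDemo, the method firstMatching, and the sample patterns are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class SeparatePatternsDemo {
    private final List<Pattern> patterns = new ArrayList<>();

    // Compile every sub-pattern once, up front.
    SeparatePatternsDemo(String... regexes) {
        for (String regex : regexes) {
            patterns.add(Pattern.compile(regex));
        }
    }

    // Index of the first pattern that matches the whole input, or -1.
    int firstMatching(CharSequence input) {
        for (int i = 0; i < patterns.size(); i++) {
            if (patterns.get(i).matcher(input).matches()) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        SeparatePatternsDemo demo = new SeparatePatternsDemo("\\d+", "[a-z]+", "\\s+");
        System.out.println(demo.firstMatching("hello")); // 1
        System.out.println(demo.firstMatching("42"));    // 0
        System.out.println(demo.firstMatching("?"));     // -1
    }
}
```

The loop index doubles as the answer to "which alternative matched", so no group inspection is needed.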

score 0 · Accepted Answer

Break up your regex into three:

String[] regexes = new String[] { "pattern1", "pattern2", "pattern3" };

for(int i = 0; i < regexes.length; i++) {
  Pattern pattern = Pattern.compile(regexes[i]);

  Matcher matcher = pattern.matcher(inputStr);
  if(matcher.matches()) {
     //process, optionally break out of loop
  }
}

public int getMatchedGroupIndex(Matcher matcher) { 
  int index = -1;

  // Group 0 is the whole match, so capturing groups run from 1 to groupCount() inclusive.
  for(int i = 1; i <= matcher.groupCount(); i++) {
    if(matcher.group(i) != null && matcher.group(i).trim().length() > 0) {
      index = i;
      break;
    }
  }

  return index;
}

The alternative is:

// Group 0 is the whole match; capturing groups are numbered 1..groupCount().
for(int i = 1; i <= matcher.groupCount(); i++) {
  if(matcher.group(i) != null && matcher.group(i).trim().length() > 0) {
     //process, optionally break out of loop
  }
}
