regex - 为不使用惰性重复量词编写更好的正则表达式

Question

我有一个正则表达式：

(<select([^>]*>))(.*?)(</select\s*>)

由于它使用惰性重复量词，因此对于较长的字符串（具有超过 500 个选项），它会回溯超过 100,000 次并失败。请帮我找到一个更好的不使用惰性重复量词的正则表达式

score 2 · Accepted Answer

<select[^>]*>[^<]*(?:<(?!/select>)[^<]*)*</select>

...或以人类可读的形式：

<select[^>]*>    # start tag
[^<]*            # anything except opening bracket
(?:              # if you find an open bracket
  <(?!/select>)  #   match it if it's not part of end tag
  [^<]*          #   consume any more non-brackets
)*               # repeat as needed
</select>        # end tag

这是 Friedl 在他的《掌握正则表达式》一书中开发的“展开循环”技术的一个示例。我使用基于不情愿量词的模式在 RegexBuddy 中进行了快速测试：

(?s)<select[^>]*>.*?</select>

...找到匹配项大约需要 6,000 步。展开循环模式只用了 500 步。当我从结束标记 ( </select) 中删除右括号时，使匹配变得不可能，只需要 800 步即可报告失败。

如果您的正则表达式支持所有格量词，请继续使用它们：

<select[^>]*+>[^<]*+(?:<(?!/select>)[^<]*+)*+</select>

实现匹配需要大约相同数量的步骤，但在此过程中可以使用更少的内存。如果无法匹配，它会更快地失败；在我的测试中，它花了大约 500 步，与找到匹配所用的步数相同。

score 1 · Accepted Answer

不幸的是，这不起作用，请参阅 Alan Moore 的答案以获取正确的示例！

(<select([^>]*>))(.*+)(</select\s*>)

从 perl 正则表达式手册页：

默认情况下，当量化的子模式不允许整个模式的其余部分匹配时，Perl 将回溯。但是，这种行为有时是不可取的。因此 Perl 也提供了“占有”量词形式。

       *+     Match 0 or more times and give nothing back
       ++     Match 1 or more times and give nothing back
       ?+     Match 0 or 1 time and give nothing back
       {n}+   Match exactly n times and give nothing back (redundant)
       {n,}+  Match at least n times and give nothing back
       {n,m}+ Match at least n but not more than m times and give nothing back

例如，

      'aaaa' =~ /a++a/

永远不会匹配，因为“a++”将吞噬字符串中的所有“a”，并且不会为模式的其余部分留下任何内容。这个特性对于给 perl 提示它不应该回溯的地方非常有用。例如，典型的“匹配双引号字符串”问题在写成时可以最有效地执行：

      /"(?:[^"\\]++|\\.)*+"/

regex - 为不使用惰性重复量词编写更好的正则表达式

2 回答 2

Related

Reference