c# - .Net 简单正则表达式二次复杂度

Question

在构建一个简单的正则表达式时，我发现它在输入大小增加时具有非常奇怪的性能配置文件。

这是另一个具有类似行为的非常基本的正则表达式：

a+b

我用一个简单的基准对其进行了分析：

Regex regex = new Regex("a+b", RegexOptions.Compiled);

const int maxInputSize = 100;
const int n = 1000;

string input = "";
Stopwatch stopwatch = new Stopwatch();
for (int inputSize = 1; inputSize <= maxInputSize; ++inputSize)
{
    input += 'a';

    stopwatch.Restart();
    for (int i = 0; i < n; ++i)
    {
        regex.Match(input);
    }
    stopwatch.Stop();

    Console.WriteLine(stopwatch.Elapsed.Ticks);
}

它在字符串“a”、“aa”、“aaa”……上运行正则表达式，并测量每个字符串长度进行 n 次匹配所花费的时间。

我知道回溯问题（例如，如果正则表达式类似于(a+a+)+b），但在这种情况下，即使考虑回溯，我也期望线性复杂性。

例如，如果我们想匹配 n 次 'a'，这是我天真地期望的工作流程：

take first 'a'
take second 'a'
...
take last 'a'
ooops nothing more to take => backtracking
release one 'a' and try to match 'b', nothing => backtracking
...
release second 'a' and retry to match 'b', nothing => backtracking
release first 'a'
ooops we're back at the beginning => no match

所以它应该执行类似2n的操作。

（该文档似乎证实了复杂性应该是线性的：http: //msdn.microsoft.com/en-us/library/dsy130b4.aspx）

但相反，我观察到二次复杂度：

在此处输入图像描述

所以我的问题是：

1）我对线性复杂性的期望是不合理的吗？
2）如果是，我对正则表达式匹配有什么遗漏？
3）如果不是，我的基准测试有缺陷吗？为什么？
4）如果我的期望和基准是正确的，可能是什么问题？

提前感谢您的任何意见。

score 5 · Accepted Answer

该Regex.Match函数搜索子字符串匹配：引擎尝试匹配从字符串的任何索引开始的表达式，给出 O(n²) 算法。您可以通过将正则表达式锚定到字符串的开头来实现线性性能：

Regex regex = new Regex("^a+b$", RegexOptions.Compiled);

c# - .Net 简单正则表达式二次复杂度

1 回答 1

Related

Reference