c# - Regex performance degrades over time

Question

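I wrote a simple test application to check whether regular expressions would work for my needs. I need to find all duplicated tags in a supplied text file and replace each of them with a unique string. For example, if some text occurs more than once in the input file, all of its occurrences should be replaced with {1}, and so on.

To do this, I wrote the following snippet: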

    static void Main(string[] args)
    {
        StringBuilder xml = new StringBuilder(File.ReadAllText(@"C:\Integration\Item-26 - Copy.xml"));

        Regex r = new Regex(
            @"(?<exp>\<(?<tag>[^\<\>\s]+)[^\<\>]*\>[^\<\>]+\<\/\k<tag>\>).*\k<exp>", 
            RegexOptions.Singleline | RegexOptions.Compiled | RegexOptions.CultureInvariant);

        List<string> values = new List<string>();

        MatchCollection matches = r.Matches(xml.ToString());

        Stopwatch stopwatch = new Stopwatch();
        stopwatch.Start();

        while (matches.Count > 0)
        {
            foreach (Match m in matches)
            {
                string matchValue = m.Groups["exp"].Value;
                values.Add(matchValue);
                xml.Replace(matchValue, "{" + (values.Count - 1) + "}");
            }

            Console.WriteLine("Analyzed " + matches.Count + " matches, total replacements = " + values.Count);

            matches = r.Matches(xml.ToString());
        }

        stopwatch.Stop();

        Console.WriteLine("=============== " + stopwatch.Elapsed.TotalSeconds);
        Console.ReadLine();
    }

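The problem is that with a large input file (> 1 MB), each call to find the matches takes longer than the previous one. At first, the call to matches.Count takes 0.3 seconds. After 100 iterations, it takes more than 1 minute.

I have checked the memory usage of the test application: it consumes almost nothing and shows no real growth.

What is causing this, and how can I get stable performance? Thanks in advance.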

score 1 · Accepted Answer

Here is what I think the problem is. Your regular expression is:

    @"(?<exp>\<(?<tag>[^\<\>\s]+)[^\<\>]*\>[^\<\>]+\<\/\k<tag>\>).*\k<exp>"

So you are looking for something like:

    <tag>stuff</tag>lots of stuff here<tag>stuff</tag>

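During the first iterations, the regular expression fails quickly, because the inner tags are being replaced and the tags are close together. But as more inner tags get replaced, the space between the remaining tags grows. Soon you have: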

    <tag>stuff</tag>hundreds of kilobytes<tag2>other stuff</tag2><tag>stuff</tag>

And backtracking starts to kill you.
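One rough way to observe this is to time the same pattern on inputs where no element repeats, so that every candidate element forces `.*` to run to the end of the input and back off one character at a time. The sketch below is only an illustration: the synthetic input, the element counts, and the class name are assumptions, not the asker's data, and the timings will vary by machine and runtime.

```csharp
using System;
using System.Diagnostics;
using System.Text;
using System.Text.RegularExpressions;

class BacktrackingTimingSketch
{
    static void Main()
    {
        // Pattern copied from the question; RegexOptions.Compiled is dropped
        // here only to keep startup cheap for a quick experiment.
        var regex = new Regex(
            @"(?<exp>\<(?<tag>[^\<\>\s]+)[^\<\>]*\>[^\<\>]+\<\/\k<tag>\>).*\k<exp>",
            RegexOptions.Singleline | RegexOptions.CultureInvariant);

        foreach (int elementCount in new[] { 250, 500, 1000 })
        {
            // Synthetic input (an assumption, not the asker's file):
            // every element is unique, so no candidate can ever match.
            var text = new StringBuilder();
            for (int i = 0; i < elementCount; i++)
            {
                text.Append("<t").Append(i).Append(">v").Append(i)
                    .Append("</t").Append(i).Append(">");
            }

            var stopwatch = Stopwatch.StartNew();
            // Reading Count forces the lazy MatchCollection to scan the whole input.
            int matchCount = regex.Matches(text.ToString()).Count;
            stopwatch.Stop();

            Console.WriteLine(
                $"{elementCount} unique elements, {text.Length} chars: " +
                $"{matchCount} matches in {stopwatch.Elapsed.TotalMilliseconds:F0} ms");
        }
    }
}
```

If the elapsed time grows much faster than the input length, that is consistent with each failed candidate rescanning the rest of the text.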

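I suspect you can fix this by replacing the .* (or the .*? I suggested earlier) with [^\<]*, because you know that once you reach a <, you have either found a match or hit a definite failure.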
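As a minimal sketch of that suggestion, the pattern below swaps `.*` for `[^\<]*` and tries it on two made-up inputs (the strings and the class name are hypothetical). Note one consequence of the change: the scan now stops at the next `<`, so a repeat is only found when no other tag sits between the two occurrences. Whether that is acceptable depends on the input, which is an assumption here.

```csharp
using System;
using System.Text.RegularExpressions;

class NegatedClassSketch
{
    static void Main()
    {
        // Same pattern as in the question, except ".*" is replaced by "[^\<]*",
        // so the scan after the first element stops at the next '<'.
        var regex = new Regex(
            @"(?<exp>\<(?<tag>[^\<\>\s]+)[^\<\>]*\>[^\<\>]+\<\/\k<tag>\>)[^\<]*\k<exp>",
            RegexOptions.Singleline | RegexOptions.CultureInvariant);

        // Hypothetical inputs for illustration only.
        string adjacent = "<a>x</a>  <a>x</a>";          // repeat with no tag in between
        string separated = "<a>x</a><b>y</b><a>x</a>";   // repeat with another element in between

        Console.WriteLine(regex.IsMatch(adjacent));   // True
        Console.WriteLine(regex.IsMatch(separated));  // False
    }
}
```

The second result shows the trade-off: the failure is detected immediately, but repeats separated by other elements are no longer reported by this pattern.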
