c# - Improving the performance of a regular expression on a large string

Question

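I am currently using a regular expression in my code to pull a large string out of a rich text document. The regex finds any embedded images and parses them into a byte array that I can convert into a LinkedResource. I need to convert the RTF from a RichTextBox in my application into a valid HTML document, and then into a MIME-encoded message that can be sent automatically.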

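The problem with the regex is that the string section holding the image is very large, so I suspect the regex is trying many possible matches across the whole string, when really I only need to look at the beginning and the end of that section. The regex below is included as an optional clause in a larger regex, for example someRegexStringA + "|" + imageRegexString + "|" + someRegexStringB.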

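What can I do to make sure less checking happens inside the large string, so that my application does not appear to freeze while parsing a large amount of image data?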

// The Regex itself
private static string imageRegexString = @"(?<imageCheck>\\pict)"                  // Look for the opening image tag
                                       + @"(?:\\picwgoal(?<widthNumber>[0-9]+))"   // Read the size of the image's width
                                       + @"(?:\\pichgoal(?<heightNumber>[0-9]+))"  // Read the size of the image's height
                                       + @"(?:\\pngblip(\r|\n))"                   // The image is the newline after this portion of the opening tag and information
                                       + @"(?<imageData>(.|\r|\n)+?)"              // Read the bitmap
                                       + @"(?:}+)";                                // Look for closing braces

// The expression is compiled so it doesn't take as much time during runtime
private static Regex myRegularExpression = new Regex(imageRegexString, RegexOptions.Compiled);

// Iterate through each image in the document
foreach(Match image in myRegularExpression.Matches(myDocument))
{
    // Read the image height and width
    int imageWidth = int.Parse(image.Groups["widthNumber"].Value);
    int imageHeight = int.Parse(image.Groups["heightNumber"].Value);

    // Process the image
    ProcessImageData(image.Groups["imageData"].Value);
}

score 2 · Accepted Answer

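First, I vaguely remember that there is an InfoPath form with a rich text editor that can export to HTML, so you may want to look at that (although we still had to attach the images separately).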

As for your pattern: it is fairly simple, and only one line looks suspicious:

(?<imageData>(.|\r|\n)+?)

This has several potential problems:

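+? is lazy: after each character it consumes, the engine tries the rest of the pattern and backtracks when that fails. On a long string this happens once per character, which may be inefficient.
.|\r|\n also looks inefficient, since it is an alternation evaluated for every character. You can use the Singleline option (RegexOptions.Singleline, or inline (?s:...)) so that . matches every character, including \n. By the way, . already matches \r; only \n is excluded by default.
(.|\r|\n) is a capturing group, unlike the (?:...) groups you use elsewhere. I suspect this is what is killing you: in .NET, every iteration of a repeated capturing group is saved on a stack as a Capture, so here you get one Capture per character. You don't want that.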
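To see the capture overhead for yourself, here is a minimal sketch. The class name, method names, and the 1,000-character input are my own placeholders, not anything from your code; the input only stands in for real image data.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CaptureDemo
{
    // Original style: an unnamed capturing group repeated once per character.
    public const string CapturingPattern = @"(?<imageData>(.|\r|\n)+?)(?:}+)";

    // Same intent without the inner group; Singleline lets '.' match '\n' too.
    public const string SinglelinePattern = @"(?<imageData>.+?)(?:}+)";

    // Number of Capture objects stored for the inner (.|\r|\n) group.
    public static int CountInnerCaptures(string input)
    {
        Match match = Regex.Match(input, CapturingPattern);
        // .NET numbers unnamed groups before named ones, so the inner group is group 1.
        return match.Groups[1].Captures.Count;
    }

    // Number of Capture objects stored for imageData when no inner group exists.
    public static int CountImageDataCaptures(string input)
    {
        Match match = Regex.Match(input, SinglelinePattern, RegexOptions.Singleline);
        return match.Groups["imageData"].Captures.Count;
    }

    public static void Main()
    {
        // Hypothetical stand-in for image data: 1,000 characters and a closing brace.
        string input = new string('a', 1000) + "}";

        Console.WriteLine("Inner group captures: " + CountInnerCaptures(input));
        Console.WriteLine("imageData captures:   " + CountImageDataCaptures(input));
    }
}
```

The first count grows with the length of the data, while the second stays at one per match, which is why dropping the inner capturing group matters for large images.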

I'd suggest this instead, with an atomic group (?>...), which behaves like a possessive quantifier, just to be safe:

(?<imageData>(?>[^}]+))
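Dropped into your full pattern, that could look roughly like the sketch below. Only the imageData line differs from your original. The class name and the sample RTF fragment are made up for illustration, and I am assuming the image data itself never contains a closing brace (which should hold for hex-encoded data, but check your documents).

```csharp
using System;
using System.Text.RegularExpressions;

public static class ImagePatternSketch
{
    private static readonly string imageRegexString =
          @"(?<imageCheck>\\pict)"                   // Look for the opening image tag
        + @"(?:\\picwgoal(?<widthNumber>[0-9]+))"    // Read the image's width
        + @"(?:\\pichgoal(?<heightNumber>[0-9]+))"   // Read the image's height
        + @"(?:\\pngblip(\r|\n))"                    // The data starts after this line break
        + @"(?<imageData>(?>[^}]+))"                 // Atomic group: everything up to the first '}'
        + @"(?:}+)";                                 // Look for closing braces

    public static readonly Regex ImageRegex =
        new Regex(imageRegexString, RegexOptions.Compiled);

    public static void Main()
    {
        // Hypothetical, heavily simplified RTF fragment with a few hex digits as "image data".
        string myDocument = "{\\pict\\picwgoal100\\pichgoal50\\pngblip\n89504e47}";

        foreach (Match image in ImageRegex.Matches(myDocument))
        {
            int imageWidth = int.Parse(image.Groups["widthNumber"].Value);
            int imageHeight = int.Parse(image.Groups["heightNumber"].Value);
            string imageData = image.Groups["imageData"].Value;

            Console.WriteLine(imageWidth + " x " + imageHeight + ", " + imageData.Length + " characters of data");
        }
    }
}
```

Because [^}]+ can only stop at a closing brace, and the atomic group never gives characters back, the engine makes a single pass over the image data instead of retrying the rest of the pattern after every character.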

Of course, it is also possible the pattern is slow because of the other alternations: someRegexStringA or someRegexStringB.
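One way to check that is to time each alternative on its own against the same document. This is only a rough sketch: the Measure helper, the class name, and the sample document are hypothetical, and you would pass in your real someRegexStringA, imageRegexString, and someRegexStringB in place of the pattern used in Main.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class AlternationTiming
{
    // Runs one sub-pattern in isolation and reports its match count and elapsed time.
    public static (int Count, TimeSpan Elapsed) Measure(string pattern, string document)
    {
        var regex = new Regex(pattern, RegexOptions.Compiled);

        var stopwatch = Stopwatch.StartNew();
        // MatchCollection is evaluated lazily; reading Count forces every match to be found.
        int count = regex.Matches(document).Count;
        stopwatch.Stop();

        return (count, stopwatch.Elapsed);
    }

    public static void Main()
    {
        // Hypothetical document; substitute the real RTF and each of the real sub-patterns.
        string myDocument = "{\\pict first}{\\pict second}";

        var result = Measure(@"\\pict", myDocument);
        Console.WriteLine(result.Count + " matches in " + result.Elapsed.TotalMilliseconds + " ms");
    }
}
```

If one alternative takes much longer than the others on the same input, that is the one to rewrite first.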
