c# - Matching multiple instances of complex regex non-capturing groups that appear in an unpredictable order

Question

TL;DR

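I am parsing a report with C# regular expressions, with multiline enabled, processing the whole file with a single (complex) regex pattern that uses named groups (and CaptureCollection).

Sections of my report appear out of order, or are missing, in ways I cannot predict.

How can I match them regardless of the order in which they appear?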

Preface

I am parsing a report with regular expressions in C# (.NET 3.5) using System.Text.RegularExpressions. Part of the report looks like this:

     Section Z              0 __ base 10
                            2 __ 19/04 20:06:39
                            2 __ 19/04 20:15:49
                          1.8 __ 19/04 20:09:35
                          1.6 __ 19/04 20:07:01
                          1.6 __ 19/04 20:08:29
     Section 7            0.8 __ base 10
                            8 __ 18/04 21:03:01
                          7.3 __ 18/04 21:02:17
                          3.7 __ 19/04 08:41:09
                          3.4 __ 19/04 00:13:08
                          3.3 __ 18/04 21:02:50
     Section C              0 __ base 10
                         19.7 __ 19/04 10:25:06
                         11.1 __ 19/04 10:15:01
                          8.8 __ 19/04 10:14:50
                          7.2 __ 19/04 19:51:37
                          6.1 __ 19/04 14:19:47

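My regex uses the (?mx) options (Multiline, IgnorePatternWhitespace) and matches the text file as a whole. Because the statistics part has sub-statistics for each section, I hand-crafted an optional (?) non-capturing group, (?:match_this_text), for each section and placed the groups in the pattern in the order I thought they occurred, like this: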

(?mx) #Turn on options multiline, ignore whitespace.
(?: # base 10 statistic sections
    (?:
        [\s-[\n\r]]*(?i:Section\sZ)\s+(?<base10_SectionZ>\d+\.\d|\d+)\s__\sbase\s10
        (?:\r?\n)+
        (?:\s+(?<base10_SectionZ_instance>\d+\.\d|\d+)\s__\s(?<base10_SectionZ_instance_time>\d\d/\d\d\s\d\d:\d\d:\d\d)(?:\r?\n)+)+
    )?
    (?:
        [\s-[\n\r]]*(?i:Section\s7)\s+(?<base10_Section7>\d+\.\d|\d+)\s__\sbase\s10
        (?:\r?\n)+
        (?:\s+(?<base10_Section7_instance>\d+\.\d|\d+)\s__\s(?<base10_Section7_instance_time>\d\d/\d\d\s\d\d:\d\d:\d\d)(?:\r?\n)+)+
    )?
    (?:
        [\s-[\n\r]]*(?i:Section\sC)\s+(?<base10_SectionC>\d+\.\d|\d+)\s__\sbase\s10
        (?:\r?\n)+
        (?:\s+(?<base10_SectionC_instance>\d+\.\d|\d+)\s__\s(?<base10_SectionC_instance_time>\d\d/\d\d\s\d\d:\d\d:\d\d)(?:\r?\n)+)+
    )?
)

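In each section's non-capturing group, the first line matches the "section header", the second line matches the line breaks between the header and the statistic instances, and the third line matches a single statistic instance (repeated for n instances).

Problem

Depending on the version that is running, the program that generates this report outputs the sections (e.g. Section Z, Section 7, Section C) in a different order, and in some cases some sections are missing. When I ran the pattern against a second test file it failed, because the sections were out of order.

So Section C may appear before Section Z, but the regex pattern expects Z to appear before C.

Basically, I want the same regex to match and extract the same data (using the named groups above) regardless of the order in which the sections appear, so that it matches both the test data above and this test data: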

     Section 7            0.8 __ base 10
                            8 __ 18/04 21:03:01
                          7.3 __ 18/04 21:02:17
                          3.7 __ 19/04 08:41:09
                          3.4 __ 19/04 00:13:08
                          3.3 __ 18/04 21:02:50
     Section C              0 __ base 10
                         19.7 __ 19/04 10:25:06
                         11.1 __ 19/04 10:15:01
                          8.8 __ 19/04 10:14:50
                          7.2 __ 19/04 19:51:37
                          6.1 __ 19/04 14:19:47
     Section Z              0 __ base 10
                            2 __ 19/04 20:06:39
                            2 __ 19/04 20:15:49
                          1.8 __ 19/04 20:09:35
                          1.6 __ 19/04 20:07:01
                          1.6 __ 19/04 20:08:29

score 1 · Accepted Answer

Do you just want to capture each section?

Wouldn't this work? (Section ..*(?:\r.*){0,5})

http://regexr.com?30nfd

Answered 2012-04-21T01:55:29.110

score 0 · Accepted Answer

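I think that in this case several different regexes would work better than one giant regex. I would read the file with File.ReadAllLines and then test each line with String.Contains("Section"). If the line contains "Section", create a new section object and run a section regex to populate it (section name and section data). If it does not, run a different regex to pick up the remaining section data and append it to the current section object.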
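The line-by-line approach described above could be sketched roughly as follows. This is only an illustration under assumptions: the ReportSection and LineByLineParser types, their member names, and the two per-line patterns are hypothetical and were adapted from the question's pattern, not taken from the answer. The header test is case-insensitive, which is a small deviation from a plain String.Contains.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical container for one parsed section (not part of the original answer).
public class ReportSection
{
    public string Name;
    public string Base10;
    // Each entry is (instance value, timestamp text).
    public List<KeyValuePair<string, string>> Instances =
        new List<KeyValuePair<string, string>>();
}

public static class LineByLineParser
{
    // Header line, e.g. "Section Z   0 __ base 10".
    static readonly Regex Header = new Regex(
        @"^\s*Section\s(?<name>\S+)\s+(?<value>\d+\.\d|\d+)\s__\sbase\s10\s*$",
        RegexOptions.IgnoreCase);

    // Instance line, e.g. "1.8 __ 19/04 20:09:35".
    static readonly Regex Instance = new Regex(
        @"^\s*(?<value>\d+\.\d|\d+)\s__\s(?<time>\d\d/\d\d\s\d\d:\d\d:\d\d)\s*$");

    public static List<ReportSection> Parse(IEnumerable<string> lines)
    {
        var sections = new List<ReportSection>();
        ReportSection current = null;

        foreach (string line in lines)
        {
            if (line.IndexOf("Section", StringComparison.OrdinalIgnoreCase) >= 0)
            {
                Match h = Header.Match(line);
                if (!h.Success)
                {
                    // A header this sketch does not understand (e.g. "base 2"):
                    // stop attaching instance lines to the previous section.
                    current = null;
                    continue;
                }
                current = new ReportSection
                {
                    Name = h.Groups["name"].Value,
                    Base10 = h.Groups["value"].Value
                };
                sections.Add(current);
            }
            else if (current != null)
            {
                Match i = Instance.Match(line);
                if (i.Success)
                {
                    current.Instances.Add(new KeyValuePair<string, string>(
                        i.Groups["value"].Value, i.Groups["time"].Value));
                }
            }
        }
        return sections;
    }
}
```

With this shape the order of the sections no longer matters, since each header simply starts a new object. It would be called as LineByLineParser.Parse(File.ReadAllLines(path)), where path is whatever report file is being read.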

score 0 · Accepted Answer

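You may want to use the \G anchor to tie each match to the end of the previous one, so that you can still make sure there is nothing unwanted between your sections.

You can then use a more generic expression for a single section: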

(?mx) #Turn on options multiline, ignore whitespace.
\G
(?: # base 10 statistic sections
    (?:
        [\s-[\n\r]]*(?i:Section\s(Z|7|C))\s+(?<base10_Section>\d+\.\d|\d+)\s__\sbase\s10
        (?:\r?\n)+
        (?:\s+(?<base10_Section_instance>\d+\.\d|\d+)\s__\s(?<base10_Section_instance_time>\d\d/\d\d\s\d\d:\d\d:\d\d)(?:\r?\n)+)+
    )
)

Then verify afterwards that no section is duplicated or missing.
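A minimal sketch of how the \G pattern and the follow-up validation might fit together. The AnchoredSectionScanner class and its Scan method are hypothetical, and the section letter is captured in a group named "name" here (the pattern above leaves that group unnamed) so that the check can read it. The sketch assumes the text starts directly with a section header and that the last line ends with a line break.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AnchoredSectionScanner
{
    // Same shape as the generic pattern above; "name" is an added group name.
    const string Pattern = @"
        \G
        [\s-[\n\r]]*(?i:Section\s(?<name>Z|7|C))\s+(?<base10_Section>\d+\.\d|\d+)\s__\sbase\s10
        (?:\r?\n)+
        (?:\s+(?<base10_Section_instance>\d+\.\d|\d+)\s__\s
           (?<base10_Section_instance_time>\d\d/\d\d\s\d\d:\d\d:\d\d)(?:\r?\n)+)+";

    static readonly Regex SectionRegex =
        new Regex(Pattern, RegexOptions.IgnorePatternWhitespace);

    // Returns one Match per section, keyed by section letter.
    // Throws if a section repeats or if anything other than sections is present.
    public static Dictionary<string, Match> Scan(string report)
    {
        var found = new Dictionary<string, Match>(StringComparer.OrdinalIgnoreCase);
        int end = 0;

        // Because of \G, every match must begin exactly where the previous one ended.
        for (Match m = SectionRegex.Match(report); m.Success; m = m.NextMatch())
        {
            string name = m.Groups["name"].Value;
            if (found.ContainsKey(name))
                throw new FormatException("Duplicate section: " + name);
            found.Add(name, m);
            end = m.Index + m.Length;
        }

        // The scan stops at the first text that is not a section; anything left is unexpected.
        if (report.Substring(end).Trim().Length != 0)
            throw new FormatException("Unexpected content at offset " + end);

        return found;
    }
}
```

Since the question says sections can legitimately be absent, the sketch leaves the "missing" check to the caller, for example found.ContainsKey("Z"). The instance values and times of a section are available through the Captures collection of the corresponding groups.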

score 0 · Accepted Answer

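You should not give the regex engine the option of matching nothing.
It will wander around finding a lot of "nothing" before it ever finds the optional parts.

Edit

If you want just a single block match (any order, but in sequence), something like this would work.
Your current approach, modified: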

(?:
   (?: Section ...  (?<sec_7> 7)
   )
 | (?: Section ...  (?<sec_C> C)
   )?
 | (?: Section ...  (?<sec_Z> Z)
   )
)
(?: Section ...  (?!\k<sec_7>) (?<sec_7>  7) )?
(?: Section ...  (?!\k<sec_C>) (?<sec_C>  C) )?
(?: Section ...  (?!\k<sec_Z>) (?<sec_Z>  Z) )?

If it can be factored, then like this:

(?: Section ...  (?<sec_a>(?:7|C|Z)) )
(?: Section ...  (?<sec_b>(?!\k<sec_a>)(?:7|C|Z)) )?
(?: Section ...  (?<sec_c>(?!\k<sec_a>|\k<sec_b>)(?:7|C|Z)) )?
#
# Then after match check <sec_a/b/c> for its value

If you do not care about a block match:
Your case revolves only around an OR condition, so it can be as simple as this:

# base 10 statistic sections
    (?: ..)
  |
    (?: ..)
  |
    (?: ..)

Each match then has to be checked in a while loop to see which "base 10" section it matched:

Match m = Regex.Match(input, regex, RegexOptions.IgnorePatternWhitespace);
while (m.Success)
{
   if (m.Groups["base10_Section7"].Success)  {    }
   else
   if (m.Groups["base10_SectionZ"].Success)  {    }
   else
   if (m.Groups["base10_SectionC"].Success)  {    }
   m = m.NextMatch();
}

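This can be reduced even further. For example, 7, Z and C can be combined into one block.
That frees the OR (|) to match other, different items, such as "base 2"
or any other form. Exactly one form will match per iteration, and it has to be checked either way.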

string input = @"
    Section Z              0 __ base 10
                           2 __ 19/04 20:06:39
                           2 __ 19/04 20:15:49
                         1.8 __ 19/04 20:09:35
                         1.6 __ 19/04 20:07:01
                         1.6 __ 19/04 20:08:29
    Section P           16.1 __ base 2
    Section 7            0.8 __ base 10
                           8 __ 18/04 21:03:01
                         7.3 __ 18/04 21:02:17
                         3.7 __ 19/04 08:41:09
                         3.4 __ 19/04 00:13:08
                         3.3 __ 18/04 21:02:50
    Section C              0 __ base 10
                        19.7 __ 19/04 10:25:06
                        11.1 __ 19/04 10:15:01
                         8.8 __ 19/04 10:14:50
                         7.2 __ 19/04 19:51:37
                         6.1 __ 19/04 14:19:47
    Section r           49.2 __ Base 2
";

string regex = @"
   # base 10 statistic sections
       (?:
         [\s-[\n\r]]*(?i:Section\s(?<base10_Section>Z|7|C)\s+(?<Base10>\d+\.\d|\d+)\s__\sbase)\s10
         (?:\r?\n)+
         (?:\s+(?<Instance>\d+\.\d|\d+)\s__\s(?<Time>\d\d/\d\d\s\d\d:\d\d:\d\d)(?:\r?\n)+)+
       )
     |  # Or, base 2 statistic sections
       (?:
         [\s-[\n\r]]*(?i:Section\s(?<base2_Section>R|P)\s+(?<Base2>\d+\.\d|\d+)\s__\sbase)\s2
         (?:\r?\n)+
       )
   # |  Or, something else

";

Match m = Regex.Match(input, regex, RegexOptions.IgnorePatternWhitespace);
int matchCount = 0;
while (m.Success)
{
    Console.WriteLine("\nMatch " + (++matchCount) + "\n------------------");
    // Check base 10
    if (m.Groups["base10_Section"].Success)
    {
        Console.WriteLine("Section (base10)  '" + m.Groups["base10_Section"] + "'  =  '" + m.Groups["Base10"] + "'\n");

        int count = m.Groups["Instance"].Captures.Count;
        // Instance
        for (int j = 0; j < count; j++)
            System.Console.WriteLine("    Instance (" + j + ") =  '" + m.Groups["Instance"].Captures[j] + "' ");
        // Time
        for (int j = 0; j < count; j++)
            System.Console.WriteLine("    Time(" + j + ") =  '" + m.Groups["Time"].Captures[j] + "' ");
        // Combined ..
        for (int j = 0; j < count; j++)
            System.Console.WriteLine("    Instance,Time  (" + j + ") =  '" +
                                          m.Groups["Instance"].Captures[j] + "' __ '" +
                                          m.Groups["Time"].Captures[j] + "' ");
    }
    else
    // Check base 2
    if (m.Groups["base2_Section"].Success)
        Console.WriteLine("Section (base2)  '" + m.Groups["base2_Section"] + "'  =  '" + m.Groups["Base2"] + "'\n");

    m = m.NextMatch();
}
