c# - Regular expression: multiple captures within multiple captures

Question

I have a regular expression that works perfectly.

^SENT KV(?<singlelinedata> L(?<line>[1-9]\d*) (?<measureline>\d+)(?: (?<samplingpoint>\d+))+)+$

My input string looks like this:

SENT KV L1 123 1 2 3 L2 456 4 5 6

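The only problem is: how do I get the context of all captures of the group "samplingpoint"?

That group contains 6 captures, but I also need the context information. Three of them fall within the first capture of the group "singlelinedata" and three within the second. How do I get this information?

A capture of a group has no property that holds the captures of the groups nested inside it.

I know I could write one regex that matches the whole string and then run a second regex over each "singlelinedata" capture.

I am looking for a way that works with the regex given above.

I hope someone can help me.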

score 0 · Accepted Answer

Based on Markus Jarderot's answer, I wrote an extension method for Group that takes a capture and returns all captures of that group that start within the specified capture.

The extension method looks like this:

    public static IEnumerable<Capture> CapturesWithin(this Group source, Capture captureContainingGroup)
    {
        var lowerIndex = captureContainingGroup.Index;
        var upperIndex = lowerIndex + captureContainingGroup.Length - 1;

        foreach (var capture in source.Captures.Cast<Capture>())
        {
            if (capture.Index < lowerIndex)
            {
                continue;
            }

            if (capture.Index > upperIndex)
            {
                break;
            }

            yield return capture;
        }
    }

Usage of this method:

    foreach (var capture in match.Groups["singlelinedata"].Captures.Cast<Capture>())
    {
        var samplingpoints = match.Groups["samplingpoint"].CapturesWithin(capture).ToList();
        ...
    }

score 0 · Accepted Answer

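The regex API has no concept of "sub-groups". A group can have multiple captures, but the API does not tell you which samplingpoint belongs to which line.

Your only option is to work it out yourself from the character indices.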
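As a rough illustration of the index approach, here is a minimal sketch using the regex from the question. The class name `SamplingPointDemo` and the helper `SamplingPointsPerLine` are made up for this example. It assumes a samplingpoint capture belongs to a singlelinedata capture when its start index falls inside that capture's span.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class SamplingPointDemo
{
    // The pattern from the question, unchanged.
    private static readonly Regex Pattern = new Regex(
        @"^SENT KV(?<singlelinedata> L(?<line>[1-9]\d*) (?<measureline>\d+)(?: (?<samplingpoint>\d+))+)+$");

    // Hypothetical helper: one list of samplingpoint values per singlelinedata capture.
    public static List<List<string>> SamplingPointsPerLine(string input)
    {
        var result = new List<List<string>>();
        Match match = Pattern.Match(input);
        if (!match.Success)
        {
            return result;
        }

        foreach (Capture outer in match.Groups["singlelinedata"].Captures)
        {
            int start = outer.Index;
            int end = outer.Index + outer.Length; // exclusive upper bound

            // Keep only the samplingpoint captures that start inside the outer capture.
            List<string> points = match.Groups["samplingpoint"].Captures
                .Cast<Capture>()
                .Where(c => c.Index >= start && c.Index < end)
                .Select(c => c.Value)
                .ToList();

            result.Add(points);
        }

        return result;
    }

    public static void Main()
    {
        foreach (var points in SamplingPointsPerLine("SENT KV L1 123 1 2 3 L2 456 4 5 6"))
        {
            Console.WriteLine(string.Join(",", points));
        }
        // Prints:
        // 1,2,3
        // 4,5,6
    }
}
```

This rescans all samplingpoint captures for every outer capture, which should be fine for short inputs like the one in the question.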

score 0 · Accepted Answer

void Main()
{
    string data = @"SENT KV L1 123 1 2 3 L2 456 4 5 6";
    Parse(data).Dump();
}

public class Result
{
    public int Line;
    public int MeasureLine;
    public List<int> SamplingPoints;
}

private Regex pattern = new Regex(@"^SENT KV(?<singlelinedata> L(?<line>[1-9]\d*) (?<measureline>\d+)(?: (?<samplingpoint>\d+))+)+$", RegexOptions.Multiline);

public IEnumerable<Result> Parse(string data)
{
    foreach (Match m in pattern.Matches(data))
    {
        foreach (Capture c1 in m.Groups["singlelinedata"].Captures)
        {
            // Only captures that start inside c1 belong to this singlelinedata entry.
            var result = new Result();
            result.Line = int.Parse(m.Groups["line"].CapturesWithin(c1).First().Value);
            result.MeasureLine = int.Parse(m.Groups["measureline"].CapturesWithin(c1).First().Value);

            result.SamplingPoints = new List<int>();
            foreach (Capture c2 in m.Groups["samplingpoint"].CapturesWithin(c1))
            {
                result.SamplingPoints.Add(int.Parse(c2.Value));
            }

            yield return result;
        }
    }
}

public static class RegexExtensions
{
    public static IEnumerable<Capture> CapturesWithin(this Group group, Capture capture)
    {
        foreach (Capture c in group.Captures)
        {
            if (c.Index < capture.Index) continue;
            if (c.Index >= capture.Index + capture.Length) break;

            yield return c;
        }
    }
}

Edit: Rewritten as an extension method on Group.

score 0 · Accepted Answer

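One way to avoid a lot of index matching and still keep a single regex is to give all the capture groups the same name. The nested captures are pushed onto the stack first, so you end up with an array like this: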

["1", "123", "1", "2", "3", "L1 123 1 2 3", "2", "456", "4", "5", "6", "L2 456 4 5 6"]

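After that, it is just a matter of some LINQ craziness: start a new group whenever a capture containing an L is found, then pull the data out of each group.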

var regex = new Regex(@"^SENT KV(?<singlelinedata> L(?<singlelinedata>[1-9]\d*) (?<singlelinedata>\d+)(?: (?<singlelinedata>\d+))+)+$");
var matches = regex.Matches("SENT KV L1 123 1 2 3 L2 456 4 5 6 12 13 L3 789 7 8 9 10");
var singlelinedata = matches[0].Groups["singlelinedata"];

string groupKey = null;
var result = singlelinedata.Captures.OfType<Capture>()
    .Reverse()
    .GroupBy(key => groupKey = key.Value.Contains("L") ? key.Value : groupKey, value => value.Value)
    .Reverse()
    .Select(group => new { key = group.Key, data = group.Skip(1).Reverse().ToList() })
    .Select(item => new { line = item.data.First(), measureline = item.data.Skip(1).First(), samplingpoints = item.data.Skip(2).ToList() })
    .ToList();
