c# - Match a line with fixed columns for as long as possible

Question

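I am going to parse a position-based file from a legacy system. Every column in the file has a fixed width, and each line can be at most 80 characters long. The problem is that you don't know how long a line is. Sometimes only the first five columns are filled in, and sometimes all columns are used.

If I knew that all 80 characters were used, I could do something like this: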

^\s*
 (?<a>\w{3})
 (?<b>[ \d]{2})
 (?<c>[ 0-9a-fA-F]{2})
 (?<d>.{20})
 ...

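The problem with this is that the line will not match if the last columns are missing. The last column that is present can even contain fewer characters than that column's maximum.

See the examples: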

Text to match         a   b  c  d
"AQM45A3A text   " => AQM 45 A3 "A text   "  //group d has 9 chars instead of 20
"AQM45F5"          => AQM 45 F5              //group d is missing
"AQM4"             => AQM  4                 //group b has 1 char instead of 2
"AQM4  ASome Text" => AQM  4  A "Some Text"  //group b and c only uses one char, but fill up the gap with space
"AQM4FSome Text"   => No match, group b should have two numbers, but it is only one.
"COM*A comment"    => Comments do not match (all comments are prefixed with COM*)
"       "          => Empty lines do not match

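How should I design a regular expression to match this?

Edit 1

In this example, every line I want to parse starts with AQM.

Column a always starts at position 0
Column b always starts at position 3
Column c always starts at position 5
Column d always starts at position 7

If a column does not use all of its space, it is padded with spaces. Only the last used column can be trimmed.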

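Edit 2

To make things clearer, I have attached some examples of what the data can look like, along with the column definitions (note that the examples given earlier in the question are heavily simplified).

AQM example · AQM definition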

score 3 · Accepted Answer

I'm not sure a regular expression is the right tool here. If I understand your structure, you want something like

if (length >= 8) 
   d = everything 8th column on
   remove field d
else
   d = empty

if (length >= 6)
   c = everything 6th column on
   remove field c
else
   c = empty

And so on. A regular expression could probably do this, but it would likely be rather contrived.
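The same idea can be sketched in C# with plain `Substring` calls. This is only a minimal sketch: it slices by the start offsets given in the question's Edit 1 (0, 3, 5, 7) instead of removing fields from the end, the names `AqmSlicer`, `Slice` and `Split` are made up for illustration, and it does not validate column contents or skip comment lines.

```csharp
using System;

public static class AqmSlicer
{
    // Returns the text between start (inclusive) and end (exclusive),
    // clipped to the actual line length. An empty string means the
    // column is absent from this line.
    public static string Slice(string line, int start, int end)
    {
        if (line.Length <= start) return string.Empty;
        int stop = Math.Min(end, line.Length);
        return line.Substring(start, stop - start);
    }

    // Splits a line into the four simplified columns from the question.
    public static string[] Split(string line)
    {
        return new[]
        {
            Slice(line, 0, 3),   // a: 3 characters
            Slice(line, 3, 5),   // b: 2 characters
            Slice(line, 5, 7),   // c: 2 characters
            Slice(line, 7, 27),  // d: up to 20 characters
        };
    }
}
```

Because every slice is clipped to the line length, a short line such as `AQM4` simply yields a one-character b and empty c and d. Content checks (digits in b, hex in c) would have to be done separately on the returned values.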

score 1 · Accepted Answer

Try using `?` after the groups that may not be present. That way you still get a match when some groups are missing.

Edit n, after Sguazz's answer

I would use

(?<a>AQM)(?<b>[ \d]{2})?(?<c>[ 0-9a-fA-F]{2})?(?<d>.{0,20})?

or even a `+` instead of `{0,20}` for the last group, if there can be more than 20 characters.
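With optional groups, the caller has to tell "group did not participate" apart from "group matched". In .NET that is what `Group.Success` reports. The sketch below uses the `{0,20}` pattern from this answer verbatim; the class and method names are hypothetical, and returning `null` for an absent group is just one possible convention.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OptionalGroupDemo
{
    // Pattern copied verbatim from the answer above.
    private static readonly Regex Row = new Regex(
        @"(?<a>AQM)(?<b>[ \d]{2})?(?<c>[ 0-9a-fA-F]{2})?(?<d>.{0,20})?");

    // Returns the captured text of the named group, or null when the
    // line does not match or the optional group did not participate.
    public static string ValueOrNull(string line, string groupName)
    {
        Match m = Row.Match(line);
        if (!m.Success) return null;

        Group g = m.Groups[groupName];
        return g.Success ? g.Value : null;
    }
}
```

For example, `ValueOrNull("AQM45F5", "c")` gives `"F5"`, while `ValueOrNull("AQM", "b")` gives `null` because group b needs two characters and none are left.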

Edit n+1

Is this better?

(?<a>\w{3})(?<b>\d[ \d])(?<c>[0-9a-fA-F][ 0-9a-fA-F])(?<d>.+)

score 1 · Accepted Answer

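So, to rephrase: in your example you have a sequence of characters, and you know that the first 3 belong to group A, the next 2 to group B, then 2 to group C and 20 to group D, but there may not be that many characters.

Try: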

(?<a>\w{0,3})(?<b>[ \d]{0,2})(?<c>[ 0-9a-fA-F]{0,2})(?<d>.{0,20})

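Basically, the numbers are now upper bounds for the groups rather than fixed sizes.

Edit, to reflect your last comment: if you know that all relevant lines start with "AQM", you can replace group A with (?<a>AQM)

Another edit: let's try this.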

(?<a>AQM)(?<b>[ \d]{2}|[ \d]$)(?<c>[ 0-9a-fA-F]{0,2})(?<d>.{0,20})
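A quick way to check this last pattern against the examples in the question is a small C# helper like the one below. It is only a sketch: the pattern is copied as-is (unanchored), and `AqmRegexDemo` and `Describe` are illustrative names, not part of any library.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AqmRegexDemo
{
    // Pattern copied verbatim from the answer above.
    private static readonly Regex Row = new Regex(
        @"(?<a>AQM)(?<b>[ \d]{2}|[ \d]$)(?<c>[ 0-9a-fA-F]{0,2})(?<d>.{0,20})");

    // Returns the four groups joined with '|', or null when the line
    // does not match (comments, blank lines, malformed b column).
    public static string Describe(string line)
    {
        Match m = Row.Match(line);
        if (!m.Success) return null;

        return string.Join("|",
            m.Groups["a"].Value,
            m.Groups["b"].Value,
            m.Groups["c"].Value,
            m.Groups["d"].Value);
    }
}
```

With this helper, `Describe("AQM4  ASome Text")` returns `"AQM|4 | A|Some Text"`, and `Describe("AQM4FSome Text")` returns `null`, since `4F` is neither two digit-or-space characters nor a single one at the end of the line. Note that group b is mandatory in this pattern, so a bare `AQM` line does not match.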

score 0 · Accepted Answer

Perhaps you could use a function like this one to break the string into its column values. It doesn't parse comment strings and is able to handle strings that are shorter than 80 characters. It doesn't validate the contents of the columns though. Maybe you can do that when you use the values.

/// <summary>
/// Break a data row into a collection of strings based on the expected column widths.
/// </summary>
/// <param name="input">The width delimited input data to break into sub strings.</param>
/// <returns>
/// An empty collection if the input string is empty or a comment.
/// A collection of the width delimited values contained in the input string otherwise.
/// </returns>
private static IEnumerable<string> ParseRow(string input) {
    const string COMMENT_PREFIX = "COM*";
    var columnWidths = new int[] { 3, 2, 2, 3, 6, 14, 2, 2, 3, 2, 2, 10, 7, 7, 2, 1, 1, 2, 7, 1, 1 };
    int inputCursor = 0;
    int columnIndex = 0;
    var parsedValues = new List<string>();

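    // Skip null, empty, or whitespace-only lines and comment lines (ordinal comparison avoids culture-sensitive matching).
    if (String.IsNullOrWhiteSpace(input) || input.StartsWith(COMMENT_PREFIX, StringComparison.Ordinal)) {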
        return parsedValues;
    }

    while (inputCursor < input.Length && columnIndex < columnWidths.Length) {
        //Make sure the column width never exceeds the bounds of the input string. This can happen if the input string doesn't end on the edge of a column.
        int columnWidth = Math.Min(columnWidths[columnIndex++], input.Length - inputCursor);
        string columnValue = input.Substring(inputCursor, columnWidth);
        parsedValues.Add(columnValue);
        inputCursor += columnWidth;
    }
    return parsedValues;
}
