.net - .NET RegEx - First N chars of the first M lines

Question

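I would like 4 general RegEx expressions for the following 4 basic cases:

1. Up to A chars, starting after B chars from the start of the line, on up to C lines, starting after D lines from the start of the file
2. Up to A chars, starting after B chars from the start of the line, on up to C lines, occurring before D lines from the end of the file
3. Up to A chars, starting before B chars from the end of the line, on up to C lines, starting after D lines from the start of the file
4. Up to A chars, starting before B chars from the end of the line, on up to C lines, occurring before D lines from the end of the file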

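These would allow selecting arbitrary blocks of text anywhere in the file.

So far I have managed to come up with cases that work for lines and chars only: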

(?<=(?m:^[^\r]{N}))[^\r]{1,M} = UP TO M chars EVERY LINE, AFTER FIRST N chars
[^\r]{1,M}(?=(?m:.{N}\r$)) = UP TO M chars EVERY LINE, BEFORE LAST N chars

The above 2 expressions are for chars, and they return many matches (one for each line).
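As a quick way to try out the two character expressions, here is a minimal Python sketch. It is an illustration rather than the .NET code: it assumes LF-only line endings (so `.` stands in for `[^\r]`), and the values N = 2 and M = 4 and the names `after_first` and `before_last` are made up for the example.

```python
import re

# Sample text with LF line endings (an assumption; the patterns above target CRLF).
TEXT = "\n".join(f"line{i} blah {i}" for i in range(1, 7))

N, M = 2, 4  # hypothetical values: skip N characters, take up to M

# Up to M characters on every line, after the first N characters.
# Python's re needs a fixed-width lookbehind, which ^.{N} satisfies.
after_first = re.compile(r"(?<=^.{%d}).{1,%d}" % (N, M), re.MULTILINE)

# Up to M characters on every line, before the last N characters.
before_last = re.compile(r".{1,%d}(?=.{%d}$)" % (M, N), re.MULTILINE)

print(after_first.findall(TEXT))
print(before_last.findall(TEXT))
```

Each pattern produces one match per line of the sample, which is the behaviour described above.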

(?<=(\A([^\r]*\r\n){N}))(?m:\n*[^\r]*\r$){1,M} = UP TO M lines AFTER FIRST N lines
(((?=\r?)\n[^\r]*\r)|((?=\r?)\n[^\r]+\r?)){1,M}(?=((\n[^\r]*\r)|(\n[^\r]+\r?)){N}\Z)= UP TO M lines BEFORE LAST N lines from end

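These 2 expressions are the equivalents for lines, but they always return just one match.

The task is to combine these expressions to allow for scenarios 1-4. Can anyone help?

Note that the case in the title of the question is just a subclass of scenario #1, where B = 0 and D = 0.

Example 1: Chars 3-6 of lines 3-5. A total of 3 matches.

Source: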

line1 blah 1
line2 blah 2
line3 blah 3
line4 blah 4
line5 blah 5
line6 blah 6

Result:

<match>ne3 </match>
<match>ne4 </match>
<match>ne5 </match>

Example 2: Last 4 chars of the 2 lines before the last 1 line. A total of 2 matches.

Source:

line1 blah 1
line2 blah 2
line3 blah 3
line4 blah 4
line5 blah 5
line6 blah 6

Result:

<match>ah 4</match>
<match>ah 5</match>

score 2 · Accepted Answer

Here is a regex for basic case 2:

Regex regexObj = new Regex(
    @"(?<=              # Assert that the following can be matched before the current position
     ^                # Start of line
     .{2}             # 2 characters (B = 2)
    )                 # End of lookbehind assertion
    .{1,3}            # Match 1-3 characters (A = 3)
    (?=               # Assert that the following can be matched after the current position
     [^\r\n]*         # rest of the current line, up to but excluding \r\n
     (?:\r\n.*){2,4}  # 2 to 4 entire lines (D = 2, C = 4+1-2)
     \z               # end of the string
    )", 
    RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace);

In the text

line1 blah 1
line2 blah 2
line3 blah 3
line4 blah 4
line5 blah 5
line6 blah 6

it will match

ne2
ne3
ne4

(ne2 starts at the third character (B = 2) of the fifth-to-last line (C + D = 5), and so on.)
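For experimenting outside .NET, the same idea can be sketched in Python. This is a port under assumptions, not the answer's code: it uses LF line endings with no trailing newline, `\Z` in place of `\z`, and `re.VERBOSE` in place of `IgnorePatternWhitespace`. Python's `re` only accepts fixed-width lookbehind, which `^.{2}` happens to satisfy.

```python
import re

# LF line endings and no trailing newline are assumptions of this sketch.
TEXT = "\n".join(f"line{i} blah {i}" for i in range(1, 7))

case2 = re.compile(
    r"""(?<=^.{2})         # exactly 2 characters after the start of the line (B = 2)
        .{1,3}             # match 1-3 characters (A = 3)
        (?=
          .*               # rest of the current line
          (?:\n.*){2,4}    # 2 to 4 entire lines (D = 2, C = 3)
          \Z               # end of the string
        )""",
    re.MULTILINE | re.VERBOSE,
)

matches = case2.findall(TEXT)
print(matches)  # ['ne2', 'ne3', 'ne4']
```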

score 1 · Accepted Answer

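Edit: Based on your comments, it sounds like this really is beyond your control. The reason I posted this answer is that I feel that often, especially when it comes to regular expressions, developers are prone to getting caught up in the technical challenge and losing sight of the actual goal: solving the problem. I know I am this way as well. I think it is just an unfortunate consequence of being both technically and creatively minded.

So I wanted to refocus your attention on the problem at hand, if possible, and to emphasize that, in the presence of a well-stocked toolset, Regex is not the right tool for this job. If it is the only tool at your disposal for reasons outside of your control, then of course you have no choice.

I figured you might have real reasons for demanding a Regex solution; but since those reasons were not fully explained, I felt there was still a chance that you were just being stubborn ;)

You say this needs to be done in Regex, but I am not convinced!

First, I'm limited to .NET 2.0 [ . . . ]

No problem. Who says you need LINQ for a problem like this? LINQ just makes things easier; it does not make impossible things possible.

For example, here is one way you could implement the first case from your question (it would be fairly straightforward to refactor this into something more flexible that also covers cases 2-3):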

public IEnumerable<string> ScanText(TextReader reader,
                                    int start,
                                    int count,
                                    int lineStart,
                                    int lineCount)
{
    int i = 0;
    while (i < lineStart && reader.Peek() != -1)
    {
        reader.ReadLine();
        ++i;
    }

    i = 0;
    while (i < lineCount && reader.Peek() != -1)
    {
        string line = reader.ReadLine();

        if (line.Length < start)
        {
            yield return ""; // or null? or continue?
        }
        else
        {
            int length = Math.Min(count, line.Length - start);
            yield return line.Substring(start, length);
        }

        ++i;
    }
}

So there is a .NET 2.0-friendly solution to the general problem, without using Regex (or LINQ).
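To illustrate the refactoring mentioned above, here is a sketch in Python (chosen for brevity; the same slicing logic carries over to C#). The function name `scan_text`, the two `*_from_end` flags, and the reading of a from-end offset as "skip the last b characters, then take up to a characters before them" are assumptions of this sketch, not something stated in the question.

```python
from typing import Iterator


def scan_text(text: str, a: int, b: int, c: int, d: int,
              chars_from_end: bool = False,
              lines_from_end: bool = False) -> Iterator[str]:
    """Yield up to `a` chars per line (skipping `b`) on up to `c` lines (skipping `d`)."""
    lines = text.splitlines()
    if lines_from_end:
        # Skip the last d lines, then keep up to c lines before them.
        stop = max(len(lines) - d, 0)
        selected = lines[max(stop - c, 0):stop]
    else:
        # Skip the first d lines, then keep up to c lines.
        selected = lines[d:d + c]
    for line in selected:
        if chars_from_end:
            # Skip the last b characters, then keep up to a characters before them.
            end = max(len(line) - b, 0)
            yield line[max(end - a, 0):end]
        else:
            # Skip the first b characters, then keep up to a characters.
            yield line[b:b + a]


TEXT = "\n".join(f"line{i} blah {i}" for i in range(1, 7))
print(list(scan_text(TEXT, 4, 2, 3, 2)))              # Example 1 of the question
print(list(scan_text(TEXT, 4, 0, 2, 1, True, True)))  # Example 2 of the question
```

Lines that are too short simply yield a shorter (possibly empty) string, because Python slices never raise on out-of-range bounds.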

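Second, I need the flexibility of RegEx to allow on these [ . . . ]

Maybe I am just being dense; what is stopping you from starting with something non-Regex and then using Regex for more "sophisticated" behavior on top of that? If you need to do extra processing on the lines returned by ScanText above, for instance, you can certainly do so using Regex. But insisting on Regex from the start seems... I don't know, just unnecessary.

Unfortunately, due to the nature of the project, it has to be done in RegEx [. . . ]

If that is truly the case, then very well. But if your reasons are only those in the excerpts above, then I disagree that this particular aspect of the problem (scanning certain characters from certain lines of text) needs to be solved using Regex, even if Regex is required for other aspects of the problem that are outside the scope of this question.

If, on the other hand, you are being forced to use Regex for some arbitrary reason (say, someone wrote into a requirement or spec, possibly without giving it much thought, that regular expressions would be used for this task), well, I would personally advise fighting it. Explain to whoever is in a position to change this requirement that Regex is not necessary and that the problem can easily be solved without Regex... or with a combination of "normal" code and Regex.

The only other possibility I can think of (though this may be the result of my own lack of imagination) that would explain your needing Regex for the problem described in your question is that you are restricted to a particular tool that accepts only Regex as user input. But your question is tagged .net, so I have to assume that you are able, to some extent, to write your own code to solve this problem. And if that is the case, then I will say it again: I don't think you need Regex ;)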

score 1 · Accepted Answer

For starters, here is an answer for "basic case 1":

Regex regexObj = new Regex(
    @"(?<=            # Assert that the following can be matched before the current position
     \A               # Start of string
     (?:.*\r\n){2,4}  # 2 to 4 entire lines (D = 2, C = 4+1-2)
     .{2}             # 2 characters (B = 2)
    )                 # End of lookbehind assertion
    .{1,3}            # Match 1-3 characters (A = 3)", 
    RegexOptions.IgnorePatternWhitespace);

You can now iterate over the matches using

Match matchResults = regexObj.Match(subjectString);
while (matchResults.Success) {
    // matched text: matchResults.Value
    // match start: matchResults.Index
    // match length: matchResults.Length
    matchResults = matchResults.NextMatch();
}

So, in the text

line1 blah 1
line2 blah 2
line3 blah 3
line4 blah 4
line5 blah 5
line6 blah 6

it will match

ne3
ne4
ne5

score 1 · Accepted Answer

Here is one for basic case 3:

Regex regexObj = new Regex(
    @"(?<=            # Assert that the following can be matched before the current position
     \A               # Start of string
     (?:.*\r\n){2,4}  # 2 to 4 entire lines (D = 2, C = 4+1-2)
     .*               # any number of characters
    )                 # End of lookbehind assertion
    (?=               # Assert that the following can be matched after the current position
     [^\r\n]{8}       # 8 characters (B = 8)
     \r?$             # end of line, allowing for the \r of a \r\n line ending
    )                 # End of lookahead assertion
    .{1,3}            # Match 1-3 characters (A = 3)", 
    RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace);

So, in the text

line1 blah 1
line2 blah 2
line3 blah 3
line4 blah 4
line5 blah 5
line6 blah 6

it will match

3 b
4 b
5 b

(3 b because it is 3 characters (A = 3), starting at the 8th-to-last character (B = 8), starting in the third line (D = 2), and so on.)

score 1 · Accepted Answer

And finally, one solution for basic case 4:

Regex regexObj = new Regex(
    @"(?=             # Assert that the following can be matched after the current position
     .{8}             # 8 characters (B = 8)
     (?:\r\n.*){2,4}  # 2 to 4 entire lines (D = 2, C = 4+1-2)
     \z               # end of the string
    )                 # End of lookahead assertion
    .{1,3}            # Match 1-3 characters (A = 3)", 
    RegexOptions.IgnorePatternWhitespace);

In the text

line1 blah 1
line2 blah 2
line3 blah 3
line4 blah 4
line5 blah 5
line6 blah 6

this will match

2 b
3 b
4 b

(2 b because it is three characters (A = 3), starting at the 8th-to-last character (B = 8) of the fifth-to-last line (C + D = 5), and so on.)
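The fixed numbers in this pattern follow from the parameters: the line quantifier runs from D to D + C - 1. A small Python sketch can make that explicit. It assumes LF line endings, no trailing newline, and `\Z` instead of `\z`; `case4_pattern` is a hypothetical helper name, not part of the answer.

```python
import re


def case4_pattern(a: int, b: int, c: int, d: int) -> re.Pattern:
    """Build a basic-case-4 pattern for LF-terminated text.

    Matches up to `a` characters starting `b` characters before the end of a
    line, on the `c` lines that precede the last `d` lines of the string.
    """
    return re.compile(
        r"(?=.{%d}(?:\n.*){%d,%d}\Z).{1,%d}" % (b, d, d + c - 1, a)
    )


TEXT = "\n".join(f"line{i} blah {i}" for i in range(1, 7))

# A = 3, B = 8, C = 3, D = 2 reproduces the quantifier {2,4} used above.
print(case4_pattern(3, 8, 3, 2).pattern)
print(case4_pattern(3, 8, 3, 2).findall(TEXT))  # ['2 b', '3 b', '4 b']
```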

score 0 · Accepted Answer

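Why don't you just do something like this:

// Assuming the text has been read into a string named sourceString
string[] SplitString = sourceString.Split(new string[] { "\r\n", "\n" }, StringSplitOptions.None); // splits on both CRLF and LF
string[] NewStrings = new string[Math.Min(M, SplitString.Length)];
for (int i = 0; i < NewStrings.Length; i++) {
    // Or Substring(N) to take everything after the first N characters, depending on what you need
    NewStrings[i] = SplitString[i].Substring(0, Math.Min(N, SplitString[i].Length));
}

You don't need RegEx, and you don't need LINQ.

Well, I re-read the beginning of your question; you could simply parameterize the start and end of the for loop and of the Substring call to get exactly what you need.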

score 0 · Accepted Answer

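Please excuse me on two points:

1. I propose solutions that are not fully regex-based. I know, I read that you need pure regex solutions. But I got into this interesting problem and quickly concluded that using regexes for it over-complicates things. I was not able to answer with pure regex solutions. I found the following ones and I show them here; maybe they can give you some ideas.
2. I don't know C# or .NET, only Python. Since regexes are nearly the same in all languages, I thought I would answer with regexes only, which is why I started working on the problem. Now I show my solutions in Python all the same, because I think they are easy to understand anyway.

I think it is very difficult to capture all the wanted occurrences of letters in a text with a single regex, because finding several letters in several lines seems to me to be a problem of finding nested matches within matches (or maybe I am not skilled enough with regexes).

So I thought it better to first search for all the occurrences of letters in all the lines and put them in a list, and then select the wanted occurrences by slicing the list.

For the search for letters within a line, a regex seemed fine to me at first. Hence the solution with the function selectRE().

Afterwards, I realized that selecting letters in a line is the same as slicing the line at the appropriate indexes, which is the same as slicing a list. Hence the function select().

I give the two solutions together, so that you can verify that the two functions return equal results.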

import re

def selectRE(a,which_chars,b,x,which_lines,y,ch):
    ch = ch[:-1] if ch[-1]=='\n' else ch # drop the final newline to obtain an exact number of lines
    NL = ch.count('\n') +1 # number of lines

    def pat(a,which_chars,b):
        # build the per-line pattern; the wanted characters are in group 1
        if which_chars=='to':
            p = ('.{'+str(a-1)+'}' if a else '') + '(.{'+str(b-a+1)+'}).*(?:\n|$)'
        elif which_chars=='before':
            p = '.*(.{'+str(a)+'})'+('.{'+str(b)+'}' if b else '')+'(?:\n|$)'
        elif which_chars=='after':
            p = ('.{'+str(b)+'}' if b else '')+'(.{'+str(a)+'}).*(?:\n|$)'
        print(repr(p)) # show the pattern that is about to be compiled
        return re.compile(p)

    # convert the line specification into a slice [x:y] of the list of lines
    if   which_lines=='to'    :  x   = x-1
    elif which_lines=='before':  x,y = NL-x-y,NL-y
    elif which_lines=='after' :  x,y = y,y+x

    return pat(a,which_chars,b).findall(ch)[x:y]


def select(a,which_chars,b,x,which_lines,y,ch):
    ch = ch[:-1] if ch[-1]=='\n' else ch # drop the final newline to obtain an exact number of lines
    NL = ch.count('\n') +1 # number of lines

    # convert the character specification into a slice [a:b] of each line
    if   which_chars=='to'    :  a   = a-1
    elif which_chars=='after' :  a,b = b,a+b

    # convert the line specification into a range x <= i < y of line indexes
    if   which_lines=='to'    :  x   = x-1
    elif which_lines=='before':  x,y = NL-x-y,NL-y
    elif which_lines=='after' :  x,y = y,y+x

    return [ line[len(line)-a-b:len(line)-b] if which_chars=='before' else line[a:b]
             for i,line in enumerate(ch.splitlines()) if x<=i<y ]


ch = '''line1 blah 1
line2 blah 2
line3 blah 3
line4 blah 4
line5 blah 5
line6 blah 6
'''
print(ch + '\n')

print('Characters 3-6 of lines 3-5. A total of 3 matches.')
print(selectRE(3,'to',6,3,'to',5,ch))
print(  select(3,'to',6,3,'to',5,ch))
print()
print('Characters 1-5 of lines 4-5. A total of 2 matches.')
print(selectRE(1,'to',5,4,'to',5,ch))
print(  select(1,'to',5,4,'to',5,ch))
print()
print('7 characters before the last 3 chars of lines 2-6. A total of 5 matches.')
print(selectRE(7,'before',3,2,'to',6,ch))
print(  select(7,'before',3,2,'to',6,ch))
print()
print('6 characters before the 2 last characters of 3 lines before the 3 last lines.')
print(selectRE(6,'before',2,3,'before',3,ch))
print(  select(6,'before',2,3,'before',3,ch))
print()
print('4 last characters of 2 lines before 1 last line. A total of 2 matches.')
print(selectRE(4,'before',0,2,'before',1,ch))
print(  select(4,'before',0,2,'before',1,ch))
print()
print('last 1 character of 4 last lines. A total of 4 matches.')
print(selectRE(1,'before',0,4,'before',0,ch))
print(  select(1,'before',0,4,'before',0,ch))
print()
print('7 characters before the last 3 chars of 3 lines after the 2 first lines. A total of 3 matches.')
print(selectRE(7,'before',3,3,'after',2,ch))
print(  select(7,'before',3,3,'after',2,ch))
print()
print('5 characters before the 3 last chars of the 5 first lines')
print(selectRE(5,'before',3,5,'after',0,ch))
print(  select(5,'before',3,5,'after',0,ch))
print()
print('Characters 3-6 of the 4 first lines')
print(selectRE(3,'to',6,4,'after',0,ch))
print(  select(3,'to',6,4,'after',0,ch))
print()
print('9 characters after the 2 first chars of the 3 lines after the 1 first line')
print(selectRE(9,'after',2,3,'after',1,ch))
print(  select(9,'after',2,3,'after',1,ch))

Result

line1 blah 1
line2 blah 2
line3 blah 3
line4 blah 4
line5 blah 5
line6 blah 6


Characters 3-6 of lines 3-5. A total of 3 matches.
'.{2}(.{4}).*(?:\n|$)'
['ne3 ', 'ne4 ', 'ne5 ']
['ne3 ', 'ne4 ', 'ne5 ']

Characters 1-5 of lines 4-5. A total of 2 matches.
'.{0}(.{5}).*(?:\n|$)'
['line4', 'line5']
['line4', 'line5']

7 characters before the last 3 chars of lines 2-6. A total of 5 matches.
'.*(.{7}).{3}(?:\n|$)'
['ne2 bla', 'ne3 bla', 'ne4 bla', 'ne5 bla', 'ne6 bla']
['ne2 bla', 'ne3 bla', 'ne4 bla', 'ne5 bla', 'ne6 bla']

6 characters before the 2 last characters of 3 lines before the 3 last lines.
'.*(.{6}).{2}(?:\n|$)'
['1 blah', '2 blah', '3 blah']
['1 blah', '2 blah', '3 blah']

4 last characters of 2 lines before 1 last line. A total of 2 matches.
'.*(.{4})(?:\n|$)'
['ah 4', 'ah 5']
['ah 4', 'ah 5']

last 1 character of 4 last lines. A total of 4 matches.
'.*(.{1})(?:\n|$)'
['3', '4', '5', '6']
['3', '4', '5', '6']

7 characters before the last 3 chars of 3 lines after the 2 first lines. A total of 3 matches.
'.*(.{7}).{3}(?:\n|$)'
['ne3 bla', 'ne4 bla', 'ne5 bla']
['ne3 bla', 'ne4 bla', 'ne5 bla']

5 characters before the 3 last chars of the 5 first lines
'.*(.{5}).{3}(?:\n|$)'
['1 bla', '2 bla', '3 bla', '4 bla', '5 bla']
['1 bla', '2 bla', '3 bla', '4 bla', '5 bla']

Characters 3-6 of the 4 first lines
'.{2}(.{4}).*(?:\n|$)'
['ne1 ', 'ne2 ', 'ne3 ', 'ne4 ']
['ne1 ', 'ne2 ', 'ne3 ', 'ne4 ']

9 characters after the 2 first chars of the 3 lines after the 1 first line
'.{2}(.{9}).*(?:\n|$)'
['ne2 blah ', 'ne3 blah ', 'ne4 blah ']
['ne2 blah ', 'ne3 blah ', 'ne4 blah ']

Now I will study Tim Pietzcker's tricky solutions.
