regex - Can anyone suggest a regex pattern that matches 4 consecutive lines of text?

Question

I am trying to parse a large data file. The file contains groups of 3 or 4 lines of data, separated by blank lines. For example:

Data Group One Name
Data Group One Datum 1
Data Group One Datum 2
Data Group One Datum 3

Data Group Two Name
Data Group Two Datum 1
Data Group Two Datum 2

Data Group Three Name
Data Group Three Datum 1
Data Group Three Datum 2
Data Group Three Datum 3

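I am looking for a quick way to extract every data group that has 4 lines (ignoring all 3-line groups). Is there a way to find all 4-line groups in a text file with a regular expression? Or is there another suggested approach (perhaps using awk or sed) for doing this?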

score 1 · Accepted Answer

It's not pretty, but this should work:

/[^\n]+\n[^\n]+\n[^\n]+\n[^\n]+(?!(?:\n[^\n]+))/

or

/(?:[^\n]+\n){3}[^\n]+(?!(?:\n[^\n]+))/

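Basically, you are looking for one or more non-newline characters followed by a newline, then one or more non-newline characters followed by a newline, and so on.

Edit: fixed my regex, which was matching blocks of more than 4 lines. I added a negative lookahead for another line of text.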
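
To try the idea outside a regex tester, here is a minimal Python sketch. It is not the answerer's code: the name `four_line_groups` and the sample text are made up for illustration, and it assumes Unix (`\n`) line endings. It also adds two anchors that the patterns above do not have: a lookbehind so a match can only start on the first line of a block, and `$` so that backtracking cannot shorten the last line to get past the negative lookahead.

```python
import re

# Hypothetical sample in the layout described in the question.
SAMPLE = (
    "Data Group One Name\n"
    "Data Group One Datum 1\n"
    "Data Group One Datum 2\n"
    "Data Group One Datum 3\n"
    "\n"
    "Data Group Two Name\n"
    "Data Group Two Datum 1\n"
    "Data Group Two Datum 2\n"
)

FOUR_LINE_GROUP = re.compile(
    r"(?<![^\n]\n)"     # the previous line, if any, must be empty
    r"^(?:.+\n){3}.+$"  # exactly four non-empty lines
    r"(?!\n.)",         # the next line must not be non-empty
    re.MULTILINE,
)


def four_line_groups(text):
    """Return each blank-line-delimited block that has exactly 4 lines."""
    return FOUR_LINE_GROUP.findall(text)


if __name__ == "__main__":
    for group in four_line_groups(SAMPLE):
        print(group)
        print()
```

With only 3-line and 4-line groups in the file, as in the question, the extra anchors make no difference; they only matter if longer blocks can appear.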

score 1 · Accepted Answer

I haven't tested it, but this should work as an awk script:

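#!/bin/awk -f
# Print every blank-line-separated block that has exactly 4 lines.
BEGIN {
    count = 0
    lines = ""
}
{
    if ($0 != "") {
        # Append the line; join with a newline after the first one.
        lines = (count ? lines "\n" : "") $0
        count++
    } else {
        # A blank line ends the current block.
        if (count == 4) print lines
        count = 0
        lines = ""
    }
}
END {
    # The last block may not be followed by a blank line.
    if (count == 4) print lines
}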
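
The same block-counting approach can be sketched in Python for comparison. This is an illustration rather than part of the answer: the function name `four_line_blocks` and its `size` parameter are hypothetical, and, like the awk script, it treats only completely empty lines as separators.

```python
import re


def four_line_blocks(text, size=4):
    """Split text on runs of empty lines and keep blocks with `size` lines."""
    # Two or more consecutive newlines mean at least one empty line.
    blocks = re.split(r"\n{2,}", text.strip("\n"))
    return [block for block in blocks if len(block.splitlines()) == size]


if __name__ == "__main__":
    demo = "n1\nd1\nd2\nd3\n\nn2\nd1\nd2\n"
    for block in four_line_blocks(demo):
        print(block)
```

Counting lines per block avoids lookarounds entirely, which is why this approach stays correct even if blocks longer than 4 lines show up.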

score 0 · Accepted Answer

You can anchor on the newlines. Pseudocode example:

\n\n 1-or-more-characters \n 1-or-more-characters \n 1-or-more-characters \n 1-or-more-characters \n\n
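
One possible way to turn that pseudocode into a working pattern is sketched below in Python. Two details are assumptions added here, not part of the answer: the text is padded with blank lines so the first and last groups also have a separator on both sides, and the trailing `\n\n` is a lookahead so that it is not consumed and can still serve as the leading separator of the next group. The name `find_groups` is hypothetical.

```python
import re

# "\n\n", then four non-empty lines, then "\n\n" (checked but not consumed).
PATTERN = re.compile(r"\n\n((?:[^\n]+\n){3}[^\n]+)(?=\n\n)")


def find_groups(text):
    """Return the 4-line groups found between blank-line separators."""
    # Pad both ends so groups at the file boundaries are delimited too.
    padded = "\n\n" + text.strip("\n") + "\n\n"
    return PATTERN.findall(padded)


if __name__ == "__main__":
    demo = "n1\nd1\nd2\nd3\n\nn2\nd1\nd2\n\nn3\nd1\nd2\nd3\n"
    for group in find_groups(demo):
        print(group)
        print()
```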

score 0 · Accepted Answer

(?:.+\n){1,3}

This matches runs of 1, 2, or 3 lines.

It is a greedy match.

If you need 3 or 4 lines, you can use:

(?:.+\n){3,4}

Or you can use:

(?:[^\n]+\n){3,4}

I tested it at https://regex101.com/
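
Note that `{3,4}` matches the 3-line groups as well, while the question asks to skip them. A short Python sketch of one way to filter the matches afterwards follows; the name `groups_of_four` is hypothetical, and the filter assumes, as the question states, that every group has either 3 or 4 lines.

```python
import re

# The pattern from this answer: three or four non-empty lines in a row.
THREE_OR_FOUR = re.compile(r"(?:.+\n){3,4}")


def groups_of_four(text):
    """Keep only the matches that span exactly four lines."""
    # The pattern needs every line, including the last, to end in a newline.
    if not text.endswith("\n"):
        text += "\n"
    return [m for m in THREE_OR_FOUR.findall(text) if m.count("\n") == 4]


if __name__ == "__main__":
    demo = "n1\nd1\nd2\nd3\n\nn2\nd1\nd2\n"
    print(THREE_OR_FOUR.findall(demo))  # both groups match the raw pattern
    print(groups_of_four(demo))         # only the 4-line group remains
```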
