linux - Find and delete lines that match a condition

Question

I have three different files, each containing columns of numbers. The files are very large (50,000,000+ lines each).

For example, the data looks like this:

1.2 22.333 10002.3432 223.2111
50.2166 2.873 15402.3432 322.1
.
.
.

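For each file (file1, file2 and file3), I need to do the following:

FILE1: find the lines that contain any number x <= 1000 and delete those lines from file1.

FILE2: find the lines that contain any number x >= 1800 and delete those lines from file2.

FILE3: find the lines that contain any number 1000 <= x <= 1800 and delete those lines from file3.

I don't know enough about regex to work out how to do this quickly. Any help is greatly appreciated.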

score 6 · Accepted Answer

As others have mentioned in the comments, a regular expression is not ideal in this case.

Here is one way using awk:

awk '{for (i=1;i<=NF;i++) {if ($i<=1000) next}; if (NF) print}' file1 > new1

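This parses file1 and suppresses any line that contains a number <= 1000, as well as blank lines. The output is then redirected to a new file, new1.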
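The question asks for the lines to be removed from the original file, whereas the awk command in this answer writes its result to a separate file. One possible way to bridge that gap, sketched below, is to write to a temporary file and move it over the original only if awk succeeded. The function name strip_le1000_in_place and the ".tmp" suffix are assumptions made for illustration, not part of the original answer.

```shell
# Hypothetical helper: rewrite the file named by "$1" without the lines
# that contain a number <= 1000 (blank lines are dropped as well).
# The ".tmp" suffix is an arbitrary choice for this sketch.
strip_le1000_in_place() {
    awk '{for (i=1;i<=NF;i++) {if ($i<=1000) next}; if (NF) print}' "$1" > "$1.tmp" &&
        mv "$1.tmp" "$1"    # only replace the original if awk exited cleanly
}

# Usage, assuming file1 exists in the current directory:
# strip_le1000_in_place file1
```

With files of this size, note that this approach needs enough free disk space to hold a second copy while the command runs.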

For file2 and file3, just change the condition in the if statement to match your requirements.
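As a rough sketch of what those changes might look like side by side, the three conditions can be put into one awk program and selected with a variable. The function name filter_lines, the rule names (le1000, ge1800, between) and the output names new2 and new3 are all made up for this example. Separate one-liners with a single hard-coded condition will probably run somewhat faster on 50,000,000+ lines, since this version also compares the rule name for every field.

```shell
# Hypothetical helper: print the lines of file "$2" that survive the rule
# named by "$1". The rule names are invented for this sketch.
filter_lines() {
    awk -v rule="$1" '
    {
        for (i = 1; i <= NF; i++) {
            if (rule == "le1000"  && $i <= 1000)               next  # file1 rule
            if (rule == "ge1800"  && $i >= 1800)               next  # file2 rule
            if (rule == "between" && $i >= 1000 && $i <= 1800) next  # file3 rule
        }
        if (NF) print    # keep non-empty lines that passed every check
    }' "$2"
}

# Usage, assuming file1, file2 and file3 exist in the current directory:
# filter_lines le1000  file1 > new1
# filter_lines ge1800  file2 > new2
# filter_lines between file3 > new3
```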

Here is a quick explanation:

         This is repeated for each line in the input file
                                |
      -------------------------------------------------------
     /                                                       \
awk '{for (i=1;i<=NF;i++) {if ($i<=1000) next}; if (NF) print}'
      ------------------   ------------------   -------------
             |                     |                  |
     for each field/column         |                  |
                                   |                  |
                      If condition is met, skip       |
                             this line                |
                                                      |
                                          otherwise, if the line is
                                          not empty (number of fields != 0)
                                          print out the whole line.

score 5 · Accepted Answer

Given the input file "sample":

500 500 500
1000 1000 1000
2000 2000 2000
3000 3000 3000

Stripping x <= 1000:

$ awk '{ for (i=1; i<=NF; i++) { if ($i <= 1000) next } print }' < sample
2000 2000 2000
3000 3000 3000

Stripping x >= 1800:

$ awk '{ for (i=1; i<=NF; i++) { if ($i >= 1800) next } print }' < sample
500 500 500
1000 1000 1000

Stripping 1000 <= x <= 1800:

$ awk '{ for (i=1; i<=NF; i++) { if (1000 <= $i && $i <= 1800) next } print }' < sample
500 500 500
2000 2000 2000
3000 3000 3000

score 3 · Accepted Answer

Here is a fairly short Perl script that produces the output for your FILE3:

#!/usr/bin/perl

use warnings;
use strict;

our $lower = 1000.0;
our $upper = 1800.0;

OUTER: while (<>) {
    # Skip the line if any field lies in [$lower, $upper], bounds included.
    $_ >= $lower && $_ <= $upper and next OUTER for /(\S+)/g;
    print;
}

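You can adapt it for FILE1 and FILE2.

(For better or worse, my script uses basic Perl idioms, so despite being terse it is nearly unreadable if you don't know Perl. Still, this is how it is done in Perl, a scripting language that, one suspects, you would enjoy learning.)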

score 0 · Accepted Answer

Something like the following script should work for you.

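#!/usr/bin/perl
use strict;
use warnings;

LINE: while (my $line = <>) {
    foreach my $col (split ' ', $line) {    # for each column
        next LINE if $col <= 1000;          # file1 rule: drop the whole line
        # add other conditions here for the other files
    }
    print $line;    # reached only if no column matched
}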

Edit: made the code more efficient, thanks to TLP.
