bash - Using awk to identify and filter multi-line records

Question

I need to process a large data file that contains multi-line records. Sample input:

1  Name      Dan
1  Title     Professor
1  Address   aaa street
1  City      xxx city
1  State     yyy
1  Phone     123-456-7890
2  Name      Luke
2  Title     Professor
2  Address   bbb street
2  City      xxx city
3  Name      Tom
3  Title     Associate Professor
3  Like      Golf
4  Name
4  Title     Trainer
4  Likes     Running

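Note that the leading integer field is unique and is what identifies a whole record. So the input above contains 4 records, although I don't know in advance how many attribute lines each record may have. I need to:

- identify valid records (a record must have both the "Name" and "Title" fields)
- output the available attributes of each valid record, where "Name", "Title" and "Address" are the fields I need.

Sample output: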

1  Name      Dan
1  Title     Professor
1  Address   aaa street
2  Name      Luke
2  Title     Professor
2  Address   bbb street
3  Name      Tom
3  Title     Associate Professor

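So in the output file, record 4 is dropped because it has no "Name" value (its Name line is empty). Record 3 has no "Address" field, but it is still printed because it is a valid record with both "Name" and "Title".

Can I do this with awk? And if so, how do I use the first "id" field on each line to identify a whole record?

Many thanks to the unix shell scripting experts for helping me! :)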

score 6 · Accepted Answer

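This seems to work. There are many ways to do this, even within awk.

I've spaced it out for readability.

Note that record 3 is not shown, because it lacks the "Address" field, which this script treats as required. Remove the required["Address"] line if only "Name" and "Title" should be mandatory.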

#!/usr/bin/awk -f

BEGIN {
        # Set your required fields here...
        required["Name"]=1;
        required["Title"]=1;
        required["Address"]=1;

        # Count the required fields
        for (i in required) enough++;
}

# Note that this will run on the first record, but only to initialize variables
$1 != last1 {
        if (hits >= enough) {
                printf("%s",output);
        }
        last1=$1; output=""; hits=0;
}

# This appends the current line to a buffer, followed by the record separator (RS)
{ output=output $0 RS }

# Count the required fields; used to determine whether to print the buffer
required[$2] { hits++ }

END {
        # Print the final buffer, since we only print on the next record
        if (hits >= enough) {
                printf("%s",output);
        }
}
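
The script above prints every line of a matching record and also insists on "Address". As a rough sketch of how the same buffering idea could be adapted to the output shown in the question, the variant below keeps two separate sets: required (fields a record must have) and wanted (fields that get printed). The names wanted, seen and flush_record are made up for this illustration, and treating a field with no value (such as "4  Name") as missing is an assumption based on the question's expected output.

```shell
#!/bin/sh
# Sample input from the question, written to a temporary file.
input=$(mktemp)
cat > "$input" <<'EOF'
1  Name      Dan
1  Title     Professor
1  Address   aaa street
1  City      xxx city
1  State     yyy
1  Phone     123-456-7890
2  Name      Luke
2  Title     Professor
2  Address   bbb street
2  City      xxx city
3  Name      Tom
3  Title     Associate Professor
3  Like      Golf
4  Name
4  Title     Trainer
4  Likes     Running
EOF

out=$(awk '
BEGIN {
    # Fields a record must contain to count as valid
    required["Name"] = 1; required["Title"] = 1
    # Fields that are copied to the output (hypothetical second set)
    wanted["Name"] = 1; wanted["Title"] = 1; wanted["Address"] = 1
    for (k in required) enough++
}

# Print the buffered record if it is valid, then reset the state
function flush_record() {
    if (hits >= enough) printf "%s", output
    output = ""; hits = 0
    split("", seen)          # portable way to empty an array
}

# A change in the id field marks the start of a new record
NR > 1 && $1 != last { flush_record() }
{ last = $1 }

# Assumption: a line with no value (fewer than 3 fields) does not count
NF < 3 { next }

# Buffer only the wanted lines
($2 in wanted) { output = output $0 ORS }

# Count each required field once per record
($2 in required) && !($2 in seen) { seen[$2] = 1; hits++ }

END { flush_record() }
' "$input")

printf '%s\n' "$out"
rm -f "$input"
```

Like the original, this relies on all lines of a record being adjacent in the file, and it only ever holds one record in memory.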

score 3 · Accepted Answer

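I'm not good at awk, but I would solve this in Perl. Here is a Perl solution: for each record, it remembers the important lines and whether a Name and a Title were seen. At the end of the record, it prints the record if all conditions are met.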

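#!/usr/bin/perl
use warnings;
use strict;

my ($last, $has_name, $has_title, @record);

# Print the buffered record if it has both a Name and a Title, then reset.
sub flush_record {
    print @record if $has_name and $has_title;
    @record = ();
    ($has_name, $has_title) = (0, 0);
}

while (<DATA>) {
    my ($id, $key, $value) = split;
    next unless defined $key;                           # skip blank lines
    flush_record() if defined $last and $id != $last;   # id changed: new record
    # A field only counts if it has a value ("4  Name" has none).
    $has_name  = 1 if $key eq 'Name'  and defined $value;
    $has_title = 1 if $key eq 'Title' and defined $value;
    push @record, $_ if defined $value and grep { $key eq $_ } qw/Name Address Title/;
    $last = $id;
}
flush_record();    # the last record has no following id to trigger the flush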


__DATA__
1  Name      Dan
1  Title     Professor
1  Address   aaa street
1  City      xxx city
1  State     yyy
1  Phone     123-456-7890
2  Name      Luke
2  Title     Professor
2  Address   bbb street
2  City      xxx city
3  Name      Tom
3  Title     Associate Professor
3  Like      Golf
4  Name
4  Title     Trainer
4  Likes     Running
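
Both solutions above assume that all lines of a record sit next to each other. If that cannot be guaranteed, one possible alternative is a two-pass awk run over the same file. This is only a sketch: the array names has_name and has_title are invented here, the file has to be a regular file (not a pipe) because it is read twice, and a field with no value is again assumed to count as missing.

```shell
#!/bin/sh
# Sample input from the question, written to a temporary file.
input=$(mktemp)
cat > "$input" <<'EOF'
1  Name      Dan
1  Title     Professor
1  Address   aaa street
1  City      xxx city
1  State     yyy
1  Phone     123-456-7890
2  Name      Luke
2  Title     Professor
2  Address   bbb street
2  City      xxx city
3  Name      Tom
3  Title     Associate Professor
3  Like      Golf
4  Name
4  Title     Trainer
4  Likes     Running
EOF

# The same file is named twice on the command line.
out=$(awk '
# Pass 1 (NR == FNR holds only while reading the first file argument):
# remember which ids have a non-empty Name and a non-empty Title.
NR == FNR {
    if (NF >= 3 && $2 == "Name")  has_name[$1]  = 1
    if (NF >= 3 && $2 == "Title") has_title[$1] = 1
    next
}

# Pass 2: print the wanted fields of valid ids, in file order.
($1 in has_name) && ($1 in has_title) && NF >= 3 &&
    ($2 == "Name" || $2 == "Title" || $2 == "Address")
' "$input" "$input")

printf '%s\n' "$out"
rm -f "$input"
```

The trade-off is that the file is read twice, while only the set of ids is kept in memory rather than any record text.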
