perl - Finding common entries in different text files

Question

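I'm new to Perl. I have eight text files, each with more than five thousand lines. I want to write a Perl script that finds the entries (records) that appear in the first five files but not in the last three. Say the files are (A, B, C, D, E, F, G, H); I want the entries that are found in A to E but not in F to H.

Can someone advise on how to write the code for this job?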

score 5 · Accepted Answer

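If I understand correctly, you need to:

Make a list of all the items in A-E (call it list 1)
Make another list of the items in F-H (list 2)
Find all the items in list 1 that are not in list 2.

Rather than using two lists, you'll use two hashes.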

use strict;
use warnings;

# Two sets of files to be compared.
my @Set1 = qw(A B C D E);
my @Set2 = qw(F G H);

# Get all the items out of each set into hash references
my $items_in_set1 = get_items(@Set1);
my $items_in_set2 = get_items(@Set2);

my %unique_to_set1;
for my $item (keys %$items_in_set1) {
    # If an item in set 1 isn't in set 2, remember it.
    $unique_to_set1{$item}++ if !$items_in_set2->{$item};
}

# Print them out, one per line, in sorted order
print "$_\n" for sort keys %unique_to_set1;

sub get_items {
    my @files = @_;

    my %items;
    for my $file (@files) {
        open my $fh, "<", $file or die "Can't open $file: $!";
        while( my $item = <$fh>) {
            chomp $item;
            $items{$item}++;
        }
    }

    return \%items;
}

If it's a one-off, you can do it in the shell.

cat A B C D E | sort | uniq > set1
cat F G H | sort | uniq > set2
comm -23 set1 set2

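cat A B C D E concatenates the files into a single stream. That stream is passed to sort and then to uniq, which removes the duplicates (uniq only collapses adjacent duplicate lines, so it doesn't work well unless the input is sorted). The result is written to the file set1. The same is done for the second set. comm is then run on the two set files to compare them; with -23 it suppresses the lines unique to set2 and the lines common to both, showing only the lines unique to set1.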
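To check the pipeline on something small before running it on the real files, here is a minimal sketch. The sample contents and the scratch directory are made up for illustration and are not from the question. It uses sort -u, which should behave the same as sort | uniq here, and sets LC_ALL=C (an assumption about your environment) so that sort and comm agree on ordering.

```shell
#!/bin/sh
set -eu

# Work in a scratch directory so nothing real gets overwritten.
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
cd "$dir"

# Hypothetical sample data: one entry per line.
printf 'apple\nbanana\n' > A
printf 'banana\ncherry\n' > B
printf 'date\n' > C
printf 'apple\n' > D
printf 'elderberry\n' > E
printf 'banana\n' > F
printf 'date\n' > G
printf 'fig\n' > H

# sort -u sorts and drops duplicate lines in one step.
LC_ALL=C sort -u A B C D E > set1
LC_ALL=C sort -u F G H > set2

# -2 hides lines found only in set2, -3 hides lines found in both,
# leaving the lines found only in set1.
result=$(LC_ALL=C comm -23 set1 set2)
printf '%s\n' "$result"
```

With this sample data, banana and date are dropped because they also appear in F and G, so the output is apple, cherry and elderberry. Note that comm compares whole lines, so trailing whitespace or Windows line endings in the real files would make otherwise identical entries look different.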
