perl - How can I remove common lines from one of 2 files in Perl?

Question

I have 2 files, a small one and a big one. The small file is a subset of the big one.

For example:

Small file:

solar:1000
alexey:2000

Big file:

andrey:1001
solar:1000
alexander:1003
alexey:2000

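I want to delete all the lines from Big.txt that are also present in Small.txt. In other words, I want to delete the lines in the big file that it has in common with the small file.

So, I wrote a Perl script as shown below: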

#! /usr/bin/perl

use strict;
use warnings;

my ($small, $big, $output) = @ARGV;

open(BIG, "<$big") || die("Couldn't read from the file: $big\n");
my @contents = <BIG>;
close (BIG);

open(SMALL, "<$small") || die ("Couldn't read from the file: $small\n");

while(<SMALL>)
{
    chomp $_;
    @contents = grep !/^\Q$_/, @contents;
}

close(SMALL);

open(OUTPUT, ">>$output") || die ("Couldn't open the file: $output\n");

print OUTPUT @contents;
close(OUTPUT);

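However, this Perl script does not remove the lines in Big.txt that are common to Small.txt.

In this script, I first open the big file and copy its entire contents into the array @contents. Then I iterate over each entry in the small file and check whether it is present in the big file. I filter that line out of the big file's lines and save the result back into the array.

I am not sure why this script does not work. Thanks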

score 4 · Accepted Answer

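Your script does not work because grep uses $_ itself: for the duration of the grep, $_ is set to each element of @contents in turn and hides the value that the while loop put there. In other words, the $_ you use in the regex is not the line read from SMALL in the while block. The name is the same, but while grep runs it holds a different value.

Use a named variable instead (as a rule, never use $_ for any code longer than 1 line, precisely to avoid this type of bug):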

while (my $line=<SMALL>) {
    chomp $line;
    @contents = grep !/^\Q$line/, @contents;
}
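
To see the aliasing in isolation, here is a minimal sketch that uses in-memory arrays instead of files. The sample data comes from the question; the sub names filter_with_topic and filter_with_named are made up for this illustration, and @small is assumed to be chomped already.

```perl
use strict;
use warnings;

# Lines as they would come from the big file (newlines still attached).
my @big   = ("andrey:1001\n", "solar:1000\n", "alexander:1003\n", "alexey:2000\n");
# Lines from the small file, assumed already chomped.
my @small = ("solar:1000", "alexey:2000");

# Buggy variant: inside grep, $_ is the current element of @contents,
# so each element is matched against itself and is always rejected.
sub filter_with_topic {
    my @contents = @big;
    for (@small) {
        @contents = grep !/^\Q$_/, @contents;
    }
    return @contents;
}

# Fixed variant: the loop value lives in $line, which grep does not touch.
sub filter_with_named {
    my @contents = @big;
    for my $line (@small) {
        @contents = grep !/^\Q$line/, @contents;
    }
    return @contents;
}

my @buggy = filter_with_topic();
my @fixed = filter_with_named();

print "with \$_:    ", scalar(@buggy), " line(s) left\n";   # 0
print "with \$line: ", scalar(@fixed), " line(s) left\n";   # 2
print @fixed;
```

Note that the pattern is anchored only at the start of the line, so solar:1000 would also remove a line such as solar:10000. An exact comparison, or a hash lookup on the whole line, avoids that.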

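However, as Oleg pointed out, a more efficient solution is to read the small file's lines into a hash and then process the big file ONCE, checking each line against the hash. I also improved the style a bit; feel free to learn from it and use it in the future: lexical filehandle variables, the 3-arg form of open, and IO error reporting via $!.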

#! /usr/bin/perl

use strict;
use warnings;
use File::Slurp;

my ($small, $big, $output) = @ARGV;

# Load the small file's lines as hash keys.
# chomp them so the keys match the chomped lines looked up below.
chomp(my @small = read_file($small));
my %small = map { ($_ => 1) } @small;

# Separate names for the filehandles so they do not mask the file names.
open(my $big_fh, "<", $big) or die "Can not read $big: Error: $!\n";
open(my $output_fh, ">", $output) or die "Can not write to $output: Error: $!\n";

while (my $line = <$big_fh>) {
    chomp $line;
    next if $small{$line}; # Skip common
    print $output_fh "$line\n";
}

close($big_fh);
close($output_fh);

score 3 · Accepted Answer

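It doesn't work for a couple of reasons. First, the lines in @contents still have their newlines. Second, when you grep, the $_ in !/^\Q$_/ is not set to the last line read from the small file but to each element of the @contents array. That effectively turns it into: for each element of the list, return everything except this element, which leaves you with an empty list at the end.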

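This is not really a good way to do it: you read the whole big file and then reprocess it once per line of the small file. Instead, first read the small file and put every line into a hash. Then read the big file in a while(<>) loop, so you do not waste memory holding all of it. For each line, check whether that key exists in the previously populated hash; if it does, go to the next iteration, otherwise print the line.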
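
The steps above can be sketched with core Perl only. This is an illustration under assumptions, not the answerer's code: the helper name remove_common is hypothetical, and the demo reads from in-memory strings (sample data from the question) so that it runs without any files on disk.

```perl
use strict;
use warnings;

# Hypothetical helper: copy lines from $big_fh to $out_fh, skipping
# every line that also appears in $small_fh.
sub remove_common {
    my ($small_fh, $big_fh, $out_fh) = @_;

    # Step 1: remember every line of the small file as a hash key.
    my %seen;
    while (my $line = <$small_fh>) {
        chomp $line;
        $seen{$line} = 1;
    }

    # Step 2: stream the big file one line at a time.
    while (my $line = <$big_fh>) {
        chomp $line;
        next if exists $seen{$line};    # common line: skip it
        print {$out_fh} "$line\n";
    }
    return;
}

# Demo on in-memory "files".
my $small_text = "solar:1000\nalexey:2000\n";
my $big_text   = "andrey:1001\nsolar:1000\nalexander:1003\nalexey:2000\n";
my $result     = '';

open my $small_fh, '<', \$small_text or die "Cannot open in-memory small file: $!\n";
open my $big_fh,   '<', \$big_text   or die "Cannot open in-memory big file: $!\n";
open my $out_fh,   '>', \$result     or die "Cannot open in-memory output: $!\n";

remove_common($small_fh, $big_fh, $out_fh);
close $_ for $small_fh, $big_fh, $out_fh;

print $result;
```

With real files, the three handles would come from open calls on the paths taken from @ARGV. Only the small file's lines are held in memory, and the big file's line order and any duplicate lines are preserved.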

score 1 · Accepted Answer

Here is a small and efficient solution to your problem:

#!/usr/bin/perl

use strict;
use warnings;

my ($small, $big, $output) = @ARGV;

my %diffx;

open my $bfh, "<", $big or die "Couldn't read from the file $big: $!\n";
# load big file's contents
my @big = <$bfh>;
chomp @big;
# build a lookup table, a structured table for big file
@diffx{@big} = ();
close $bfh or die "$!\n";

open my $sfh, "<", $small or die "Couldn't read from the file $small: $!\n";
my @small = <$sfh>;
chomp @small;
# delete the elements that exist in small file from the lookup table
delete @diffx{@small};
close $sfh;

# print join "\n", keys %diffx;

open my $ofh, ">", $output or die "Couldn't open the file $output for writing: $!\n";
# what is left is unique lines from big file
print $ofh join "\n", keys %diffx;  
close $ofh;

__END__

PS: I learned this trick and many others from the Perl Cookbook, 2nd Edition. Thanks
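
One thing to keep in mind with the hash-slice approach: keys returns the lines in no particular order and collapses duplicate lines, and join adds no newline after the last line. If the original order matters, a small variation keeps the hash slice for the lookup but filters the array instead. This is a sketch with illustrative variable names and the question's sample data, assuming both lists are already chomped.

```perl
use strict;
use warnings;

my @big   = ('andrey:1001', 'solar:1000', 'alexander:1003', 'alexey:2000');
my @small = ('solar:1000', 'alexey:2000');

# Hash slice: every line of the small file becomes a key (values are undef).
my %skip;
@skip{@small} = ();

# Keep the big file's lines, in their original order, that are not keys of %skip.
my @unique = grep { !exists $skip{$_} } @big;

print "$_\n" for @unique;
```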
