perl - Unable to get rid of identical records

Question

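I have an input file that contains a lot of redundant records. I tried to write a program to remove them, but some duplicates still remain and I can't figure out what is wrong with it.

ARGV[0] is the input file with the redundant records

ARGV[1] is the output file, which should contain the input's records without the duplicates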

open(Input,"<./$ARGV[0]");
open(Output,">./$ARGV[1]");

while( eof(Input) !=1)
{
    push(@Records,readline(*Input));
}
close Input;

# Solution 2
for($i=0;$i<$#Records;$i++)
{
    for($j=$i+1;$j<$#Records;$j++)
    {
        if($Records[$i] eq $Records[$j])
        {
            $Records[$j] = undef;
        }
    }
}

@Records = grep defined,@Records;

=begin
# Solution 1 have some problems
for($i=0;$i<$#Records;$i++)
{
    for($j=$i+1;$j<$#Records;$j++)
    {
        if($Records[$i] eq $Records[$j])
        {
            splice @Records,$j,1;
            $j = $j-1;  
        }
    }
}
=end
=cut

foreach $Each(@Records)
{
    print Output $Each;
}
close Output;

Thanks

score 2 · Accepted Answer

Here is a more modern Perl solution:

open(my $fh_input, '<', $ARGV[0]) or die $!;
open(my $fh_output, '>', $ARGV[1]) or die $!;
my %records = ();

while( my $line = <$fh_input> )
{
   $records{$line} = 1;
}

foreach my $record(keys %records)
{
    print $fh_output $record;
}

close $fh_input;
close $fh_output;

As you can see, I use a hash to avoid duplicates.
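
One caveat, offered as a side note rather than as part of the answer above: `keys` returns hash keys in no guaranteed order, so the output file may list the lines in a different order from the input. The following is a minimal sketch with made-up sample data; the keys are sorted only to make the result repeatable.

```perl
use strict;
use warnings;

# Hypothetical sample lines standing in for the contents of the input file
my @lines = ("pear\n", "apple\n", "pear\n", "fig\n", "apple\n");

my %records;
$records{$_} = 1 for @lines;   # duplicate lines collapse onto the same key

# keys %records has no guaranteed order; sort makes the output repeatable
my @unique = sort keys %records;

print @unique;
```

If the original line order matters, a hash that only records which lines have been seen, while printing each line as it is read, is probably the better fit.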

score 1 · Accepted Answer

You can simply use uniq().

use List::Util qw(uniq);   # uniq needs List::Util 1.45+; List::MoreUtils also exports uniq

my @records;
while( eof(Input) !=1)
{
    push(@records,readline(*Input));
}
close Input;

@records = uniq(@records); ## Unique elements in @records

See its documentation here.
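
As a minimal sketch of what `uniq` does, assuming List::Util 1.45 or later (List::MoreUtils exports a function of the same name); the sample records are made up:

```perl
use strict;
use warnings;
use List::Util qw(uniq);   # assumption: List::Util 1.45 or later is installed

# Hypothetical sample records, each ending in a newline as readline returns them
my @records = ("a\n", "b\n", "a\n", "c\n", "b\n");

# uniq keeps the first occurrence of each value and preserves the original order
my @unique = uniq(@records);

print @unique;
```

Because the comparison is on the whole string, two lines that differ only in trailing whitespace count as different records.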

score 1 · Accepted Answer

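Your "Solution 1" is the closest. Setting an array element to undef does not remove it, and it will produce warning messages if you have warnings enabled, as you should.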

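This solution checks each record at index $j and either removes it with splice if it is a duplicate (which shifts the remaining records down, so the next record to compare is at the same index) or leaves it in place and skips over it by incrementing $j.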
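
To illustrate the difference between assigning undef and calling splice, here is a small sketch with made-up data. It only shows how the element count changes and is not taken from the question's program.

```perl
use strict;
use warnings;

# Hypothetical sample records
my @records = ("a\n", "b\n", "a\n");

# Assigning undef leaves the slot in place: the array still has 3 elements
$records[2] = undef;
my $count_after_undef = scalar @records;

# splice removes the slot and shifts any later elements down by one
splice @records, 2, 1;
my $count_after_splice = scalar @records;

print "after undef:  $count_after_undef\n";
print "after splice: $count_after_splice\n";
```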

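Best practice is to use lexical file handles (like $infh) rather than bareword file handles (like Input). You should also use the three-argument form of open and always check that it succeeded. Here I have used autodie to avoid explicitly checking every open. It will throw an exception if any call to open fails.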

use strict;
use warnings;
use autodie;

my ($infile, $outfile) = @ARGV;

my @records = do {
    open my $infh, '<', $infile;
    <$infh>;
};

for my $i (0..$#records-1) {
    my $j = $i + 1;
    while ($j < @records) {
        if ($records[$j] eq $records[$i]) {
            splice @records, $j, 1;
        }
        else {
            ++$j;
        }
    }
}

open my $outfh, '>', $outfile;
print $outfh $_ for @records;
close $outfh;
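
A detail worth spelling out: `$#records` is the index of the last element, while `@records` in numeric context is the element count. A loop bounded by `$j < $#records`, as in the question's code, therefore never visits the last record, which looks like one reason duplicates can survive there. The sketch below uses made-up data and hypothetical variable names to show which indices each bound visits.

```perl
use strict;
use warnings;

# Hypothetical sample records; the last one duplicates the first
my @records = ("a\n", "b\n", "a\n");

# Bound used in the question: stops before the last index
my @visited_with_last_index;
for (my $j = 0; $j < $#records; $j++) {
    push @visited_with_last_index, $j;
}

# Bound used in the loop above: compares against the element count
my @visited_with_count;
for (my $j = 0; $j < @records; $j++) {
    push @visited_with_count, $j;
}

print 'j < $#records visits: ', join(' ', @visited_with_last_index), "\n";
print 'j < @records  visits: ', join(' ', @visited_with_count), "\n";
```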

An alternative solution using a hash looks like this

use strict;
use warnings;
use autodie;

my ($infile, $outfile) = @ARGV;

open my $infh,  '<', $infile;
open my $outfh, '>', $outfile;

my %seen;

while (<$infh>) {
  print $outfh $_ unless $seen{$_}++;
}
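
The same `%seen` idiom also works on a list already held in memory. The following is a sketch only; the sub name `dedupe` and the sample data are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns its arguments with later duplicates dropped,
# keeping the first occurrence of each value in its original position
sub dedupe {
    my %seen;
    return grep { !$seen{$_}++ } @_;
}

my @records = ("x\n", "y\n", "x\n", "z\n", "y\n");
my @unique  = dedupe(@records);

print @unique;
```

The post-increment returns the old count, so the grep block is true only the first time a given line appears.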
