0

I wrote a Perl script that reads a file containing numbers, one per line. I want to remove the duplicates and save the new list to a file. Here is my script:

use strict;

my $arg = "<abs path to>\\list.txt";
open (FH, "$arg") or die "\nError trying to open the file $arg : $!";
print "Opened File : $arg\n";
my $line = "";
my @lines = <FH>;
close FH;
my $temp;
my $count = 0;
my $check = 0;
my @list;
my $flag;

for $line (@lines)
{
    $count += 1;
    $check = $count;
    $flag = 1;
    for my $next (@lines)
    {
        $check -= 1;
        if($check < 0)
        {
            if ($line == $next)
            {
                $flag = 0;
            }
        }
    }

    if($flag == 1)
    {
        push (@list, $line);
    }
}

my $newarg = "<abs path to>\\new_list.txt";
open (FWH, ">>$newarg") or die "\nError trying to open the file $newarg for writing : $!";
my $size = @list;
print FWH "\n\n*** Size = $size ***\n\n";
for my $line (@list)
{
    print FWH "$line";
}

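I'm a C++ person trying to learn Perl, so could you suggest some Perl APIs that might reduce the size of the script? I want the script to be readable and easy to understand quickly, hence the spacing. Thank you.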

4

4 Answers

4

So you have a file of numbers and you want to remove duplicates from it while preserving order? This is a one-liner in Perl.

perl -ne 'print unless $seen{$_}++' file > newfile

Or:

# saves original in file.bak
perl -i.bak -ne 'print unless $seen{$_}++' file
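If the one-liner looks cryptic, here is a rough sketch of the same idea written out longhand. The sub name dedupe_lines and the sample data are made up for illustration; the only real logic is the $seen{...}++ test.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the input lines with later duplicates dropped,
# keeping the first occurrence of each line in its original position.
sub dedupe_lines {
    my @lines = @_;
    my %seen;
    my @unique;
    for my $line (@lines) {
        # $seen{$line}++ yields the old count (false the first time),
        # so a line is kept only on its first appearance.
        next if $seen{$line}++;
        push @unique, $line;
    }
    return @unique;
}

# Sample data, invented for this example.
print dedupe_lines("3\n", "1\n", "3\n", "2\n", "1\n");
```

The post-increment is what makes this work in one expression: it tests the old value and records the sighting at the same time.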

If you have lines that contain other than a single number, or if you want to print out some stats, or if you want better argument handling, or if you've noticed that this doesn't de-dupe numbers that have differing whitespace, then go ahead and change this appropriately. For instance:

# whitespace/non-numbers tolerant
perl -i.bak -ne 'if (/^\s*(\d+)\s*$/) { print unless $seen{$1}++ } else { print }' file

As a script, the key logic is exactly the same:

#! /usr/bin/env perl
use common::sense;
use autodie;

my $silent;
$silent = shift if (@ARGV > 0 and $ARGV[0] eq '-s');
die "usage: $0 [-s] src dest\n" unless @ARGV == 2;

open my $fi, '<', shift;
open my $fo, '>', shift;

my %seen;
while (<$fi>) {
  if (/^\s* (\d+) \s*$/x) {
    print {$fo} $_ unless $seen{$1}++;
    next;
  }
  print {$fo} $_;
}

unless ($silent) {
  say '-- de-dup stats --';
  say '-- $count $number --';   # single quotes: a literal column header
  for (sort { $a <=> $b } keys %seen) {
    say "$seen{$_} $_";
  }
}

EDIT: heh, I didn't even consider the case where the duplicates are all adjacent. Here there's no need for a hash:

perl -ne 'print unless defined $last and $_ == $last; $last = $_' file > newfile
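Spelled out as a script, the adjacent-duplicates idea might look like the sketch below. squeeze_adjacent is a made-up name and the sample numbers are invented. It assumes the values are plain numbers, since it compares with ==.

```perl
use strict;
use warnings;

# Hypothetical helper: drops a number only when it equals the number
# directly before it, much like uniq(1) does for sorted input.
sub squeeze_adjacent {
    my @numbers = @_;
    my @kept;
    my $last;    # stays undef until the first number has been seen
    for my $number (@numbers) {
        push @kept, $number unless defined $last && $number == $last;
        $last = $number;
    }
    return @kept;
}

# Sample data, invented for this example.
print "$_\n" for squeeze_adjacent(1, 1, 2, 2, 2, 1);
```

Note that the trailing 1 survives, because it is not next to the earlier ones. This approach only suits input where the duplicates sit together, for example a sorted file.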
answered 2013-05-28T14:40:57.863
4

There's not much to add about your coding style; just read the comments:

my $arg = "<abs path to>\\list.txt";

# Use lexical file handles and 3 argument form of open:
open my $FH, '<', $arg or die "\nError trying to open the file $arg : $!";
print "Opened File : $arg\n";

my @lines = <$FH>;
close $FH;

# Define each variable in the tightest scope possible.
my $count = 0;
my @list;

for my $line (@lines)
{
    $count += 1;
    my $check = $count;
    my $flag = 1;
    for my $next (@lines)
    {
        $check -= 1;
        if($check < 0)
        {
            if ($line == $next)
            {
                $flag = 0;
            }
        }
    }

    if ($flag == 1)
    {
        push @list, $line;
    }
}

my $newarg = "<abs path to>\\new_list.txt";
open my $FWH, '>>', $newarg or die "\nError trying to open the file $newarg for writing : $!";
my $size = @list;
print $FWH "\n\n*** Size = $size ***\n\n";
for my $line (@list)
{
    # Double quotes not needed if there is nothing to interpolate.
    print $FWH $line;
}
# You forgot to close the file. For output files, this is important.
close $FWH or die "\nCannot close $newarg: $!";

However, this is how I would implement your algorithm:

#!/usr/bin/perl
use warnings;
use strict;

my $input_file  = 'PATH/TO/FILE.TXT';
my $output_file = "$input_file.out";

open my $IN,  '<', $input_file  or die "Cannot open $input_file: $!\n";
open my $OUT, '>', $output_file or die "Cannot open $output_file: $!\n";

my $previous = 'inf';
while (my $line = <$IN>) {
    print $OUT $line if $previous != $line;
    $previous = $line;
}

close $OUT;
answered 2013-05-28T14:15:03.563
3

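Whenever you have to keep track of something, think hash. Hashes have a couple of very nice properties:

  • A key can exist only once: Imagine storing all of your numbers in a hash keyed by the number. The list of keys contains all of your numbers, with no duplicates.
  • Fast key lookup: Suppose you store your numbers in a hash, again keyed by the number. Have you seen that number before? Just check whether that key exists. Quick and easy.

Here's a quick rework.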
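A small sketch of those two properties, using invented sample numbers (the variable names are only for illustration):

```perl
use strict;
use warnings;

# Sample numbers, invented for this example.
my @numbers = (7, 3, 7, 42, 3);

# Property 1: a key can exist only once, so storing a number a second
# time simply overwrites the earlier entry.
my %number_hash;
$number_hash{$_} = 1 for @numbers;
my @distinct = sort { $a <=> $b } keys %number_hash;
print "distinct: @distinct\n";

# Property 2: checking for a key is a direct lookup, with no scan of a list.
for my $candidate (42, 5) {
    my $status = exists $number_hash{$candidate} ? 'seen' : 'not seen';
    print "$candidate: $status\n";
}
```

Compare this with the nested loops in the question, where every number is checked against every other number.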

#! /usr/bin/env perl
use strict;
use feature qw(say);
use warnings;
use autodie;

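Notice that I use warnings as well as use strict. I tell people that use strict will catch about 90% of their errors. Well, use warnings catches another 9.99%. Warnings cover things like trying to print an undefined variable, or questionable syntax that may get you into trouble.

use feature qw(say); lets you use say instead of print. With say, the newline is included, so you don't have to keep adding \n. It doesn't sound like much, but it's nice. use autodie does things like automatically killing your program if a file can't be opened. It turns Perl into an exception-based language. That way, if you forget to test something, your program will let you know.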
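To make the autodie behaviour concrete, here is a minimal sketch. The path is a made-up one that is assumed not to exist, and the exact wording of the error message will depend on your system.

```perl
use strict;
use warnings;
use feature qw(say);
use autodie;

say 'say appends the newline for you';

# Hypothetical path, assumed not to exist on this machine.
my $missing = '/no/such/dir/numbers.txt';

# With autodie, a failed open throws an exception instead of returning
# false, so it can be caught with eval (or left to end the program).
my $opened = eval {
    open my $fh, '<', $missing;
    close $fh;
    1;
};
my $error = $@;

say "open threw an exception: $error" unless $opened;
```

Without the eval, the program would stop at the open with the same message, which is usually what you want in a short script.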

use constant {
    FILE         => '/path/to/file',
    OUTPUT       => '/path/to/output/file',
};

When you need something that doesn't change, you should use a constant.
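One thing that tends to surprise people coming from C++: these constants have no sigil, so they don't interpolate inside double quotes. A short sketch, reusing the placeholder paths from above:

```perl
use strict;
use warnings;

# Placeholder paths, as in the constant block above.
use constant {
    FILE   => '/path/to/file',
    OUTPUT => '/path/to/output/file',
};

# Constants behave like subs without a sigil, so "FILE" inside double
# quotes is just the literal text. Pass them as list items instead.
print 'Reading from ', FILE, ' and writing to ', OUTPUT, "\n";

# Interpolation workaround, if you really want it inside a string.
print "Reading from @{[ FILE ]}\n";
```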

open my $numfile_fh, "<", FILE;  #No need for die
open my $output_fh, ">", OUTPUT;
my %number_hash;
while ( my $number = <$numfile_fh> ) {
    chomp $number;   #Always chomp after you read
    if ( not exists $number_hash{$number} ) {
        $number_hash{$number} = 1;
        say $output_fh "$number";
    }
}
close $numfile_fh;
close $output_fh;

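I read one number at a time, but instead of simply writing it to the file, I check my %number_hash to see whether I've already seen that number. If I haven't, I store it in %number_hash and print it. The logic could also be written like this: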

while ( my $number = <$numfile_fh> ) {
    chomp $number;   #Always chomp after you read
    next if exists $number_hash{$number};

    $number_hash{$number} = 1;
    say $output_fh "$number";
}

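Some would say this is a better way to write the loop logic. In this style, you eliminate the exceptions (duplicate numbers) first and then handle the default case (printing the number you read and saving it in the hash).

Note that this doesn't actually change the order of your list. You read in a number and, as long as it isn't a duplicate, print it in the order it was read. If you want to reorder the numbers so that they're sorted, use two loops:

while ( my $number = <$numfile_fh> ) {
    chomp $number;   #Always chomp after you read
    $number_hash{$number} = 1;
}

# Sort numerically; a plain "sort keys" would compare the numbers as strings
for my $number ( sort { $a <=> $b } keys %number_hash ) {
    say $output_fh "$number";
}

Note that I don't bother testing whether the number is already in the hash. There's no need, since a hash can only have one entry per key anyway.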
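One detail worth keeping in mind when sorting hash keys that are numbers: sort without a comparison block compares them as strings. A quick sketch with invented sample keys:

```perl
use strict;
use warnings;

# Sample keys, invented for this example.
my %number_hash = map { $_ => 1 } (10, 9, 100, 2);

my @as_strings = sort keys %number_hash;                 # string comparison
my @as_numbers = sort { $a <=> $b } keys %number_hash;   # numeric comparison

print "string order:  @as_strings\n";   # 10 100 2 9
print "numeric order: @as_numbers\n";   # 2 9 10 100
```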

answered 2013-05-28T14:55:48.340
2

Why not simply use another tool, such as awk:

awk '!_[$0]++' your_file

Perl also has a utility for getting the unique elements of an array:

use List::MoreUtils qw/ uniq /;
my @unique = uniq @lines;
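As far as I know, newer versions of the core List::Util module (1.45 and later) also export uniq and uniqnum, so a separate CPAN install may not be needed. The sketch below assumes such a version is available; the sample data is invented.

```perl
use strict;
use warnings;
use List::Util 1.45 qw(uniq uniqnum);   # assumes List::Util 1.45 or newer

# Sample lines, invented for this example.
my @lines  = ("3\n", "1\n", "3\n", "2\n", "1\n");
my @unique = uniq @lines;    # string comparison, first occurrence kept
print @unique;

# uniqnum compares numerically, so 1, '1.0' and '01' count as one number.
my @numbers = uniqnum(1, '1.0', '01', 2);
print scalar(@numbers), " distinct numbers\n";
```

Both functions keep the first occurrence of each value and preserve the original order.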

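If you don't want to use the utility above, you can use this approach:

my %seen;
my @unique = grep { ! $seen{$_}++ } @lines;

Or you can simply use the function below to get the unique elements (note that it does not preserve the original order, because it returns hash keys):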

sub uniq {
    return keys %{{ map { $_ => 1 } @_ }};
}

Call it as: uniq(@myarray);

answered 2013-05-28T13:54:41.843