perl - Suggestions for improving a Perl script?

Question

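I wrote a Perl script that reads a file containing some numbers, one below the other. I want to remove the duplicates and save the new list to a file. Here is my script: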

use strict;

my $arg = "<abs path to>\\list.txt";
open (FH, "$arg") or die "\nError trying to open the file $arg : $!";
print "Opened File : $arg\n";
my $line = "";
my @lines = <FH>;
close FH;
my $temp;
my $count = 0;
my $check = 0;
my @list;
my $flag;

for $line (@lines)
{
    $count += 1;
    $check = $count;
    $flag = 1;
    for my $next (@lines)
    {
        $check -= 1;
        if($check < 0)
        {
            if ($line == $next)
            {
                $flag = 0;
            }
        }
    }

    if($flag == 1)
    {
        push (@list, $line);
    }
}

my $newarg = "<abs path to>\\new_list.txt";
open (FWH, ">>$newarg") or die "\nError trying to open the file $newarg for writing : $!";
my $size = @list;
print FWH "\n\n*** Size = $size ***\n\n";
for my $line (@list)
{
    print FWH "$line";
}

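I'm a C++ person trying to learn Perl, so could you suggest some Perl APIs that might reduce the size of the script? I want the script to be readable and quick to understand, hence the spacing. Thank you.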

score 4 · Accepted Answer

So you have a file of numbers and you want to remove duplicates from it while preserving order? This is a one-liner in Perl.

perl -ne 'print unless $seen{$_}++' file > newfile

Or:

# saves original in file.bak
perl -i.bak -ne 'print unless $seen{$_}++' file
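For readers coming from C++, the one-liner's idiom may be easier to follow when unpacked. The following is only a sketch of the same idea; the sub name dedupe_lines and the sample data are made up for illustration. The key point is that $seen{$line}++ returns the old count, which is 0 (false) the first time a line appears.

```perl
use strict;
use warnings;

# Return the input lines with later duplicates dropped;
# first occurrences are kept in their original order.
sub dedupe_lines {
    my @lines = @_;
    my %seen;    # line text => number of times seen so far
    my @kept;
    for my $line (@lines) {
        # Post-increment yields the previous count: 0 on first sight,
        # so "unless" lets the line through exactly once.
        push @kept, $line unless $seen{$line}++;
    }
    return @kept;
}

# Illustrative data only.
print dedupe_lines("3\n", "1\n", "3\n", "2\n", "1\n");
```

With -n, Perl wraps the one-liner in a loop over the input lines, so the explicit for loop here plays the role that -n plays on the command line.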

If you have lines that contain other than a single number, or if you want to print out some stats, or if you want better argument handling, or if you've noticed that this doesn't de-dupe numbers that have differing whitespace, then go ahead and change this appropriately. For instance:

# whitespace/non-numbers tolerant
perl -i.bak -ne 'if (/^\s*(\d+)\s*$/) { print unless $seen{$1}++ } else { print }' file

As a script, the key logic is exactly the same:

#! /usr/bin/env perl
use common::sense;
use autodie;

my $silent;
$silent = shift if (@ARGV > 0 and $ARGV[0] eq '-s');
die "usage: $0 [-s] src dest\n" unless @ARGV == 2;

open my $fi, '<', shift;
open my $fo, '>', shift;

my %seen;
while (<$fi>) {
  if (/^\s* (\d+) \s*$/x) {
    print {$fo} $_ unless $seen{$1}++;
    next;
  }
  print {$fo} $_;
}

unless ($silent) {
  say '-- de-dup stats --';
  say '-- $count $number --';  # single quotes: a literal column header
  for (sort { $a <=> $b } keys %seen) {
    say "$seen{$_} $_";
  }
}

EDIT: heh, I didn't even consider the case where the duplicates are all adjacent. Here there's no need for a hash:

perl -ne 'print unless defined($last) and $_ == $last; $last = $_' file > newfile
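The no-hash variant relies on duplicates being adjacent (for example, because the file is already sorted). A small sketch, with a hypothetical sub name dedupe_adjacent and made-up data, shows what happens when they are not:

```perl
use strict;
use warnings;

# Drop a value only when it equals the value immediately before it.
sub dedupe_adjacent {
    my @kept;
    my $last;    # undefined until the first value has been seen
    for my $n (@_) {
        push @kept, $n unless defined($last) && $n == $last;
        $last = $n;
    }
    return @kept;
}

# Adjacent duplicates are all removed.
print join(' ', dedupe_adjacent(1, 1, 2, 2, 2, 3)), "\n";

# Scattered duplicates survive, because no value equals its predecessor.
print join(' ', dedupe_adjacent(1, 2, 1, 3, 2)), "\n";
```

If the input is not grouped like the first call, the hash-based version is the one to use.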

score 4 · Accepted Answer

There isn't much to add about your coding style; just read the comments:

my $arg = "<abs path to>\\list.txt";

# Use lexical file handles and 3 argument form of open:
open my $FH, '<', $arg or die "\nError trying to open the file $arg : $!";
print "Opened File : $arg\n";

my @lines = <$FH>;
close $FH;

# Define each variable in the tightest scope possible.
my $count = 0;
my @list;

for my $line (@lines)
{
    $count += 1;
    my $check = $count;
    my $flag = 1;
    for my $next (@lines)
    {
        $check -= 1;
        if($check < 0)
        {
            if ($line == $next)
            {
                $flag = 0;
            }
        }
    }

    if ($flag == 1)
    {
        push @list, $line;
    }
}

my $newarg = "<abs path to>\\new_list.txt";
open my $FWH, '>>', $newarg or die "\nError trying to open the file $newarg for writing : $!";
my $size = @list;
print $FWH "\n\n*** Size = $size ***\n\n";
for my $line (@list)
{
    # Double quotes not needed if there is nothing to interpolate.
    print $FWH $line;
}
# You forgot to close the file. For output files, this is important.
close $FWH or die "\nCannot close $newarg: $!";

However, this is how I would implement your algorithm:

#!/usr/bin/perl
use warnings;
use strict;

my $input_file  = 'PATH/TO/FILE.TXT';
my $output_file = "$input_file.out";

open my $IN,  '<', $input_file  or die "Cannot open $input_file: $!\n";
open my $OUT, '>', $output_file or die "Cannot open $output_file: $!\n";

my $previous = 'inf';
while (my $line = <$IN>) {
    print $OUT $line if $previous != $line;
    $previous = $line;
}

close $OUT;

score 3 · Accepted Answer

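Whenever you have to keep track of something, think hash. Hashes have a couple of very nice properties:

Only one of each key can exist: imagine storing all your numbers in a hash keyed by the number itself. The list of keys then contains all your numbers, with no duplicates.
Fast key lookup: suppose you store the numbers in a hash, again keyed by the number. Have you seen that number before? Just check whether the key exists. Fast and simple.

Here's a quick rework.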
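Both properties can be seen in a few lines. This is a minimal sketch with made-up numbers; the hash name %number_hash simply mirrors the one used in the rework.

```perl
use strict;
use warnings;

my %number_hash;

# Storing 7 and 3 a second time just overwrites the existing keys.
$number_hash{$_} = 1 for qw(7 3 7 9 3);

# keys in scalar context gives the number of distinct keys.
my $distinct = keys %number_hash;
print "distinct numbers: $distinct\n";

# exists answers "have I seen this number?" without scanning a list.
for my $candidate (9, 4) {
    my $answer = exists $number_hash{$candidate} ? 'yes' : 'no';
    print "seen $candidate? $answer\n";
}
```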

#! /usr/bin/env perl
use strict;
use feature qw(say);
use warnings;
use autodie;

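Notice that I use warnings as well as use strict. I tell people that use strict will catch about 90% of their errors. Well, use warnings will catch another 9.99%. Warnings cover things like trying to print an undefined variable, or dubious syntax that is likely to get you into trouble.

use feature qw(say); lets you use say instead of print. With say, the NL (newline) is included, so you don't have to keep writing \n. It doesn't sound like much, but it's nice. use autodie does things like automatically killing your program if a file can't be opened. It turns Perl into an exception-based language. That way, if you forget to test something, your program will let you know.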
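As a standalone aside (not part of the script being built here), the exception behaviour of autodie can be sketched as follows. The path is hypothetical and chosen so that the open fails; eval is used only to show that the failure arrives as an exception rather than a false return value.

```perl
use strict;
use warnings;
use feature qw(say);
use autodie;

# Hypothetical path, assumed not to exist.
my $missing = '/no/such/dir/numbers.txt';

my $ok = eval {
    open my $fh, '<', $missing;    # autodie throws here; no "or die" needed
    close $fh;
    1;
};
my $error = $@;    # capture right away, before anything else can reset it

my $message = $ok ? 'opened' : "open failed: $error";
say $message;      # say appends the newline itself
```

Without the eval, the same failed open would simply terminate the program with that message, which is the behaviour described above.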

use constant {
    FILE         => '/path/to/file',
    OUTPUT       => '/path/to/output/file',
};

When you need something that doesn't change, you should use a constant.

open my $numfile_fh, "<", FILE;  #No need for die
open my $output_fh, ">", OUTPUT;
my %number_hash;
while ( my $number = <$numfile_fh> ) {
    chomp $number;   #Always chomp after you read
    if ( not exists $number_hash{$number} ) {
        $number_hash{$number} = 1;
        say $output_fh "$number";
    }
}
close $numfile_fh;
close $output_fh;

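I read in one number at a time, but instead of simply writing it to the file, I check my %number_hash to see whether I've already seen that number. If I haven't, I store it in %number_hash and print it out. The logic could also be written like this: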

while ( my $number = <$numfile_fh> ) {
    chomp $number;   #Always chomp after you read
    next if exists $number_hash{$number};

    $number_hash{$number} = 1;
    say $output_fh "$number";
}

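Some would say this is the better way to write the loop logic. In this style, you eliminate the exceptions first (duplicate numbers) and then handle the default case (print the number you read in and save it in the hash).

Notice that this doesn't actually change the order of your list. You read in a number and, as long as it isn't a duplicate, print it in the order you read it in. If you want to reorder the numbers so that they come out sorted, use two loops: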

while ( my $number = <$numfile_fh> ) {
    chomp $number;   #Always chomp after you read
    $number_hash{$number} = 1;
}

# Sort numerically; a plain "sort keys" would compare the keys as strings
for my $number ( sort { $a <=> $b } keys %number_hash ) {
    say $output_fh "$number";
}

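Notice that I don't bother testing whether the number is already in the hash. There's no need, since a hash can only hold one entry per key anyway.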
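One detail worth checking when sorting hash keys that are numbers: Perl's sort compares strings unless told otherwise. A short sketch with made-up values shows the difference; the variable names are illustrative only.

```perl
use strict;
use warnings;

my %number_hash = map { $_ => 1 } qw(10 9 100 2);

# Default sort: string comparison, so "100" comes before "2".
my @as_strings = sort keys %number_hash;

# Numeric comparison with the <=> operator.
my @as_numbers = sort { $a <=> $b } keys %number_hash;

print "string order:  @as_strings\n";
print "numeric order: @as_numbers\n";
```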

score 2 · Accepted Answer

Why not simply use another tool, such as awk:

awk '!_[$0]++' your_file

Perl also has a utility for getting the unique elements of an array:

use List::MoreUtils qw/ uniq /;
my @unique = uniq @lines;

If you don't want to use that utility, you can do it by hand:

my %seen;
my @unique = grep { ! $seen{$_}++ } @lines;

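Or you can simply use the function below to get the unique elements (note that keys returns them in no particular order, so the original order is not preserved):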

sub uniq {
    return keys %{{ map { $_ => 1 } @_ }};
}

Call it as: uniq(@myarray);
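The two hand-written approaches differ in whether they keep the input order. The sketch below puts them side by side; the sub names uniq_ordered and uniq_unordered are hypothetical, and the second one spells out the hash-keys idea in two statements rather than one.

```perl
use strict;
use warnings;

# Order-preserving: grep keeps the first occurrence of each element.
sub uniq_ordered {
    my %seen;
    return grep { !$seen{$_}++ } @_;
}

# Hash-keys variant: unique elements, but in whatever order the hash yields.
sub uniq_unordered {
    my %unique = map { $_ => 1 } @_;
    return keys %unique;
}

my @lines = qw(3 1 3 2 1);    # illustrative data

my @ordered = uniq_ordered(@lines);
print "ordered:   @ordered\n";

# Sorted here only so that the printed result is stable between runs.
my @unordered = sort { $a <=> $b } uniq_unordered(@lines);
print "unordered: @unordered\n";
```

For the question's use case, where the output should follow the input order, the grep form (or uniq from List::MoreUtils) is the safer choice.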
