perl - Modifying a Perl script

Question

I'm somewhat new to Perl (and to programming in general), and I have been handed a Perl script (Id_script3.pl).

Relevant code from Id_script3.pl:

# main sub 
{ # closure 
# keep %species local to sub-routine but only init it once 
my %species; 
sub _init { 
    open my $in, '<', 'SpeciesId.txt' or die "could not open SpeciesId.txt: $!"; 
    my $spec; 
    while (<$in>) { 
        chomp; 
        next if /^\s*$/; # skip blank lines 
        if (m{^([A-Z])\s*=\s*(\d+(?:\.\d)?)(?:\s+AND\s+(\d+(?:\.\d)?))?$}) { 
            # handle letter = lines 
            $species{$spec}{$1} = [$2]; 
            push @{$species{$spec}{$1}}, $3 if $3; 
        } else { 
            # handle species name lines 
            $spec = $_; 
            $len = length($spec) if (length($spec) > $len); 
        } 
    } 
    close $in; 
} 
sub analyze { 
    my ($masses) = @_; 
    _init() unless %species; 
    my %data; 
    # loop over species entries 
SPEC: 
    foreach my $spec (keys %species) { 
        # loop over each letter of a species 
LTR: 
        foreach my $ltr (keys %{$species{$spec}}) { 
            # loop over each mass for a letter 
            foreach my $mass (@{$species{$spec}{$ltr}}) { 
                # skip to next letter if it is not found 
                next LTR unless exists($masses->{$mass}); 
            } 
            # if we get here, all mass values were found for the species/letter 
            $data{$spec}{cnt}++; 
        } 
    }

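The script needs to be modified so that it uses "SpeciesId3.txt" instead of the "SpeciesId.txt" it currently uses.

There are subtle differences between the two files, so the script needs a slight modification to work properly. The difference is that, compared with the original "SpeciesId.txt", SpeciesId3.txt does not contain the letters (A =, B =, C =); it is simply a (much) longer list of values.

SpeciesId.txt: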

African Elephant

B = 1453.7
C = 1577.8
D = 2115.1
E = 2808.4
F = 2853.5 AND 2869.5
G = 2999.4 AND 3015.4

Indian Elephant

B = 1453.7
C = 1577.8
D = 2115.1
E = 2808.4
F = 2853.5 AND 2869.5
G = 2999.4 AND 3015.4

Rabbit

A = 1221.6 AND 1235.6
B = 1453.7
C = 1592.8
D = 2129.1
E = 2808.4
F = 2883.5 AND 2899.5
G = 2957.4 AND 2973.4

SpeciesId3.txt (the file to be used, for which the script must be modified):

African Elephant


826.4
836.4
840.4
852.4
858.4
886.4
892.5
898.5
904.5
920.5
950.5
1001.5
1015.5
1029.5
1095.6
1105.6

Indian Elephant

835.4
836.4
840.5
852.4
868.4
877.4
886.4
892.5
894.5
898.5
908.5
920.5
950.5
1095.6
1105.6
1154.6
1161.6
1180.7
1183.6
1189.6
1196.6
1201.6
1211.6
1230.6
1261.6
1267.7


Rabbit

817.5
836.4
852.4
868.5
872.4
886.4
892.5
898.5
908.5
950.5
977.5
1029.5
1088.6
1095.6
1105.6
1125.5
1138.6
1161.6
1177.6
1182.6
1201.6
1221.6
1235.6
1267.7
1280.6
1311.6
1332.7
1378.5
1437.7
1453.7
1465.7
1469.7

As you can see, the letters (A =, B =) are gone in SpeciesId3.txt.

I have tried several "workarounds", but have yet to write one that works.

Many thanks,

Stephen.

score 2 · Accepted Answer

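Well, I am not sure I would keep that script at all, as it looks rather messy: it uses script-global variables inside subroutines and has strange labels. Here is an approach you may want to consider instead: use Perl's paragraph mode by setting the input record separator $/ to the empty string.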

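It is a bit clunky because chomp cannot remove newlines from hash keys, so I used a do block to compensate. do { ... } works like a subroutine and returns the value of its last executed statement, which in this case is the list of elements in the array.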

use strict;
use warnings;
use Data::Dumper;

local $/ = "";        # paragraph mode

my %a = do { my @x = <DATA>; chomp(@x); @x; };  # read the file, remove newlines
$_ = [ split ] for values %a;                   # split numbers into arrays
print Dumper \%a;                               # print data structure

__DATA__
African Elephant


826.4
836.4
840.4
852.4
858.4
886.4
892.5
898.5
904.5
920.5
950.5
1001.5
1015.5
1029.5
1095.6
1105.6

Indian Elephant

835.4
836.4
840.5
852.4
868.4
877.4
886.4
892.5
894.5
898.5
908.5
920.5
950.5
1095.6
1105.6
1154.6
1161.6
1180.7
1183.6
1189.6
1196.6
1201.6
1211.6
1230.6
1261.6
1267.7


Rabbit

817.5
836.4
852.4
868.5
872.4
886.4
892.5
898.5
908.5
950.5
977.5
1029.5
1088.6
1095.6
1105.6
1125.5
1138.6
1161.6
1177.6
1182.6
1201.6
1221.6
1235.6
1267.7
1280.6
1311.6
1332.7
1378.5
1437.7
1453.7
1465.7
1469.7

Output:

$VAR1 = {
          'Rabbit' => [
                        '817.5',
                        '836.4',
                        '852.4',
                        '868.5',
                        '872.4',
                        '886.4',
                        '892.5',
                        '898.5',
                        '908.5',
                        '950.5',
                        '977.5',
                        '1029.5',
                        '1088.6',
                        '1095.6',
                        '1105.6',
                        '1125.5',
                        '1138.6',
                        '1161.6',
                        '1177.6',
                        '1182.6',
                        '1201.6',
                        '1221.6',
                        '1235.6',
                        '1267.7',
                        '1280.6',
                        '1311.6',
                        '1332.7',
                        '1378.5',
                        '1437.7',
                        '1453.7',
                        '1465.7',
                        '1469.7'
                      ],
          'Indian Elephant' => [
                                 '835.4',
                                 '836.4',
                                 '840.5',
                                 '852.4',
                                 '868.4',
                                 '877.4',
                                 '886.4',
                                 '892.5',
                                 '894.5',
                                 '898.5',
                                 '908.5',
                                 '920.5',
                                 '950.5',
                                 '1095.6',
                                 '1105.6',
                                 '1154.6',
                                 '1161.6',
                                 '1180.7',
                                 '1183.6',
                                 '1189.6',
                                 '1196.6',
                                 '1201.6',
                                 '1211.6',
                                 '1230.6',
                                 '1261.6',
                                 '1267.7'
                               ],
          'African Elephant' => [
                                  '826.4',
                                  '836.4',
                                  '840.4',
                                  '852.4',
                                  '858.4',
                                  '886.4',
                                  '892.5',
                                  '898.5',
                                  '904.5',
                                  '920.5',
                                  '950.5',
                                  '1001.5',
                                  '1015.5',
                                  '1029.5',
                                  '1095.6',
                                  '1105.6'
                                ]
        };

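As you can see from this rather verbose output, the result is a hash with the animal names as keys and references to arrays of numbers as values. As long as you can rely on the names and the number lists being separated by at least two consecutive newlines, and on there being no stray blank lines inside the data, this approach will do the trick.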
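To use the same idea on a real file rather than __DATA__, the paragraph-mode read can be wrapped in a subroutine so that the change to $/ stays local. The following is only a minimal sketch: the name read_species is made up for illustration, and the in-memory string stands in for SpeciesId3.txt so that the example runs on its own.

```perl
use strict;
use warnings;

# Hypothetical helper: parse alternating "name" and "numbers" paragraphs
# from an already-open filehandle into { name => [ masses ] }.
sub read_species {
    my ($fh) = @_;
    local $/ = "";                 # paragraph mode, restored when the sub returns
    my @paragraphs = <$fh>;
    chomp @paragraphs;             # in paragraph mode this strips all trailing newlines
    die "odd number of paragraphs: a name or a number list is missing\n"
        if @paragraphs % 2;
    my %species;
    while (my ($name, $numbers) = splice @paragraphs, 0, 2) {
        $species{$name} = [ split ' ', $numbers ];   # split on any whitespace
    }
    return \%species;
}

# Small in-memory sample standing in for SpeciesId3.txt (an assumption for this sketch).
my $sample = "African Elephant\n\n\n826.4\n836.4\n\nRabbit\n\n817.5\n836.4\n852.4\n";
open my $in, '<', \$sample or die "could not open in-memory sample: $!";
my $species = read_species($in);
close $in;

printf "%s: %d masses\n", $_, scalar @{ $species->{$_} } for sort keys %$species;
```

With a real file, the open call would take the file name instead of a reference to a string. The check for an odd paragraph count is one way to notice a stray blank line in the data, which would otherwise silently pair names with the wrong lists.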

score 0 · Accepted Answer

if (m{^([A-Z])\s*=\s*(\d+(?:\.\d)?)(?:\s+AND\s+(\d+(?:\.\d)?))?$}) {

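This line contains a regular expression that looks for a capital letter [A-Z] followed by an equals sign with optional whitespace on either side, \s*=\s*. You basically want to remove that prefix (along with the optional AND part, which the new file does not use) and simply match a number: (\d+(?:\.\d)?).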

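Because $1, $2, $3 are numbered by the position of their opening parenthesis, counting from the left, the number you want is now in $1. (Parentheses that begin with ?: are non-capturing and are not counted.)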
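The renumbering can be checked with a short stand-alone sketch. The variable names and the two sample lines are chosen only for illustration; the patterns are the original one and the reduced one described here.

```perl
use strict;
use warnings;

# Old format: captures are the letter ($1), the first mass ($2)
# and the optional second mass ($3).
my $old = qr{^([A-Z])\s*=\s*(\d+(?:\.\d)?)(?:\s+AND\s+(\d+(?:\.\d)?))?$};

# New format: only one capturing group is left, so the mass lands in $1.
my $new = qr{^(\d+(?:\.\d)?)$};

# In list context a successful match returns the captured groups in order.
my @old_caps = 'F = 2853.5 AND 2869.5' =~ $old;
my @new_caps = '826.4' =~ $new;

print "old: @old_caps\n";
print "new: @new_caps\n";
```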

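You also need to change the %species variable so that its keys are the species names and its values are simply lists of numbers (the extracted observations).

So:

if (m{^(\d+(?:\.\d)?)$}) { 
    # $species{$spec} holds an array reference, so dereference it with @{...}
    push @{$species{$spec}}, $1; 
}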

The analyze subroutine needs a similar adaptation (the LTR level is now essentially gone).
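Put together, the adapted pair of subroutines might look roughly like the sketch below. Several things here are assumptions rather than part of the original script: _init takes an open filehandle so the example can run on in-memory data, analyze returns its result, and, because the rest of the original analyze is not shown, cnt is taken to mean the number of a species' masses that occur as keys in the observed set. Note that the push needs an array dereference, @{ ... }.

```perl
use strict;
use warnings;

{   # closure: %species is private to these subs and filled only once
    my %species;

    # Assumption: receives an open filehandle instead of opening the file itself.
    sub _init {
        my ($in) = @_;
        my $spec;
        while (<$in>) {
            chomp;
            next if /^\s*$/;                         # skip blank lines
            if (m{^(\d+(?:\.\d)?)$}) {
                die "mass found before any species name\n" unless defined $spec;
                push @{ $species{$spec} }, $1;       # mass line for the current species
            } else {
                $spec = $_;                          # species name line
            }
        }
    }

    # Assumption: count how many of each species' masses were observed.
    sub analyze {
        my ($masses) = @_;
        my %data;
        foreach my $spec (keys %species) {
            # grep in scalar context yields the number of matching masses
            $data{$spec}{cnt} = grep { exists $masses->{$_} } @{ $species{$spec} };
        }
        return \%data;
    }
}

# In-memory sample standing in for SpeciesId3.txt.
my $sample = "African Elephant\n\n826.4\n836.4\n\nRabbit\n\n817.5\n836.4\n";
open my $in, '<', \$sample or die "could not open in-memory sample: $!";
_init($in);
close $in;

my %observed = map { $_ => 1 } qw(836.4 817.5);      # hypothetical observed masses
my $result = analyze(\%observed);
print "$_: $result->{$_}{cnt}\n" for sort keys %$result;
```

With only one level of data per species, the nested loops and the LTR label reduce to a single grep per species.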
