perl - How to parse a file, create records, and perform operations on the records, including term frequency and distance calculations

Question

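I am a student in an introductory Perl course, looking for advice and feedback on my approach to writing a small (but tricky) program that analyzes data about atoms. My professor encourages using forums. I am not well versed in Perl subs or modules (including BioPerl), so please keep responses at an appropriate "beginner level" so that I can understand and learn from your suggestions and/or code (please also limit the "magic").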

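The requirements of the program are as follows:

Read a file (containing data about atoms) named on the command line and create an array of atom records (one record/atom per line). For each record, the program needs to store:

• The atom's serial number (columns 7 - 11)
• The three-letter name of the amino acid it belongs to (columns 18 - 20)
• The atom's three coordinates (x, y, z) (columns 31 - 54)
• The atom's one- or two-letter element name (e.g. C, O, N, Na) (columns 77 - 78)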

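Prompt for one of three commands: freq, length, density d (where d is some number):

• freq - how many atoms of each type are in the file (e.g. nitrogen, sulfur, etc. would be displayed like this: N: 918 S: 23)
• length - the distance between coordinates
• density d (where d is a number) - the program will prompt for the name of a file in which to save the calculations; the file will contain the distances between each atom and every other atom. If a distance is less than or equal to the number d, the program increments the count of atoms within that distance, unless the count in the file is zero. The output looks like:
1: 5
2: 3
3: 6
... (a very large file), and the file is closed when finished.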

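I am looking for feedback on what I have written (and still need to write) in the code below. I would especially appreciate any feedback on how to write my subs. I have included sample input data at the bottom.

The program structure and function descriptions as I see them: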

$^W = 1; # turn on warnings
use strict; # behave!

my @fields;
my @recs;

while ( <DATA> ) {
 chomp;
 @fields = split(/\s+/);
 push @recs, makeRecord(@fields);
}

for (my $i = 0; $i < @recs; $i++) {
 printRec( $recs[$i] );
}
    my %command_table = (
 freq => \&freq,
 length => \&length,
 density => \&density,
 help => \&help, 
 quit => \&quit
 );

print "Enter a command: ";
while ( <STDIN> ) {
 chomp; 
 my @line = split( /\s+/);
 my $command = shift @line;
 if ($command !~ /^freq$|^density$|length|^help$|^quit$/ ) {
    print "Command must be: freq, length, density or quit\n";
    }
  else {
    $command_table{$command}->();
    }
 print "Enter a command: ";
 }

sub makeRecord 
    # Read the entire line and make records from the lines that contain the 
    # word ATOM or HETATM in the first column. Not sure how to do this:
{
 my %record = 
 (
 serialnumber => shift,
 aminoacid => shift,
 coordinates => shift,
 element  => [ @_ ]
 );
 return\%record;
}

sub freq
    # take an array of atom records, return a hash whose keys are 
    # distinct atom names and whose values are the frequences of
    # these atoms in the array.  

sub length
    # take an array of atom records and return the max distance 
    # between all pairs of atoms in that array. My instructor
    # advised this would be constructed as a for loop inside a for loop. 

sub density
    # take an array of atom records and a number d and will return a
    # hash whose keys are atom serial numbers and whose values are 
    # the number of atoms within that distance from the atom with that
    # serial number. 

sub help
{
    print "To use this program, type either\n",
          "freq\n",
          "length\n",
          "density followed by a number, d,\n",
          "help\n",
          "quit\n";
}

sub quit
{
 exit 0;
}

# truncating for testing purposes. Actual data is aprox. 100 columns 
# and starts with ATOM or HETATM.
__DATA__
ATOM   4743  CG  GLN A 704      19.896  32.017  54.717  1.00 66.44           C  
ATOM   4744  CD  GLN A 704      19.589  30.757  55.525  1.00 73.28           C  
ATOM   4745  OE1 GLN A 704      18.801  29.892  55.098  1.00 75.91           O

score 5 · Accepted Answer

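It looks like your Perl skills are progressing well: you are already using references and complex data structures. Here are a few tips and some general advice.

Enable warnings with `use warnings` rather than `$^W = 1`. The former is self-documenting and has the advantage of being local to the enclosing block rather than a global setting.
Use well-named variables, which help document the program's behavior, rather than relying on Perl's special `$_`. For example: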
```
while (my $input_record = <DATA>){
}
```
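In user-input scenarios, an infinite loop provides a way to avoid repeating instructions such as "Enter a command". See below.
Your regular expression can be simplified to avoid the need for repeated anchors. See below.
As a general rule, affirmative tests are easier to understand than negative tests. See the modified if-else structure below.
Enclose each part of the program in its own subroutine. This is good general practice for many reasons, so I would start getting into the habit.
A related good practice is to minimize the use of global variables. As an exercise, you could try writing the program so that it uses no global variables at all. Instead, any needed information would be passed between subroutines. For small programs, rigidly avoiding globals is not strictly necessary, but it is not a bad idea to keep the ideal in mind.
Give your `length` subroutine a different name. That name is already used by the built-in `length` function.
Regarding your question about `makeRecord`, one approach is to leave the filtering issue out of `makeRecord` altogether. Instead, `makeRecord` could include an additional hash field, and the filtering logic would reside elsewhere. For example: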
```
my $record = makeRecord(@fields);
push @recs, $record if $record->{type} =~ /^(ATOM|HETATM)$/;
```
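To make the idea of an extra `type` field concrete, here is a minimal sketch of how `makeRecord` and the filter might fit together. The field positions (`$fields[0]`, `$fields[3]`, and so on) are an assumption that only holds for whitespace-separated lines shaped like the sample data, and the `REMARK` line is a hypothetical non-atom line added for illustration.

```perl
use strict;
use warnings;

# Build one record (a hash reference) from the whitespace-separated
# fields of a line. ASSUMPTION: the positions below match lines shaped
# like the sample data in the question.
sub makeRecord {
    my @fields = @_;
    my %record = (
        type         => $fields[0],
        serialnumber => $fields[1],
        aminoacid    => $fields[3],
        coordinates  => [ $fields[6], $fields[7], $fields[8] ],
        element      => $fields[11],
    );
    return \%record;
}

# Two sample lines from the question plus one hypothetical non-atom line.
my @lines = (
    'ATOM   4743  CG  GLN A 704      19.896  32.017  54.717  1.00 66.44           C',
    'REMARK this line is hypothetical and should be skipped',
    'ATOM   4745  OE1 GLN A 704      18.801  29.892  55.098  1.00 75.91           O',
);

my @recs;
for my $line (@lines) {
    my @fields = split /\s+/, $line;
    my $record = makeRecord(@fields);
    # The filter lives outside makeRecord.
    push @recs, $record if $record->{type} =~ /^(ATOM|HETATM)$/;
}

for my $rec (@recs) {
    print "$rec->{serialnumber} $rec->{aminoacid} $rec->{element}\n";
}
```

Keeping the filter outside `makeRecord` means the subroutine has one job (building a record), and the rule for which records to keep can change without touching it.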

An illustration of the points above:

use strict;
use warnings;

run();

sub run {
    my $atom_data = load_atom_data();
    print_records($atom_data);
    interact_with_user($atom_data);
}

...

sub interact_with_user {
    my $atom_data = shift;
    my %command_table = (...);

    while (1){
        print "Enter a command: ";
        chomp(my $reply = <STDIN>);

        my ($command, @line) = split /\s+/, $reply;

        if ( $command =~ /^(freq|density|length|help|quit)$/ ) {
            # Run the command.
        }
        else {
            # Print usage message for user.
        }
    }
}

...
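The question leaves the `freq` and `length` subroutines unwritten. As one possible starting point, the sketch below counts atoms per element and finds the largest pairwise distance with a for loop inside a for loop. The names `count_elements`, `distance`, and `max_distance` are hypothetical, and the code assumes each record is a hash reference with an `element` key and a `coordinates` key holding an array reference of x, y, z.

```perl
use strict;
use warnings;

# Count how many atoms of each element appear in the records.
sub count_elements {
    my @records = @_;
    my %count;
    for my $rec (@records) {
        $count{ $rec->{element} }++;
    }
    return %count;
}

# Euclidean distance between two atom records.
sub distance {
    my ($p, $q) = @_;
    my $dx = $p->{coordinates}[0] - $q->{coordinates}[0];
    my $dy = $p->{coordinates}[1] - $q->{coordinates}[1];
    my $dz = $p->{coordinates}[2] - $q->{coordinates}[2];
    return sqrt($dx * $dx + $dy * $dy + $dz * $dz);
}

# Largest distance over all pairs of atoms: a for loop inside a for loop.
# The inner loop starts at $i + 1 so each pair is compared only once.
sub max_distance {
    my @records = @_;
    my $max = 0;
    for (my $i = 0; $i < @records; $i++) {
        for (my $j = $i + 1; $j < @records; $j++) {
            my $d = distance($records[$i], $records[$j]);
            $max = $d if $d > $max;
        }
    }
    return $max;
}

# Records built by hand from the three sample lines in the question.
my @atoms = (
    { element => 'C', coordinates => [ 19.896, 32.017, 54.717 ] },
    { element => 'C', coordinates => [ 19.589, 30.757, 55.525 ] },
    { element => 'O', coordinates => [ 18.801, 29.892, 55.098 ] },
);

my %count = count_elements(@atoms);
for my $element (sort keys %count) {
    print "$element: $count{$element}\n";
}
printf "max distance: %.3f\n", max_distance(@atoms);
```

The same `distance` helper could be reused for the density command: for each atom, loop over every other atom and count those whose distance is less than or equal to d.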

score 4 · Accepted Answer

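FM's answer is pretty good. I'll just mention a few additional things:

You already have a hash containing the valid commands (which is a good idea). There is no need to duplicate that list in a regular expression. I would do something like this: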

if (my $routine = $command_table{$command}) {
  $routine->(@line);
} else {
  print "Command must be: freq, length, density or quit\n";
}

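Note that I am also passing `@line` to the subroutine, because the density command needs it. Subroutines that take no arguments can simply ignore it.

You could also generate the list of valid commands for the error message by using `keys %command_table`, but I'll leave that as an exercise for you.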
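For anyone who wants to compare notes after trying that exercise, here is one possible sketch. The `freq` and `density` subroutines are hypothetical stand-ins that only return a string, and `run_command` is a made-up helper name; the point is just the hash lookup and the error message built from `keys %command_table`.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the real commands. Each returns a string
# so the behaviour is easy to see.
sub freq    { return "freq called" }
sub density { my ($d) = @_; return "density called with d = $d" }

my %command_table = (
    freq    => \&freq,
    density => \&density,
);

# Look up and run one line of user input, or report the valid commands.
sub run_command {
    my ($reply) = @_;
    my ($command, @line) = split /\s+/, $reply;
    $command = '' unless defined $command;    # empty input line

    my $routine = $command_table{$command};
    if ($routine) {
        return $routine->(@line);             # extra arguments may be ignored
    }

    # Build the list of valid commands from the hash itself, so the
    # message never gets out of date.
    my $valid = join ', ', sort keys %command_table;
    return "Command must be one of: $valid";
}

print run_command('density 4.5'), "\n";   # density called with d = 4.5
print run_command('bogus'), "\n";         # Command must be one of: density, freq
```

Because the message is generated from the hash, adding a new command means adding one entry to `%command_table` and nothing else.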

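Another thing is that the description of the input file mentions column numbers, which suggests it is a fixed-width format. Such a file is better parsed with `substr` or `unpack`. If a field is blank or contains a space, your split will not parse it correctly. (If you use `substr`, note that it numbers columns starting from 0, whereas people usually label the first column 1.)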
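A minimal sketch of the `substr` approach, using the column ranges given in the question. Two things here are assumptions rather than facts from the question: that the record type occupies columns 1 - 6, and that columns 31 - 54 hold three coordinates of 8 characters each. The names `trim` and `parse_atom_line` are hypothetical, and the code assumes each line is at least 78 characters long.

```perl
use strict;
use warnings;

# Remove leading and trailing whitespace from a field.
sub trim {
    my ($text) = @_;
    $text =~ s/^\s+//;
    $text =~ s/\s+$//;
    return $text;
}

# Extract fields by column. substr offsets start at 0, so column 7
# becomes offset 6, column 18 becomes offset 17, and so on.
sub parse_atom_line {
    my ($line) = @_;
    my %record = (
        type         => trim( substr($line,  0, 6) ),  # columns 1-6 (assumed)
        serialnumber => trim( substr($line,  6, 5) ),  # columns 7-11
        aminoacid    => trim( substr($line, 17, 3) ),  # columns 18-20
        coordinates  => [
            trim( substr($line, 30, 8) ),              # x: columns 31-38 (assumed width)
            trim( substr($line, 38, 8) ),              # y: columns 39-46 (assumed width)
            trim( substr($line, 46, 8) ),              # z: columns 47-54 (assumed width)
        ],
        element      => trim( substr($line, 76, 2) ),  # columns 77-78
    );
    return \%record;
}

# First sample line from the question.
my $line = 'ATOM   4743  CG  GLN A 704      19.896  32.017  54.717  1.00 66.44           C';
my $rec  = parse_atom_line($line);

print "$rec->{type} $rec->{serialnumber} $rec->{aminoacid} ",
      "@{ $rec->{coordinates} } $rec->{element}\n";
```

Unlike `split`, this keeps working when a field is blank or when two fields run together with no space between them, because each value is taken from a fixed position.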
