arrays - How can I read a file and make a record for each line?

Question

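I'm looking for help writing a Perl program that takes an input file and performs operations based on subsequent commands. I'm a beginner in Perl, so please don't suggest anything too advanced. So far my structure is a main program and 4 subroutines.

I'm having trouble with two parts:

Writing the part of the main section that creates a unique record for each line of the input file (it is in fixed-width format). I think this should be done with substr, but I don't know how to structure it. unpack is beyond the scope of what I have learned so far.

One of the functions called from the main program is the Distance sub, which will calculate the distance between atoms. I think this should be a for loop inside a for loop. Any thoughts on what approach I should take?

The records should store a set of atom records (one record/atom per line):

• The atom's serial number, 5 digits. (cols 7 - 11)

• The three-letter name of the amino acid it belongs to (cols 18 - 20)

• The atom's three coordinates (x, y, z), real numbers in decimal notation, orthogonal coordinates in Angstroms (cols 31 - 54):
X in cols 31 - 38
Y in cols 39 - 46
Z in cols 47 - 54

• The atom's one- or two-letter element name (e.g. C, O, N, Na) (cols 77 - 78)

sub Distance # takes an array of atom records and returns the maximum
distance between all pairs of atoms in that array. (cols 31 - 54)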

Here is sample text from the input file.

# truncating for testing purposes. Actual data is aprox. 100 columns     
# and starts with ATOM or HETATM    
__DATA__   
ATOM   4743  CG  GLN A 704      19.896  32.017  54.717  1.00 66.44           C    
ATOM   4744  CD  GLN A 704      19.589  30.757  55.525  1.00 73.28           C    
ATOM   4745  OE1 GLN A 704      18.801  29.892  55.098  1.00 75.91           O

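Here is my main program and the sub for making records so far. I hate to be lame, but I have nothing to show for the Distance sub, so don't worry about providing code; any suggestions on how to approach it would be appreciated.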

use warnings;
use strict; 

my @fields;
my @recs;

while ( <DATA> ) {
chomp;
@fields = split(/\s+/);
push @recs, makeRecord(@fields);
}

for (my $i = 0; $i < @recs; $i++) {
printRec( $recs[$i] );
}
my %command_table = (
  freq => \&freq,
  length => \&length,
  density => \&density,
  help => \&help, 
  quit => \&quit
);

print "Enter a command: ";
  while ( <STDIN> ) {
  chomp; 
  my @line = split( /\s+/);
  my $command = shift @line;
  if ($command !~ /^freq$|^density$|length|^help$|^quit$/ ) {
    print "Command must be: freq, length, density or quit\n";
  }
    else {
    $command_table{$command}->();
  }
print "Enter a command: ";
}

sub makeRecord 
# Read the entire line and make records from the lines that contain the 
# word ATOM or HETATM in the first column. Not sure how to do this:
{
 my %record = 
 (
 serialnumber => shift,
 aminoacid => shift,
 coordinates => shift,
 element  => [ @_ ]
 );
 return\%record;
 }

score 1 · Accepted Answer

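There is Perl code available online for working with PDB files (which is evidently what you are doing). I wouldn't suggest just downloading a module and being done with it, since your teacher surely wouldn't approve and you wouldn't learn as much ;) But you could look at some of the available code and see whether bits of it solve your problems.

I did a bit of Googling and saw that there is ParsePDB.pm (for example). You can find the web page here. I haven't looked at the code or its features, though; I just hope there will be something there that helps you.

Edit 1

OK, it's now 14 hours later and I felt like writing some code, and since you haven't accepted an answer yet, I figured I could ignore my own advice and draft something (you'll notice I have copied Zaid's data structure) ...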

#!/usr/bin/perl

use warnings;
use strict;

sub makeRecord {
   my ($ser_num, $aa, $x, $y, $z, $element) = @_;
   # copying Zaid now as her/his structure looks very sensible!
   my $record = {
                  serial  => $ser_num,
                  aa      => $aa,
                  element => $element,
                  xyz     => [$x, $y, $z],
                };
   return $record;
}


my $file = shift @ARGV;
my @records; # will be an array of hash references

# three-argument open with a lexical filehandle
open my $fh, '<', $file or die "Cannot open '$file': $!";
while (<$fh>) {
   if (/^ATOM|^HETATM/) { # only get the structure data lines
      chomp; # not necessary here, but good practice I'd say

      my @fields = split; # by default 'split' splits on whitespace

      # now use an array slice to only pass the array elements
      # you're interested in (using the positional indices from @fields):
      push @records, makeRecord(@fields[1,3,6,7,8,11]);
   }
}
close $fh;
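
Since the question mentions substr and the file is fixed-width, the same kind of record could also be built by column position instead of by splitting on whitespace. What follows is only a minimal sketch, assuming the column ranges listed in the question; the sub name makeRecordFromLine is hypothetical, and the hash keys simply mirror the ones used above. substr offsets are zero-based, so column 7 starts at offset 6.

```perl
use strict;
use warnings;

# Hypothetical helper: build one record from a fixed-width ATOM/HETATM line.
# substr offsets are zero-based, so column N starts at offset N - 1.
sub makeRecordFromLine {
    my ($line) = @_;

    # Pad short lines to 78 characters so substr never reads past the end.
    $line = sprintf '%-78s', $line;

    my %record = (
        serial  => substr($line,  6, 5),   # columns  7-11
        aa      => substr($line, 17, 3),   # columns 18-20
        xyz     => [
            substr($line, 30, 8),          # columns 31-38 (x)
            substr($line, 38, 8),          # columns 39-46 (y)
            substr($line, 46, 8),          # columns 47-54 (z)
        ],
        element => substr($line, 76, 2),   # columns 77-78
    );

    # Strip the padding whitespace left over from the fixed-width layout.
    s/^\s+|\s+$//g
        for $record{serial}, $record{aa}, $record{element}, @{ $record{xyz} };

    return \%record;
}

my $line = 'ATOM   4743  CG  GLN A 704      19.896  32.017  54.717  1.00 66.44           C';
my $rec  = makeRecordFromLine($line);
print "$rec->{serial} $rec->{aa} @{ $rec->{xyz} } $rec->{element}\n";
```

Unlike split, this keeps working when two neighbouring fields have no whitespace between them, which a fixed-width format allows.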

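Edit 2

About the distance subroutine: a for loop inside a for loop should do the job, but it is the brute-force way and may take a long time, depending on the size of your input molecule, because it needs on the order of (number_of_atoms)^2 distance calculations. For your assignment the brute-force approach is probably acceptable; in other cases you have to decide between ease of coding and speed of computation. If your instructor wants you to keep the latter in mind too, you could take a look at this page (I know you actually want the maximum distance, and that you're in 3D, not 2D...)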
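
To make the brute-force idea concrete, here is a rough sketch of the nested loop, assuming each record is a hash reference with its coordinates stored as an array reference under the key xyz (as in makeRecord above). The sub names distance and maxDistance are only illustrative. Starting the inner loop at $i + 1 visits each pair once, which halves the work but is still quadratic.

```perl
use strict;
use warnings;

# Euclidean distance between two records whose coordinates live in
# an array reference under the (assumed) key 'xyz'.
sub distance {
    my ($p, $q) = @_;
    my $sum = 0;
    for my $i (0 .. 2) {
        $sum += ($p->{xyz}[$i] - $q->{xyz}[$i]) ** 2;
    }
    return sqrt $sum;
}

# Brute force: compare every unordered pair of atoms exactly once
# and remember the largest distance seen. Returns 0 for fewer than 2 atoms.
sub maxDistance {
    my @atoms = @_;
    my $max = 0;
    for my $i (0 .. $#atoms - 1) {
        for my $j ($i + 1 .. $#atoms) {
            my $d = distance($atoms[$i], $atoms[$j]);
            $max = $d if $d > $max;
        }
    }
    return $max;
}

# Made-up coordinates, chosen so the answer is easy to check by hand.
my @atoms = (
    { serial => 1, xyz => [0, 0, 0] },
    { serial => 2, xyz => [3, 4, 0] },
    { serial => 3, xyz => [1, 1, 1] },
);
printf "%.3f\n", maxDistance(@atoms);   # prints 5.000
```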

OK, now I just hope you can find some useful bits and pieces here :)

score 1 · Accepted Answer

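It's odd that unpack is considered out of scope when I see a dispatch table in use. It would be foolish to ignore unpack if you are dealing with a fixed-format file. There is nothing "advanced" in the code below: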

use strict;
use warnings;
use Data::Dump 'dump';   # Use this if you want 'dump' function to work

my @records;
while ( my $record = <DATA> ) {

    next unless $record =~ /^ATOM|^HETATM/;  # Skip unwanted records

    # unpack minimizes the amount of work the code has to do ...
    # ... especially since you only want a small part of the file
    # 'x' tokens are ignored, 'A' tokens are read ...
    # The number following each token represents repetition count ...
    # ... so in this case the first 6 characters are ignored ...
    # ... and the next 5 are assigned to $serNo

    # The coordinates occupy 8 columns each (31-38, 39-46, 47-54)
    my ( $serNo, $aminoAcid, $xCoord, $yCoord, $zCoord )
        = unpack 'x6A5x6A3x10A8A8A8', $record;           # Get only what you want

    # Assign data to a hash reference

    my $recordStructure = {
                            serialnumber => $serNo,
                            aminoacid    => $aminoAcid,
                            coordinates  => [ $xCoord, $yCoord, $zCoord ],
                          };

    push @records, $recordStructure;  # Append current record
}

# 'dump' is really useful to view data structures. No need for PrintRec!!

dump @records;
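
To see what such a template yields on one of the sample lines, and to pick up the element field in the same pass, here is a small sketch. The template is derived from the column ranges listed in the question; the variable names are only illustrative. Whitespace inside a template is ignored, so it can be spaced out for readability.

```perl
use strict;
use warnings;

my $line = 'ATOM   4744  CD  GLN A 704      19.589  30.757  55.525  1.00 73.28           C';

# x6  skip cols  1-6     A5 read cols  7-11 (serial number)
# x6  skip cols 12-17    A3 read cols 18-20 (amino acid)
# x10 skip cols 21-30    A8 A8 A8 read cols 31-54 (x, y, z)
# x22 skip cols 55-76    A2 read cols 77-78 (element)
# Note: 'x' dies if the line is too short to skip that far.
my ($serNo, $aminoAcid, $x, $y, $z, $element)
    = unpack 'x6 A5 x6 A3 x10 A8 A8 A8 x22 A2', $line;

# 'A' strips trailing spaces only, so trim leading ones explicitly.
s/^\s+// for $serNo, $x, $y, $z, $element;

print join('|', $serNo, $aminoAcid, $x, $y, $z, $element), "\n";
# prints 4744|GLN|19.589|30.757|55.525|C
```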

score 1 · Accepted Answer

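Your records are in a fixed-width format, so use unpack to split each record into the fields of interest. Use the stated column positions of each field to construct a template for unpack.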

my @field_specs = (
    {begin =>  7, end => 11, name => 'serialnumber'},
    {begin => 18, end => 20, name => 'aminoacid'},
    {begin => 31, end => 38, name => 'X'}, 
    {begin => 39, end => 46, name => 'Y'},
    {begin => 47, end => 54, name => 'Z'}, 
    {begin => 77, end => 78, name => 'element'},
);
my $unpack_template = '';
my @col_names;
for my $spec (@field_specs) {
    my $offset = $spec->{begin} - 1;          # '@' positions are zero-based
    my $width  = $spec->{end} - $offset;
    $unpack_template .= "\@${offset}A$width"; # jump to offset, read $width chars
    push @col_names, $spec->{name};
}
print "Ready to read @col_names\n using template $unpack_template ...\n";

# prints 
# Ready to read serialnumber aminoacid X Y Z element 
#  using template @6A5@17A3@30A8@38A8@46A8@76A2 ...

my @recs;
while ( <DATA> ) {                
    my %record;
    @record{@col_names} = unpack($unpack_template, $_);    
    push @recs, \%record;                
}
