这是我的起点数据文件的摘录:
Marker Distance_1 Distance_2 ID
.
.
.
30 13387412 34.80391242 seq-SN_FIRST
31 13387444 34.80391444 seq-SN_Second
31.1
31.2
32 13387555 39.80391 seq-SN_Third
.
.
.
这是一个制表符分隔的文件,包含多行,每行四个元素。第一行是标题。之后,无数行数据。垂直点实际上不在真实文件中,但它们在这里只是为了表示与显示的实际行类似的数据出现在明确显示的行的示例之前和之后。
一些数据行是“满的”,也就是说,所有四个单元格条目都包含一些东西。其他行是“空白”,只有第一个实际条目,但后面是 3 个制表符分隔的单个空格。空白行中的那些空白需要“填充”。填充将通过线性插值完成,使用前一行和紧后行的相应单元格条目。例如,第 2 列中的缺失将使用前一行的值和后一行的Distance_1 values
值进行插值。对于第 3 列的值也是如此。此处仅忽略第 4 列的值。13387444
13387555
该脚本的第一个目标是识别需要填充的数据块及其两侧的“完整”行。空行将包含 3 个选项卡式单个空格,并以这种方式进行标识。一旦找到,连续的空行集合加上侧翼的完整行被发送到子程序进行插值。
#!/usr/bin/perl
use strict;
use warnings;
die "usage: [ map positions file post SAS ]\n\n" unless @ARGV == 1;
my @file = ();
while (my $line = <$ARGV[0]>) {
chomp $line;
push(@file, $line);
}
my @consecutive_lines = (); # array collects a current set of consecutive lines requiring linear interpolation
my @filled = (); # my final array, hopefully correctly filled in
#####
# search for consecutive set of lines in @file
#####
for (my $i = 0; $i < $#file; $i++) { # $#file returns the index of the last element in @file
if ($file[$i] !~ /(?:\t\s){3}/) { # if not a blank line
# but a "full line"
push(@filled, $file[$i]); # push the header and full lines, until...
}
elsif ($file[$i] =~ /(?:\t\s){3}/) { # ...a blank line is found
push (@consecutive_lines, $file[$i - 1]); # push preceding full line into array
while ($file[$i] =~ /(?:\t\s){3}/ and $i < $#file) { # keep pushing lines, so long as they are blank
# or end of file
push(@consecutive_lines, $file[$i++]);
}
push(@consecutive_lines, $file[$i]) ; # else we reach next full line, so push it into array
my @fillme = linearInterpolation(@consecutive_lines); # send set of lines out for filling
push(@filled, @fillme); # push filled in set of lines into the final array
@consecutive_lines = (); # reset or undef array @consecutive_lines for next cycle
} # end of elsif
} # end of for loop
感谢用户@Kenosis 对上述内容提供了很多帮助,我已经对其进行了修改(希望不会被破坏)。
接下来是线性插值。正是在这里,我试图将脚本的第一阶段链接到第二阶段。到目前为止,它运行得并不好。
我的目标是将数组@incoming
交给子程序。然后将该数组拆分,以便实际的单元格条目是“可见的”并且可以由该数组索引,并被调用。我一直在试图弄清楚如何为第 2 列的值执行此操作Distance_1
。我觉得这个脚本很接近,并且在计算插值后开始偏离。
#####
# subroutine linear interpolation
#####
sub linearInterpolation {
my @incoming = @_; # array of consecutive set of lines
my @splitup; # declare new array, will be a "split up" version of @incoming
my ($A, $B, $C, $D, $E); # variables for linear interpolation
my @fillme; # declaring the "emtpy" array to be filled in
my @dist_1_fills; # array of interpolated values for dist_1
for (my $i = 0;
$i < scalar @incoming; $i++) # loop to split up lines of @incoming
{ # into indexed cell entries
chomp $incoming[$i]; # and make new array of them
my @entries = split('\t', $incoming[$i]);
push(@splitup, @entries);
}
$A = $splitup[1]; # cell entry in column 2 of preceding full line
$B = $splitup[-3]; # cell entry in column 2 of succeeding full line
$C = $splitup[2]; # cell entry in column 3 of preceding full line
$D = $splitup[-2]; # cell entry in column 3 of succeeding full line
$E = scalar @incoming - 1; # equals number of lines in the set minus 1
for (my $i = 1; $i < $E; $i++) { # need to start finding appropriate
# number interpolated values, given number of
my @dist_1_fills =
interpvalues($A, $B, $E, $i); # of lines in consecutive set of lines
for ($i = 0; $i < scalar @splitup; $i += 4) {
push(@fillme, $splitup[$i], $dist_1_fills[$i], "dist_2_fills", "--");
# fourth column values will be ignored or filled with --.
# "dist_2_fills" just occupying it's proper spot until I can figure out distance 1 fills
}
}
}
#########
sub interpvalues { # subroutine to find interpolated values
my ($A, $B, $E, $i) = @_;
my $dist_1_answers = (($B - $A) / ($E)) * $i + $A;
return $dist_1_answers;
}
代码在处理查找插值并将它们发送回代码的第一部分以最终填充数据集的第二部分变得混乱。我认为我最大的(尽管可能不是我唯一的)问题是在第二个子例程中计算出正确的值之后尝试用正确的值填充空白行。
非常感谢任何提示和线索!