arrays - 值表中缺失数据的线性插值

Question

这是我的起点数据文件的摘录：

Marker      Distance_1  Distance_2  ID
.
.
.
30          13387412    34.80391242 seq-SN_FIRST
31          13387444    34.80391444 seq-SN_Second
31.1             
31.2             
32          13387555    39.80391    seq-SN_Third
.
.
.

这是一个制表符分隔的文件，包含多行，每行四个元素。第一行是标题。之后，无数行数据。垂直点实际上不在真实文件中，但它们在这里只是为了表示与显示的实际行类似的数据出现在明确显示的行的示例之前和之后。

一些数据行是“满的”，也就是说，所有四个单元格条目都包含一些东西。其他行是“空白”，只有第一个实际条目，但后面是 3 个制表符分隔的单个空格。空白行中的那些空白需要“填充”。填充将通过线性插值完成，使用前一行和紧后行的相应单元格条目。例如，第 2 列中的缺失将使用前一行的值和后一行的Distance_1 values值进行插值。对于第 3 列的值也是如此。此处仅忽略第 4 列的值。1338744413387555

该脚本的第一个目标是识别需要填充的数据块及其两侧的“完整”行。空行将包含 3 个选项卡式单个空格，并以这种方式进行标识。一旦找到，连续的空行集合加上侧翼的完整行被发送到子程序进行插值。

#!/usr/bin/perl
use strict;
use warnings;

die "usage: [ map positions file post SAS ]\n\n" unless @ARGV == 1;

my @file = ();

while (my $line = <$ARGV[0]>) {
  chomp $line;
  push(@file, $line);
}

my @consecutive_lines = (); # array collects a current set of consecutive lines requiring linear interpolation
my @filled = ();    # my final array, hopefully correctly filled in

#####
# search for consecutive set of lines in @file
#####

for (my $i = 0; $i < $#file; $i++) {          #  $#file returns the index of the last element in @file

  if ($file[$i] !~ /(?:\t\s){3}/) {           # if not a blank line
                                              # but a "full line"
    push(@filled, $file[$i]);                 # push the header and full lines, until...
  }

  elsif ($file[$i] =~ /(?:\t\s){3}/) {        # ...a blank line is found

    push (@consecutive_lines, $file[$i - 1]); # push preceding full line into array

    while ($file[$i] =~ /(?:\t\s){3}/ and $i < $#file) {  # keep pushing lines, so long as they are blank
                                                          # or end of file
      push(@consecutive_lines, $file[$i++]);
    }

    push(@consecutive_lines, $file[$i]) ;     # else we reach next full line, so push it into array

    my @fillme = linearInterpolation(@consecutive_lines); # send set of lines out for filling

    push(@filled, @fillme);                   # push filled in set of lines into the final array

    @consecutive_lines = ();                  # reset or undef array @consecutive_lines for next cycle

  }    # end of elsif

}    # end of for loop

感谢用户@Kenosis 对上述内容提供了很多帮助，我已经对其进行了修改（希望不会被破坏）。

接下来是线性插值。正是在这里，我试图将脚本的第一阶段链接到第二阶段。到目前为止，它运行得并不好。

我的目标是将数组@incoming交给子程序。然后将该数组拆分，以便实际的单元格条目是“可见的”并且可以由该数组索引，并被调用。我一直在试图弄清楚如何为第 2 列的值执行此操作Distance_1。我觉得这个脚本很接近，并且在计算插值后开始偏离。

#####
# subroutine linear interpolation
#####

sub linearInterpolation {
  my @incoming = @_;    # array of consecutive set of lines

  my @splitup;                  # declare new array, will be a "split up" version of @incoming
  my ($A, $B, $C, $D, $E);      # variables for linear interpolation
  my @fillme;                   # declaring the "emtpy" array to be filled in
  my @dist_1_fills;             # array of interpolated values for dist_1

  for (my $i = 0;
    $i < scalar @incoming; $i++)     # loop to split up lines of @incoming
  {                                  # into indexed cell entries
    chomp $incoming[$i];             # and make new array of them
    my @entries = split('\t', $incoming[$i]);
    push(@splitup, @entries);
  }

  $A = $splitup[1];                   # cell entry in column 2 of preceding full line
  $B = $splitup[-3];                  # cell entry in column 2 of succeeding full line

  $C = $splitup[2];                   # cell entry in column 3 of preceding full line
  $D = $splitup[-2];                  # cell entry in column 3 of succeeding full line
  $E = scalar @incoming - 1;          # equals number of lines in the set minus 1

  for (my $i = 1; $i < $E; $i++) {    # need to start finding appropriate
                                      # number interpolated values, given number of
    my @dist_1_fills =
        interpvalues($A, $B, $E, $i); # of lines in consecutive set of lines

    for ($i = 0; $i < scalar @splitup; $i += 4) {
      push(@fillme, $splitup[$i], $dist_1_fills[$i], "dist_2_fills", "--");
                                      # fourth column values will be ignored or filled with --.
                                      # "dist_2_fills" just occupying it's proper spot until I can figure out distance 1 fills
    }
  }
}

#########

sub interpvalues {                  # subroutine to find interpolated values
  my ($A, $B, $E, $i) = @_;
  my $dist_1_answers = (($B - $A) / ($E)) * $i + $A;
  return $dist_1_answers;
}

代码在处理查找插值并将它们发送回代码的第一部分以最终填充数据集的第二部分变得混乱。我认为我最大的（尽管可能不是我唯一的）问题是在第二个子例程中计算出正确的值之后尝试用正确的值填充空白行。

非常感谢任何提示和线索！

score 1 · Accepted Answer

这个程序会做你需要的。它期望 inout 文件名作为命令行上的参数。

use strict;
use warnings;

my @saved;
my @needed;

while (<>) {
  chomp;
  my @fields = split /\t/;

  # Pass hrough headers and junk
  unless ($fields[0] and $fields[0] =~ /\d/) {
    print "$_\n";
    next;
  }

  # Save x-value for records without a y-value
  if ($fields[1] !~ /\d/) {
    push @needed, $fields[0];
    next;
  }

  # We have a filled-out row. Calculate any intermediate missing ones
  if (@needed) {
    if ($saved[0] == $fields[0]) {
      die sprintf qq(Duplicate marker values %.1f at line %d of "%s"\n), $saved[0], $., $ARGV;
    }
    my ($a1, $b1) = solve_linear(@saved[0,1], @fields[0,1]);
    my ($a2, $b2) = solve_linear(@saved[0,2], @fields[0,2]);
    while (@needed) {
      my $x = shift @needed;
      my $y1 = $a1 * $x + $b1;
      my $y2 = $a2 * $x + $b2;
      print join("\t", $x, $y1, $y2), "\n";
    }
  }

  print "$_\n";
  @saved = @fields;
}

sub solve_linear {
  my ($x0, $y0, $x1, $y1) = @_;
  my ($dx, $dy) = ($x1 - $x0, $y1 - $y0);
  my $aa = $dy / $dx;
  my $bb = ($y0 * $dx - $x0 * $dy)  / $dx;
  return ($aa, $bb);
}

输出

Marker  Distance_1  Distance_2  ID
.
.
.
30  13387412  34.80391242 seq-SN_FIRST
31  13387444  34.80391444 seq-SN_Second
31.1  13387455.1  35.303913996  --
31.2  13387466.2  35.803913552  --
32  13387555  39.80391  seq-SN_Third
.
.
.
Tool completed successfully

score 0 · Accepted Answer

我将代码修改为此，以便线性插值不是基于第一列中的值，而是基于第二列和第三列中的值。特别感谢用户@Kenosis 和@Borodin。我已经接受了 Kenosis 对上一个问题的回答，并且我在这里接受了 Borodin 的回答，尽管我在“回答你自己的问题”部分发布了这个修订。在这里发布修订是否可以接受？我浏览了有关此问题的常见问题解答，但尚未发现任何相关内容。

#!/usr/bin/perl
use strict; use warnings;

my @saved;
my @needed;

while (<>) {
  chomp;
  my @fields = split /\t/;

    # Does the current line actually exist AND does it contain one or more digits.
    unless ($fields[0] and $fields[0] =~ /\d/) {
    # If no, this is the header, so print it. If yes, advance.
    print "$_\n";
    #after printing header, go back to <> and read in next line.
    next;
  }

  # Is the second cell of the current line devoid of digits?
  if ($fields[1] !~ /\d/) {                
  # If no, advance. If yes, remember $field[0], the Marker.
  push @needed, $fields[0];              
  # After pushing, go back to <> and read in next line.
  next;
 }

  # If we are here, we must have a filled-out row.
  # Does @needed have any values? If no, advance. If yes,
  if (@needed) {
    if ($saved[0] == $fields[0]) {
      die sprintf qq(Duplicate marker values %.1f at line %d of "%s"\n), $saved[0], $., $ARGV;
     }
    # Else send preceding dist_1 value, succeeding dist_1 value,
    # preceding dist_2 value, succeeding dist_2 value, 
    # and number of emtpy lines to subroutine.
    my ($dist_1_interval, $dist_2_interval) = interval_sizes($saved[1], $fields[1], $saved[2],   $fields[2], scalar @needed);     
    # Current size of @needed is saved as $size and is used to help with iteration.
# So long as @needed contains values...
my $size = scalar @needed;
while (@needed) {
  # ...remove left-most Marker value from array @needed.
  my $x = shift @needed;
  # Interpolated values for dist_1 and dist_2 are
  # (respective interval size x iteration of while loop) + preceding values.
  my $new_dist_1 = ($dist_1_interval * (1 + ($size - (scalar @needed + 1)))) + $saved[1];
  my $new_dist_2 = ($dist_2_interval * (1 + ($size - (scalar @needed + 1)))) + $saved[2];
  print join("\t", $x, $new_dist_1, $new_dist_2, "--"), "\n";
  }
}     
      # We are here since current line is already a filled-in row.
      print "$_\n";
      # Print this row and assign it to @saved. Return to <>.
      @saved = @fields;
} 

sub interval_sizes {
  # $A = preceding dist_1, $B = succeeding dist_1, 
  # $C = preceding dist_2, $D = succeeding dist_2,
  # $E = number of needed distances.
  my ($A, $B, $C, $D, $E) = @_; 
  # I need an interval size for dist_1 based on difference between $B and $A.
  my $dist_1_interval = ($B - $A)/($E + 1);
  # I need an interval size for dist_2 based on difference between $D and $C.
  my $dist_2_interval = ($D - $C)/($E + 1);
  return ($dist_1_interval, $dist_2_interval);
 }

arrays - 值表中缺失数据的线性插值

2 回答 2

Related

Reference