perl - Remove duplicate lines and sort a table in Perl

Question

I have a table like this:

  + chr13   25017807    6
  + chr10   128074490   1
  - chr7    140968671   1
  + chr10   79171976    3
  - chr7    140968671   1
  + chr12   4054997     6
  + chr13   25017807    6
  + chr15   99504255    6
  - chr8    91568709    5

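It has already been read into Perl as a string variable (the return value of an external shell script). I need to remove the duplicate lines, sort the table by the last column, and then print it. How should I do this in Perl? Thanks!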

score 2 · Accepted Answer

Assuming the data is contained in the string $string, this solution will work:

my %seen;  # just needed to remove duplicates
my $deduped_string =
  join "\n",                     # 6. join the lines to a single string
  map  { join(" ", @$_) }        # 5. join the fields of each line to a string
  sort { $a->[-1] <=> $b->[-1] } # 4. sort arrayrefs by last field, numerically
  map  { [split] }               # 3. split line into fields, store in anon arrayref
  grep { not $seen{$_}++ }       # 2. dedupe the lines
  split /\n/, $string;           # 1. split string into lines

This large expression executes from bottom to top (or right to left). It is built from several composable transformers and filters:

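map {BLOCK} LIST applies the code in the block to each value of the list. It transforms the list element by element.
grep {BLOCK} LIST selects those elements of the list for which the block returns true. It therefore filters the list and outputs only the elements that satisfy a given condition.
sort {BLOCK} LIST re-sorts the list. The block must return -1 (or any negative number) if $a is less than $b, 1 (or any positive number) if it is greater, or zero if they are equal. The <=> operator compares scalars numerically in exactly this way. If the comparison block is omitted, string comparison is used.
join STRING, LIST concatenates the elements of the list with the string placed between them.
split REGEX, STRING breaks a string into pieces. The regex matches the separators (which are normally not returned). split and join can be thought of as inverse operations. If the string is omitted, $_ is used. When the regex is omitted as well, it works like split ' ', $_, that is, it splits on runs of whitespace and ignores leading whitespace.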
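
To see these building blocks in isolation, here is a small sketch on made-up data. The variable names and sample values are illustrative and do not come from the question:

```perl
use strict;
use warnings;

my @numbers = (10, 9, 2, 33);

my @doubled = map  { $_ * 2 } @numbers;     # (20, 18, 4, 66)
my @small   = grep { $_ < 10 } @numbers;    # (9, 2)
my @numeric = sort { $a <=> $b } @numbers;  # (2, 9, 10, 33)
my @stringy = sort @numbers;                # (10, 2, 33, 9): string order
my $joined  = join ",", @numeric;           # "2,9,10,33"
my @parts   = split /,/, $joined;           # ("2", "9", "10", "33")

# A bare split works on $_ and ignores leading whitespace.
my @fields;
for ("  + chr13   25017807    6") {
    @fields = split;                        # ("+", "chr13", "25017807", "6")
}

print "$joined\n";
```

The difference between @numeric and @stringy is the reason the solution above passes an explicit <=> block to sort.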

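The core of this solution is the Schwartzian Transform, a technique/idiom that makes sorting by a key that is expensive to compute cheap, because each key is computed only once. In its general form, it is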

my @sorted_data =
  map  { $_->[0] }                  # 3. map back to the original value
  sort { $a->[1] <=> $b->[1] }      # 2. sort by the special key
  map  { [$_, create_the_key($_)] } # 1. annotate each value with a key
  @data;

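In my specific case, the special key is the last column of each record; to get the original data (or an equivalent form) back from the annotated data, I join the fields together. As mpapec pointed out, I could also have carried the original line through the transform; this would preserve the original alignment of the lines.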
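
A minimal sketch of that variant, which carries the untouched line through the transform, might look like this. The sample contents of $string are made up to match the format in the question, and $sorted_string is a name chosen for illustration:

```perl
use strict;
use warnings;

# Sample input shaped like the table in the question (assumed to be in $string)
my $string = join "\n",
    "  + chr13   25017807    6",
    "  + chr10   128074490   1",
    "  - chr7    140968671   1",
    "  + chr13   25017807    6",
    "  - chr8    91568709    5";

my %seen;
my $sorted_string =
  join "\n",
  map  { $_->[0] }              # 4. recover the untouched original line
  sort { $a->[1] <=> $b->[1] }  # 3. sort numerically by the stored key
  map  { [ $_, (split)[-1] ] }  # 2. pair each line with its last field
  grep { not $seen{$_}++ }      # 1. drop exact duplicate lines
  split /\n/, $string;

print "$sorted_string\n";
```

Because the lines are never split and rejoined for output, the column spacing of the input survives unchanged.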

score 1 · Accepted Answer

For a beginner, I would do it like this:

use strict; use warnings;

my $file = "table.txt";
open(my $fh, "<", $file) || die "Can't open $file: $!\n";

my @lines;

# read the file and save a transformed version to @lines
while (my $line = <$fh>) {
   chomp($line);                   # remove final newline
   $line =~ s/ +/:/g;              # make ":" the new separator
   my @fields = split(/:/,$line);  # split at the separator; the leading spaces
                                   # leave $fields[0] empty, so data is in 1..4
   my $newline = "$fields[4]:$fields[1]:$fields[2]:$fields[3]"; # reorder fields
   push(@lines, $newline);         # save the new line
}

@lines = sort(@lines);  # sort lines as strings (not numerically):
                        # duplicate lines are now consecutive; the order is
                        # right only while the last column has equal-width numbers
my $uniqline="";        # the last unique line

foreach my $line (@lines) {
   # do this if the current line isn't string-equal to the last line
   # (i.e. skip all lines that are equal to the previous line)
   if ($uniqline ne $line) {
      $uniqline = $line;  # remember the last line
      # print fields in original order
      my @fields = split(/:/,$line);
      printf("  %s %7s %11s %s\n",$fields[1],$fields[2],$fields[3],$fields[0]);
   }
}

The result I get is slightly different...

  +   chr10   128074490 1
  -    chr7   140968671 1
  +   chr10    79171976 3
  -    chr8    91568709 5
  +   chr12     4054997 6
  +   chr13    25017807 6
  +   chr15    99504255 6

score 1 · Accepted Answer

Filter out the duplicate lines, then sort by the last column at the end:

perl -ane 'next if $s{$_}++; push @r,[$_,@F]}{ print $$_[0] for sort { $$a[-1] <=> $$b[-1] } @r' file

Almost the same, written out in full:

use strict;
use warnings;

open my $fh, "<", "file" or die $!;
my (%seen_line, @result_unique_lines);
while (<$fh>) {

  # $_ => content of current line

  # skip current if it's duplicate
  next if $seen_line{$_}++;

  my @line_values = split;
  push @result_unique_lines, [$_, @line_values];
}
close $fh;

# sort lines
@result_unique_lines = sort { $a->[-1] <=> $b->[-1] } @result_unique_lines;

for my $aref (@result_unique_lines) {

  my $line = $aref->[0];
  print $line;  
}
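
Since the question has the table in a string rather than in a file, the same loop can also read directly from that string: Perl's open accepts a reference to a scalar in place of a file name. A minimal sketch, assuming the data is in $string (the sample contents are made up to match the question's format, and the names @unique and @sorted are illustrative):

```perl
use strict;
use warnings;

my $string = "  + chr13   25017807    6\n"
           . "  - chr7    140968671   1\n"
           . "  + chr13   25017807    6\n"
           . "  - chr8    91568709    5\n";

# Open a read-only filehandle on the string itself (an in-memory file)
open my $fh, "<", \$string or die "Can't open in-memory file: $!";

my (%seen_line, @unique);
while (my $line = <$fh>) {
    next if $seen_line{$line}++;                       # skip exact duplicates
    push @unique, [ $line, (split ' ', $line)[-1] ];   # keep line and last field
}
close $fh;

# Sort numerically by the stored last field, then keep only the lines
my @sorted = map { $_->[0] } sort { $a->[1] <=> $b->[1] } @unique;
print @sorted;
```

Note that duplicates are detected by exact string comparison, so a final line without a trailing newline would not match an earlier copy that has one.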
