perl - Read a tab-delimited file, count occurrences, and delete rows

Question

I am fairly new to programming and am trying to solve this problem. I have a file like this.

CHROM    POS     REF     ALT    10_sample.bam   11_sample.bam   12_sample.bam   13_sample.bam   14_sample.bam   15_sample.bam   16_sample.bam 
tg93    77  T   C   T   T   T   T           T
tg93    79  C   -   C       C   C   -   -   
tg93    79  C   G   C   C   C   C   G       C
tg93    80  G   A   G   G   G   G   A   A   G
tg93    81  A   C   A   A   A   A   C   C   C
tg93    86  C   A   C   C   A   A   A   A   C
tg93    105 A   G   A   A   A   A   A   G   A
tg93    108 A   G   A   A   A   A   G   A   A
tg93    114 T   C   T   T   T   T   T   C   T
tg93    131 A   C   A   A   A   A   A   A   A
tg93    136 G   C   C   G   C   C   G   G   G
tg93    150 CTCTC   -       CTCTC       -   CTCTC       CTCTC

In this file, the header columns are:

CHROM - name, POS - position, REF - reference, ALT - alternate, 10 - 16_sample.bam - samples

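Now I want to see how many times the letters in the REF and ALT columns occur among the samples. If either of them occurs fewer than two times, I need to delete the row.

For example, in the first row I have "T" in REF and "C" in ALT. Across the 7 samples I see 5 T's and 2 blanks, and no C. So I need to delete this row.

In the second row, REF is "C" and ALT is "-". Across the seven samples we have 3 C's, 2 "-" and 2 blanks. So we keep this row, because both C and - occur at least 2 times. Blanks are always ignored when counting.

The final file after filtering is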

#CHROM   POS     REF     ALT    10_sample.bam   11_sample.bam   12_sample.bam   13_sample.bam   14_sample.bam   15_sample.bam   16_sample.bam 
tg93    79  C   -   C       C   C   -   -   
tg93    80  G   A   G   G   G   G   A   A   G
tg93    81  A   C   A   A   A   A   C   C   C
tg93    86  C   A   C   C   A   A   A   A   C
tg93    136 G   C   C   G   C   C   G   G   G

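I was able to read the columns into an array and display them in my code, but I'm not sure how to start a loop that reads the bases, counts their occurrences, and decides which rows to keep. Can anyone tell me how I should approach this? Or, if you have any sample code that I could modify, that would be helpful.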

score 2 · Accepted Answer

#!/usr/bin/env perl
use strict;
use warnings;

print scalar(<>);                   # Read and output the header.

while (<>) {                        # Read a line.
   chomp;                           # Remove the newline from the line.
   my ($chrom, $pos, $ref, $alt, @samples) =
      split /\t/;                   # Parse the remainder of the line.

   my %counts;                      # Count the occurrences of sample values.
   ++$counts{$_} for @samples;      # e.g. Might end up with $counts{"G"} = 3.

   print "$_\n"                     # Print line if we want to keep it.
      if ($counts{$ref} || 0) >= 2  # ("|| 0" avoids a spurious warning.)
      && ($counts{$alt} || 0) >= 2;
}

Output:

CHROM    POS     REF     ALT    10_sample.bam   11_sample.bam   12_sample.bam   13_sample.bam   14_sample.bam   15_sample.bam   16_sample.bam 
tg93    79  C   -   C       C   C   -   -   
tg93    80  G   A   G   G   G   G   A   A   G
tg93    81  A   C   A   A   A   A   C   C   C
tg93    86  C   A   C   C   A   A   A   A   C
tg93    136 G   C   C   G   C   C   G   G   G

You included 108 in your desired output, but it has only one instance of ALT among the seven samples.
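A quick way to check a single row is to tally REF and ALT among its sample columns. The following is only a minimal sketch: `ref_alt_counts` is a hypothetical helper name, not part of the script above, and it assumes the row is tab-separated with four leading columns, as in the question.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the answer's script): returns how often
# REF and ALT appear among the sample columns of one tab-separated row.
sub ref_alt_counts {
    my ($line) = @_;
    chomp $line;
    my (undef, undef, $ref, $alt, @samples) = split /\t/, $line;

    my %counts;
    $counts{$_}++ for grep { length } @samples;   # skip blank samples

    # "|| 0" turns a missing key into a count of zero.
    return ($counts{$ref} || 0, $counts{$alt} || 0);
}

# Position 108 from the question's input: REF = A, ALT = G.
my $row = join "\t", qw(tg93 108 A G A A A A G A A);
my ($ref_n, $alt_n) = ref_alt_counts($row);
print "REF seen $ref_n times, ALT seen $alt_n times\n";
```

For the row at position 108 this reports 6 for REF and 1 for ALT, so the row does not meet the "at least 2 of each" rule.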

Usage:

perl script.pl file.in >file.out

Or in place:

perl -i script.pl file

score 0 · Accepted Answer

Here is an approach that does not assume the fields are tab-separated:

use strict;
use warnings;
use IO::All;
my $chrom = "tg93";
my @lines = io('file.txt')->slurp;
foreach(@lines) {
    my %letters = ();

    # use regex with backreferences to extract data - this method does not depend on tab separated fields
    if(/$chrom\s+\d+\s+([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])/) {

        # initialize hash counts
        $letters{$1} = 0;
        $letters{$2} = 0;

        # loop through the samples and increment the counter when matches are found
        foreach($3, $4, $5, $6, $7, $8, $9) {
            if ($_ eq $1) {
                ++$letters{$1};
            }
            if ($_ eq $2) {
                ++$letters{$2};
            }
        } 

        # if the counts for both REF and ALT are greater than or equal to 2, print the line
        if($letters{$1} >= 2 && $letters{$2} >= 2) {
            print $_;
        }
    }
}
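One thing to keep in mind with any whitespace-based parsing is that blank sample cells carry information in this data. As a small illustration (the row below is rebuilt from the question's first data line, assuming real tabs between fields), splitting on tabs keeps the blank cells as empty strings, while splitting on runs of whitespace collapses them:

```perl
use strict;
use warnings;

# Row for position 77: REF = T, ALT = C, seven samples, two of them blank.
my $row = join "\t", 'tg93', 77, 'T', 'C', 'T', 'T', 'T', 'T', '', '', 'T';

my @by_tab   = split /\t/, $row;   # blank samples stay as empty strings
my @by_space = split ' ',  $row;   # runs of whitespace collapse into one

printf "tab split:        %d fields\n", scalar @by_tab;
printf "whitespace split: %d fields\n", scalar @by_space;
```

With the tab split all seven sample positions survive, so each value can still be matched to its sample column. With the whitespace split the two blank cells vanish, and the remaining values can no longer be matched to columns.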
