
CHROM    POS     REF     ALT    10_sample.bam   11_sample.bam   12_sample.bam   13_sample.bam   14_sample.bam   15_sample.bam   16_sample.bam 
tg93    77  T   C   T   T   T   T           T
tg93    79  C   -   C       C   C   -   -   
tg93    79  C   G   C   C   C   C   G       C
tg93    80  G   A   G   G   G   G   A   A   G
tg93    81  A   C   A   A   A   A   C   C   C
tg93    86  C   A   C   C   A   A   A   A   C
tg93    105 A   G   A   A   A   A   A   G   A
tg93    108 A   G   A   A   A   A   G   A   A
tg93    114 T   C   T   T   T   T   T   C   T
tg93    131 A   C   A   A   A   A   A   A   A
tg93    136 G   C   C   G   C   C   G   G   G
tg93    150 CTCTC   -       CTCTC       -   CTCTC       CTCTC


CHROM - name, POS - position, REF - reference, ALT - alternate, 10 - 16_sample.bam - sample IDs

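Now I want to count how many times the letters in the REF and ALT columns occur among the samples. If either of them occurs fewer than two times, I need to delete the row.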

For example, in the first row I have "T" in REF and "C" in ALT. Across the 7 samples I see 5 T's and 2 blanks, and no C. So I need to delete this row.

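In the second row, REF is "C" and ALT is "-". Across the seven samples we have 3 C's, 2 "-" and 2 blanks. So we keep this row, because both C and - occur at least 2 times. Blanks are always ignored when counting.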
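The rule described above can be sketched as a small Perl function. This is only an illustration under stated assumptions: the name `keep_row` is hypothetical, blank samples are assumed to arrive as empty strings, and the threshold is taken to be "at least twice", which matches the second example where "-" occurs exactly twice and the row is kept.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the question): returns true when both
# REF and ALT occur at least twice among the sample values.
sub keep_row {
    my ($ref, $alt, @samples) = @_;
    my %counts;
    for my $s (@samples) {
        next unless defined $s && $s =~ /\S/;   # blanks are ignored
        $counts{$s}++;
    }
    return ($counts{$ref} || 0) >= 2 && ($counts{$alt} || 0) >= 2;
}

# Row at POS 77: five T, two blanks, no C.
print keep_row('T', 'C', 'T', 'T', 'T', 'T', '', '', 'T') ? "keep\n" : "drop\n";
# Row at POS 79 with ALT "-": three C, two "-", two blanks.
print keep_row('C', '-', 'C', '', 'C', 'C', '-', '-', '') ? "keep\n" : "drop\n";
```

The first call prints "drop" and the second prints "keep", in line with the two worked examples.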


#CHROM   POS     REF     ALT    10_sample.bam   11_sample.bam   12_sample.bam   13_sample.bam   14_sample.bam   15_sample.bam   16_sample.bam 
tg93    79  C   -   C       C   C   -   -   
tg93    80  G   A   G   G   G   G   A   A   G
tg93    81  A   C   A   A   A   A   C   C   C
tg93    86  C   A   C   C   A   A   A   A   C
tg93    136 G   C   C   G   C   C   G   G   G



2 Answers

#!/usr/bin/env perl
use strict;
use warnings;

print scalar(<>);                   # Read and output the header.

while (<>) {                        # Read a line.
   chomp;                           # Remove the newline from the line.
   my ($chrom, $pos, $ref, $alt, @samples) =
      split /\t/;                   # Parse the remainder of the line.

   my %counts;                      # Count the occurrences of sample values.
   ++$counts{$_} for @samples;      # e.g. Might end up with $counts{"G"} = 3.

   print "$_\n"                     # Print line if we want to keep it.
      if ($counts{$ref} || 0) >= 2  # ("|| 0" avoids a spurious warning.)
      && ($counts{$alt} || 0) >= 2;
}


CHROM    POS     REF     ALT    10_sample.bam   11_sample.bam   12_sample.bam   13_sample.bam   14_sample.bam   15_sample.bam   16_sample.bam 
tg93    79  C   -   C       C   C   -   -   
tg93    80  G   A   G   G   G   G   A   A   G
tg93    81  A   C   A   A   A   A   C   C   C
tg93    86  C   A   C   C   A   A   A   A   C
tg93    136 G   C   C   G   C   C   G   G   G

You included 108 in your desired output, but it has only one instance of ALT among the seven samples.
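As a quick check of that remark, the sample values of the row at POS 108 can be tallied on their own. This is a minimal sketch with the values copied by hand from the question's table. The variable names are illustrative only.

```perl
use strict;
use warnings;

# REF/ALT and the seven sample values for the row at POS 108,
# copied by hand from the question's table.
my ($ref, $alt) = ('A', 'G');
my @samples = qw(A A A A G A A);

# Tally how often each value occurs among the samples.
my %counts;
$counts{$_}++ for @samples;

printf "REF %s: %d, ALT %s: %d\n",
    $ref, $counts{$ref} || 0, $alt, $counts{$alt} || 0;
```

This prints `REF A: 6, ALT G: 1`, so ALT falls below the threshold of two and the row is dropped.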


perl script.pl file.in >file.out


perl -i script.pl file
answered 2012-09-18T17:18:18.613


use IO::All;
my $chrom = "tg93";
my @lines = io('file.txt')->slurp;
foreach(@lines) {
    %letters = ();

    # use regex with backreferences to extract data - this method does not depend on tab separated fields
    if(/$chrom\s+\d+\s+([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])/) {

        # initialize hash counts
        $letters{$1} = 0;
        $letters{$2} = 0;

        # loop through the samples and increment the counter when matches are found
        foreach($3, $4, $5, $6, $7, $8, $9) {
            if ($_ eq $1) {
                $letters{$1}++;
            }
            if ($_ eq $2) {
                $letters{$2}++;
            }
        }

        # if the counts for both REF and ALT are greater than or equal to 2, print the line
        if($letters{$1} >= 2 && $letters{$2} >= 2) {
            print $_;
        }
    }
}
answered 2012-09-18T18:04:39.927