5

我有两个文件,一个是位置信息,另一个是序列信息。现在我需要读取位置并获取位置处的snp并将该位置基替换为序列中的snp信息并将其写入snp信息文件中..例如

Snp 文件包含

10 A C A/C

序列文件包含

ATCGAACTCTACATTAC

这里第 10 个元素是 T,所以我将用 [A/C] 替换 T,所以最终输出应该是

10 A C A/C ATCGAACTC[A/C]ACATTAC

示例文件是

文件

SNP Ref Alt
10  A   C
19  G   C
30  C   T
42  A   G

序列 :

序列1 CTAGAATCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT

最终输出:

SNP Ref Alt Output
10  A   C   CTAGAATCA[A/C]AGCAAGAANACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT
19  G   C   CTAGAATCANAGCAAGAA[C/G]ACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT
30  C   T   CTAGAATCAAAGCAAGAATACACTCTTTT[T/C]TTTGGAAAAGAATATCTCATGTTTGCTCTT
42  A   G   CTAGAATCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGA[A/G]TATCTCATGTTTGCTCTT

在从 Ref 和 A​​lt 列替换此处的 snps 时,我们需要考虑 {A,T,C,G} 的顺序,例如 [Ref/Alt] 始终第一个碱基应该是 A 或 T 或 C,然后是命令。

另一件事是,如果我们取 snp 位置,并且如果有 10 个碱基差异中有任何 snp,我们需要将那个 snp 位置替换为“N”。在上面的示例中,前两个位置的差异为 9,我们将另一个元素替换为“N”。

我编写了代码,它按顺序打印位置,还用该 snp 位置替换序列,但无法读取附近的位置并用 N 替换。

我的方法可能是完全错误的,因为我是编码的初学者。我认为通过使用哈希,我们可以轻松实现这一点,但我对哈希不太熟悉。请提供一些建议。我不必只使用 perl ,

my $input_file = $ARGV[0];
my $snp_file = $ARGV[1];
my $output_file = $ARGV[2];

%sequence_hash = ();

open SNP, $snp_file || die $!;
$indel_count = 0;
$snp_count = 0;
$total_count = 0;

#### hashes and array
@id_array = ();

while ($line = <SNP>) {

    next if $line =~ /^No/;
    $line =~ s/\r|\n//g;


   # if ($line =~ /\tINDEL/) {

    #    $indel_count++;
     #   $snp_type = "INDEL";

    #} else {
     #   $snp_count++;
      #  $snp_type = "SNP";
    #}

    @parts = split(/\t/,$line);

    $id = $parts[0];
    $pos = $parts[1];
    #$ref_base = $parts[3];
    @temp_ref = split(",",$parts[2]);
    $ref_base = $temp_ref[0];
    @alt = split(",",$parts[3]);
    $alt_base = $alt[0];
    $snpformat = '';

    if ($ref_base eq "A" || $alt_base eq "A")
    {

        if ($ref_base eq "A"){
            $snpformat = "[".join("/",$ref_base,$alt_base)."]";}
            #$snpformat = $ref_base/$alt_base;}
            #print "refbase is A $ref_base \t $alt_base \t $snpformat \n"; }

            else 
            {$snpformat = "[".join("/",$alt_base,$ref_base)."]";}
            #print "Altbase is A $ref_base \t $alt_base \t $snpformat \n";}
    }

    elsif ($ref_base eq "T" || $alt_base eq "T")
    {

        if ($ref_base eq "T"){
            $snpformat = "[".join("/",$ref_base,$alt_base)."]";}
            #$snpformat = $ref_base/$alt_base;}
            #print "refbase is A $ref_base \t $alt_base \t $snpformat \n"; }

            else 
            {$snpformat = "[".join("/",$alt_base,$ref_base)."]";}
            #print "Altbase is A $ref_base \t $alt_base \t $snpformat \n";}
    }

    elsif ($ref_base eq "C" || $alt_base eq "C")
    {

        if ($ref_base eq "C"){
            $snpformat = "[".join("/",$ref_base,$alt_base)."]";}
            #$snpformat = $ref_base/$alt_base;}
            #print "refbase is A $ref_base \t $alt_base \t $snpformat \n"; }

            else 
            {$snpformat = "[".join("/",$alt_base,$ref_base)."]";}
            #print "Altbase is A $ref_base \t $alt_base \t $snpformat \n";}
    }


    else 
    {$snpformat = "-/-" ;}
    print " $id \t $pos \t $ref_base \t $alt_base \t $snpformat \n  ";
}

open SEQ, $input_file ||die $!;

$header = '';
$sequence = '';
$num_sequences = 0;

while ($line = <SEQ>) {

    $line =~ s/\r|\n//g;
    next if $line =~ //;

    if ($line =~ /^>(.+)/) {
        if ($header eq '') {

            $header = $1;
            $sequence = '';
            next;
        } else {

            $sequence_len = length($sequence);

            $sequence_hash{$header} = $sequence;
            push (@headers,$header);
            #print $header."\t".$sequence_len."\n";
            #print $sequence."\n";
            $num_sequences++;

            $header = $1;
            $sequence = '';

        }


    } else {

        $sequence .= $line;

    }

}
$sequence_len = length($sequence);
$sequence_hash{$header} = $sequence;
push (@headers,$header);
#print $header."\t".$sequence_len."\n";

$num_sequences++;

close (SEQ);

$pos = '4';
substr($sequence,$pos,1) = "[A/G]";
print $sequence."\n";   
print "$pos \n";
4

3 回答 3

1

awk脚本可能可以帮助您获得所需的结果。

awk '
BEGIN {
print "SNP\tRef\tAlt\tOutput"
}
NR==FNR { 
    a[++i]=$0
    next 
} 
FNR>1 { 
    x=substr(a[i],1,($1-1))
    z=substr(a[i],($1+1))
    if ($2=="A") {
        y="["$2"/"$3"]"
    } 
    else if ($2=="T" && $3=="A") {
        y="["$3"/"$2"]"
    }
    else if ($2=="C" && ($2=="A" || $2=="T")) {
        y="["$3"/"$2"]"
    }
    else if ($2=="G" && ($2=="A" || $2=="T" || $2=="C")) {
        y="["$3"/"$2"]"
    }
    else 
        y="["$3"/"$2"]"
    print $1"\t"$2"\t"$3"\t"x""y""z
}' sequence snp

测试:

[jaypal:~/temp] cat sequence
CTAGAATCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT

[jaypal:~/temp] cat snp
SNP Ref Alt
10  A   C
19  G   C
30  C   T
42  A   G

[jaypal:~/temp] awk '
BEGIN {
print "SNP\tRef\tAlt\tOutput"
}
NR==FNR { 
    a[++i]=$0
    next 
} 
FNR>1 { 
    x=substr(a[i],1,($1-1))
    z=substr(a[i],($1+1))
    if ($2=="A") {
        y="["$2"/"$3"]"
    } 
    else if ($2=="T" && $3=="A") {
        y="["$3"/"$2"]"
    }
    else if ($2=="C" && ($2=="A" || $2=="T")) {
        y="["$3"/"$2"]"
    }
    else if ($2=="G" && ($2=="A" || $2=="T" || $2=="C")) {
        y="["$3"/"$2"]"
    }
    else 
        y="["$3"/"$2"]"
        print $1"\t"$2"\t"$3"\t"x""y""z
}' sequence snp
SNP Ref Alt Output
10  A   C   CTAGAATCA[A/C]AGCAAGAATACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT
19  G   C   CTAGAATCAAAGCAAGAA[C/G]ACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT
30  C   T   CTAGAATCAAAGCAAGAATACACTCTTTT[T/C]TTTGGAAAAGAATATCTCATGTTTGCTCTT
42  A   G   CTAGAATCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGA[A/G]TATCTCATGTTTGCTCTT
于 2013-05-28T02:49:06.513 回答
0

我不是 perl 专家,但我认为这样做可以:

#!/usr/bin/perl

open(SEQ, "seq");
my $seq = <SEQ>;
$seq =~ s/.* //;

print "SNP Ref Alt Output\n";
open(SNP, "snp");
<SNP>;# header line
while(<SNP>)
{
    my($line) = $_;
    chomp($line);
    my ($loc, $ref, $alt) = split(/ +/, $line);
    my $outString = $seq;
    substr($outString, $loc-1, 1, "[$ref/$alt]");
    print $loc."  ".$ref."   ".$alt."   ".$outString."\n";
}
于 2012-10-30T08:10:17.560 回答
0
A=1;T=2;C=3;G=4
echo "SNP Ref Alt Output"
while read l1 l2 l3; do
    lp=$(($l1 - 1))
    eval ol2=\$$l2 && eval ol3=\$$l3
    if [[ $ol2 > $ol3 ]]; then 
        ol2=$l3 && ol3=$l2; 
    else 
        ol2=$l2 && ol3=$l3; 
    fi
    sed "s@[^ ]* \(.\{$lp\}\).\(.*\)@$l1  $l2   $l3   \1[$ol2\/$ol3]\2@" sequence
done < snp 

输出

SNP Ref Alt Output
10  A   C   CTAGAATCA[A/C]AGCAAGAATACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT
19  G   C   CTAGAATCAAAGCAAGAA[C/G]ACACTCTTTTTTTTGGAAAAGAATATCTCATGTTTGCTCTT
30  C   T   CTAGAATCAAAGCAAGAATACACTCTTTT[T/C]TTTGGAAAAGAATATCTCATGTTTGCTCTT
42  A   G   CTAGAATCAAAGCAAGAATACACTCTTTTTTTTGGAAAAGA[A/G]TATCTCATGTTTGCTCTT
于 2013-05-28T03:51:49.013 回答