
I need to convert a FASTA header from this format:

gi|351517969|ref|NW_003613580.1| Cricetulus griseus unplaced genomic scaffold, CriGri_1.0 scaffold329, whole genome shotgun sequence

to this:

NW_003613580.1 Cricetulus griseus unplaced genomic scaffold, CriGri_1.0 scaffold329, whole genome shotgun sequence

The W in NW can be a C in other headers, and the number of digits after the underscore varies.

I found a Perl script that changes the ID to a different format and tried to modify it. The relevant part:

    while( $seq = $seq_in->next_seq() ) 
{
    my $seqName = $seq->id;
    $seqName =~ s/\|/\./g; #replace pipe with dot

        $seqName =~ s/(NW\_)/$1/;   

        #$seqName =~ s/(gi\.\w*)\..*/$1/; 

        $seq->id($seqName);
    $seq_out->write_seq($seq);
}

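The commented-out $seqName line is the original. I hoped that changing gi to NW would make the match start later in the header, but no dice. However, changing $1 to arbitrary text does replace the text at NW, so I'm not sure what is going on. Also, the periods that replaced the pipes seem to disappear for no logical reason (although I do want them gone). Any help, or at least some resources on how search and replace works here, would be appreciated.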


6 Answers

3

Split the ID into its components:

my @fastaHeaderComponents = split("\\|", $seq->id);

Then pick out the part you need:

my $accessionId = $fastaHeaderComponents[3];

And rebuild the header. Bio::SeqIO keeps everything after the first whitespace in $seq->desc and writes the leading > and the description itself, so only the ID needs to change:

$seq->id($accessionId);
answered 2012-12-17T22:13:56.500
3

A sed one-liner:

sed -r 's/^([^|]+\|){3}//;s/\|//' file

NW_003613580.1 Cricetulus griseus unplaced genomic scaffold, CriGri_1.0 scaffold329, whole genome shotgun sequence

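The advantage of the sed solution is that you can specify which line the substitution applies to, for example only the first line with 1s, and use the -i option to write the result back to the file: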

sed -ri '1s/^([^|]+\|){3}//;1s/\|//' file

Regex explanation:

s/     # Substitution, 1s/ first line only, 2s/ second line..
^      # Match the start of the line
(      # Group pattern
[^|]+  # Match one or more characters that aren't a |
\|     # Match the | (escaped)
)      # End grouped pattern
{3}    # Repeat grouped pattern 3 times
/      # Replace with 
/      # Nothing
;
s/     # Substitute, 1s/ first line only..
\|     # The remaining |
/      # Replace with
/      # Nothing 
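
For readers who want to try the two substitutions outside sed, here is a minimal sketch of the same logic in Python (not part of the original answer; the function name strip_gi_prefix is made up for illustration). It also shows a caveat: if the header still carries its leading >, the first field includes that character and is removed along with it.

```python
import re

def strip_gi_prefix(header: str) -> str:
    """Mirror the two sed substitutions on a single header line."""
    # First substitution: drop the first three pipe-terminated fields.
    header = re.sub(r'^([^|]+\|){3}', '', header, count=1)
    # Second substitution: drop the first remaining pipe.
    return header.replace('|', '', 1)

header = ('gi|351517969|ref|NW_003613580.1| Cricetulus griseus unplaced '
          'genomic scaffold, CriGri_1.0 scaffold329, whole genome shotgun sequence')
print(strip_gi_prefix(header))
# With a leading '>', the '>' is part of the first field and disappears too.
print(strip_gi_prefix('>' + header))
```

Both calls print the same shortened header, which means a real FASTA file would lose its > marker unless the pattern is anchored after it.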
answered 2012-12-17T22:14:19.370
2

Perhaps the following will help:

use strict;
use warnings;
use Bio::Seq;
use Bio::SeqIO;

my $seq_in  = Bio::SeqIO->new( -file => 'input.fas',   '-format' => 'Fasta' );
my $seq_out = Bio::SeqIO->new( -file => '>output.fas', '-format' => 'Fasta' );

while ( my $seq = $seq_in->next_seq ) {
    my $shortened_seq = Bio::Seq->new(
        -seq        => $seq->seq,    # carry the residues over, otherwise the record is written without sequence
        -desc       => $seq->desc,
        -display_id => ( split /\|/, $seq->id )[-1]
    );

    $seq_out->write_seq($shortened_seq);
}

Given a FASTA header like the following as input:

>gi|351517969|ref|NW_003613580.1| Cricetulus griseus unplaced genomic scaffold, CriGri_1.0 scaffold329, whole genome shotgun sequence

it produces the following output:

>NW_003613580.1 Cricetulus griseus unplaced genomic scaffold, CriGri_1.0 scaffold329, whole genome shotgun sequence
answered 2012-12-18T02:08:28.477
1

This is just a matter of splitting the original header on pipe characters (surrounded by optional whitespace) and rejoining the required fields:

use strict;
use warnings;

my $header = 'gi|351517969|ref|NW_003613580.1| Cricetulus griseus unplaced genomic scaffold, CriGri_1.0 scaffold329, whole genome shotgun sequence';

$header = join ' ', (split /\s*\|\s*/, $header)[3,4];

print $header;

Output:

NW_003613580.1 Cricetulus griseus unplaced genomic scaffold, CriGri_1.0 scaffold329, whole genome shotgun sequence
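
The same split-and-rejoin idea can be sketched in Python for comparison (this is an illustration, not the answerer's code; shorten_header is a hypothetical name, and the check for at least five fields is an added assumption about what counts as a valid header).

```python
import re

def shorten_header(header: str) -> str:
    """Keep only the accession (field 3) and description (field 4)."""
    # Split on pipes, swallowing any whitespace around them.
    fields = re.split(r'\s*\|\s*', header)
    if len(fields) < 5:
        raise ValueError(f'expected at least 5 pipe-separated fields: {header!r}')
    return ' '.join(fields[3:5])

header = ('gi|351517969|ref|NW_003613580.1| Cricetulus griseus unplaced '
          'genomic scaffold, CriGri_1.0 scaffold329, whole genome shotgun sequence')
print(shorten_header(header))
```

Because the whitespace around each pipe is consumed by the split, the description no longer starts with a space and a single joining space is enough.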
answered 2012-12-17T22:17:07.773
1

Short version: split the header string into an array using split.

my @parts = split( /\|/, $seq );

Then use the elements of the array to build the string you want to display.

print $parts[3], $parts[4], "\n";   # $parts[4] already starts with the space that followed the last pipe
answered 2012-12-17T22:15:00.633
0

This might work for you (GNU sed):

sed -r 's/^([^|]*\|){3}(N[WC]_[0-9.]+)\|/\2/' file
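
As a rough Python counterpart of this pattern, the sketch below applies it to a whole FASTA stream. It assumes, unlike the sed command, that header lines start with > and that this marker should be kept; HEADER_RE and rewrite_fasta are names invented for the example.

```python
import re

# Three pipe-terminated fields after '>', then an NW_/NC_ accession and its pipe.
HEADER_RE = re.compile(r'^>(?:[^|]*\|){3}(N[WC]_[0-9.]+)\|')

def rewrite_fasta(lines):
    """Yield lines unchanged, except headers that match HEADER_RE."""
    for line in lines:
        if line.startswith('>'):
            # Keep '>' and the accession; the description follows untouched.
            line = HEADER_RE.sub(r'>\1', line, count=1)
        yield line

records = [
    '>gi|351517969|ref|NW_003613580.1| Cricetulus griseus unplaced genomic scaffold\n',
    'ACGTACGT\n',
    '>gi|1|ref|NC_000001.1| example chromosome\n',
]
print(''.join(rewrite_fasta(records)), end='')
```

Headers that do not match the pattern, and all sequence lines, pass through unchanged, so the function can be run over a mixed file without damaging other records.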
answered 2012-12-18T01:04:53.990