0

我有一个包含多个序列的 FASTA 文件,如下所示:

> AT1G01250.1 | Symbols: | Integrase-type DNA-binding superfamily protein
MSPQRMKLSSPPVTNNEPTATASAVKSCGGGGKETSSSTTRHPVYHGVRKRRWGKWVSEI
REPRKKSRIWLGSFPVPEMAAKAYDVAAFCLKGRKAQLNFPEEIEDLPRPSTCTPRDIQV
AAAKAANAVKIIKMGDDDVAGIDDGDDFWEGIELPELMMSGGGWSPEPFVAGDDATWLVD
GDLYQYQFMACL

> AT1G03800.1 | Symbols: ERF10, ATERF10 | ERF domain protein 10
MTTEKENVTTAVAVKDGGEKSKEVSDKGVKKRKNVTKALAVNDGGEKSKEVRYRGVRRRP
WGRYAAEIRDPVKKKRVWLGSFNTGEEAARAYDSAAIRFRGSKATTNFPLIGYYGISSAT
PVNNNLSETVSDGNANLPLVGDDGNALASPVNNTLSETARDGTLPSDCHDMLSPGVAEAV
AGFFLDLPEVIALKEELDRVCPDQFESIDMGLTIGPQTAVEEPETSSAVDCKLRMEPDLD
LNASP

我想从一个范围内的文件中提取不同序列的不同部分,如下所示:

AT1G01250   45  102
AT1G03800   65  109

我找到了一个基于 Python 的程序atgc-tools,但是对任何输入文件格式都非常挑剔,因此对于大型数据集来说并不方便。任何人都可以建议一个基于 Perl 的解决方案。?

4

1 回答 1

0

这个任务很容易用Biopython完成:

from Bio import SeqIO


records = SeqIO.parse('sequences.fasta', 'fasta')
record_dict = SeqIO.to_dict(records)  # dict of records, accessible by accession

ranges_by_accession = {'AT1G01250.1': (45, 102),
                       'AT1G03800.1': (65, 109)}

for accession, (start, end) in ranges_by_accession.items():
    record = record_dict[accession]
    subseq_record = record[start:end]  # For the full record (with header)
    subseq = record.seq[start:end]  # For only the subsequence
    print subseq

如果您更喜欢 Perl,我建议您查看BioPerl及其相应的模块SeqIO

于 2013-10-04T17:39:09.647 回答