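I have a file of accession numbers and 16S rRNA sequences. I want to delete all of the RNA sequence lines and keep only the lines with the accession number and species name (removing all the junk in between). My input file looks like this (there is a "> " in front of each accession number):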
> D50541 1 1409 1409bp rna     Abiotrophia defectiva Aerococcaceae
CUGGCGGCGU GCCUAAUACA UGCAAGUCGA ACCGAAGCAU CUUCGGAUGC UUAGUGGCGA ACGGGUGAGU AACACGUAGA UAACCUACCC UAGACUCGAG GAUAACUCCG GGAAACUGGA GCUAAUACUG GAUAAUGGAUAU AGAGAUAAUU UCUUUU...
> AY538167 1 1411 1411bp rna     Acholeplasma hippikon Acholeplasmataceae
CUGGCGGCGU GCCUAAUACA UGCAAGUCGA ACGCUCUAUA GCAAUAUAGG GAGUGGCGAA CGGGUGAGUA ACACGUAGAU AACCUACCCU UACUUCGAGG AUAACUUCGG GAAACUGGAG CUAAUACUGG AUAGGUCAUA UUGAGAUGCAUC UUA ...
I want my output to look like this:
>D50541 Abiotrophia defectiva Aerococcaceae
>AY538167 Acholeplasma hippikon Acholeplasmataceae
The code I wrote does what I want... for most of the lines. It looks like this:
#!/usr/bin/env python
# take LTPs111.compressed fasta and reduce to accession numbers with names.
import re
infilename = 'LTPs111.compressed.fasta'
outfilename = 'acs.fasta'
regex = re.compile(r'(>)\s(\w+).+[rna]\s+([A-Z].+)')
#remove extra letters and spaces
with open(infilename, 'r') as infile, open(outfilename, 'w') as outfile:
    for line in infile:
        x = regex.sub(r'\1\2 \3', line)
        #remove rna sequences
        for line in x:
            if '>' in line:
                outfile.write(x)
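Sometimes the code seems to skip part of the name. For example, for the first accession number above, I only get back:
>D50541 Aerococcaceae
Why does my code do this? The input for every accession number looks the same, and the spacing between "rna" and the name is the same on every line (5 spaces).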
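For anyone who wants to try this without the full file, here is a minimal sketch that reproduces the behaviour on the first header and compares it with one possible alternative pattern. The alternative is only an assumption about the cause: `[rna]` is a character class that matches a single r, n, or a, and the greedy `.+` before it pushes that match as far right as it can, so a species name ending in one of those letters (like "defectiva") gets swallowed. The names `literal`, `reduce_header`, and `reduce_fasta` are made up for this illustration, and the sample header assumes single spaces between the species and family names.

```python
import re

# Pattern from the script above: [rna] matches ONE character (r, n, or a),
# and the greedy .+ before it moves that match as far right as possible.
original = re.compile(r'(>)\s(\w+).+[rna]\s+([A-Z].+)')

# Assumed alternative: require the literal word "rna" and stop at its
# first occurrence (lazy .+? plus word boundaries).
literal = re.compile(r'(>)\s(\w+).+?\brna\b\s+([A-Z].+)')


def reduce_header(pattern, line):
    """Collapse a header to '>ACCESSION name' and squeeze runs of whitespace."""
    reduced = pattern.sub(r'\1\2 \3', line)
    return ' '.join(reduced.split())


def reduce_fasta(lines, pattern=literal):
    """Yield reduced headers only; lines not starting with '>' are skipped."""
    for line in lines:
        if line.startswith('>'):
            yield reduce_header(pattern, line)


# First record from the sample input (sequence shortened here).
sample = [
    "> D50541 1 1409 1409bp rna     Abiotrophia defectiva Aerococcaceae\n",
    "CUGGCGGCGU GCCUAAUACA UGCAAGUCGA\n",
]

print(reduce_header(original, sample[0]))  # species name is lost
print(list(reduce_fasta(sample)))          # species name is kept
```

Checking `line.startswith('>')` also avoids the inner `for line in x:` loop, which walks over the characters of the string rather than over lines.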
Thanks to anyone who might have an idea!