我在网上搜索“弓形虫寄生虫基因组”以找到其中一些基因组文件。我在http://toxodb.org找到了一个我认为很接近的文件,一个名为“TgondiiGenomic_ToxoDB-6.0.fasta”的文件,大小约为 158Mb。我使用以下 pyparsing 表达式来提取基因序列,只用了不到 2 分钟:
fname = "TgondiiGenomic_ToxoDB-6.0.fasta"
fastasrc = open(fname).read() # yes! just read the whole dang 158Mb!
"""
Sample header:
>gb|scf_1104442823584 | organism=Toxoplasma_gondii_VEG | version=2008-07-23 | length=1448
"""
integer = Word(nums).setParseAction(lambda t:int(t[0]))
genebit = Group(">gb|" + Word(printables)("id") + SkipTo("length=") +
"length=" + integer("genelen") + LineEnd() +
Combine(OneOrMore(Word("ACGTN")),adjacent=False)("gene"))
# read gene data from .fasta file - takes just under a couple of minutes
genedata = OneOrMore(genebit).parseString(fastasrc)
(惊喜!一些基因序列包括'N'的运行!这到底是什么?!)
然后我把这个类写成 pyparsing Token 类的子类,用于进行密切匹配:
class CloseMatch(Token):
def __init__(self, seq, maxMismatches=1):
super(CloseMatch,self).__init__()
self.name = seq
self.sequence = seq
self.maxMismatches = maxMismatches
self.errmsg = "Expected " + self.sequence
self.mayIndexError = False
self.mayReturnEmpty = False
def parseImpl( self, instring, loc, doActions=True ):
start = loc
instrlen = len(instring)
maxloc = start + len(self.sequence)
if maxloc <= instrlen:
seq = self.sequence
seqloc = 0
mismatches = []
throwException = False
done = False
while loc < maxloc and not done:
if instring[loc] != seq[seqloc]:
mismatches.append(seqloc)
if len(mismatches) > self.maxMismatches:
throwException = True
done = True
loc += 1
seqloc += 1
else:
throwException = True
if throwException:
exc = self.myException
exc.loc = loc
exc.pstr = instring
raise exc
return loc, (instring[start:loc],mismatches)
对于每个匹配,这将返回一个元组,其中包含匹配的实际字符串,以及不匹配位置的列表。精确匹配当然会为第二个值返回一个空列表。(我喜欢这个类,我想我会把它添加到pyparsing的下一个版本中。)
然后,我运行此代码以在从 .fasta 文件读取的所有序列中搜索“最多 2 个不匹配”匹配项(回想一下,genedata 是 ParseResults 组的序列,每个组包含一个 id、一个整数长度和一个序列字符串):
searchseq = CloseMatch("ATCATCGAATGGAATCTAATGGAAT", 2)
for g in genedata:
print "%s (%d)" % (g.id, g.genelen)
print "-"*24
for t,startLoc,endLoc in searchseq.scanString(g.gene):
matched, mismatches = t[0]
print "MATCH:", searchseq.sequence
print "FOUND:", matched
if mismatches:
print " ", ''.join(' ' if i not in mismatches else '*'
for i,c in enumerate(searchseq.sequence))
else:
print "<exact match>"
print "at location", startLoc
print
print
我从一个基因位中随机提取了搜索序列,以确保我能找到精确匹配,只是出于好奇,看看有多少 1 元素和 2 元素不匹配。
这需要一段时间才能运行。45 分钟后,我得到了这个输出,列出了每个 id 和基因长度,以及找到的任何部分匹配:
scf_1104442825154 (964)
------------------------
scf_1104442822828 (942)
------------------------
scf_1104442824510 (987)
------------------------
scf_1104442823180 (1065)
------------------------
...
我越来越气馁,直到:
scf_1104442823952 (1188)
------------------------
MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAACGGAATCGAATGGAAT
* *
at location 33
MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCGAATGGAAT
*
at location 175
MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCGAATGGAAT
*
at location 474
MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCGAATGGAAT
*
at location 617
MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCGAATAGAAT
* *
at location 718
MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGATTCGAATGGAAT
* *
at location 896
MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCGAATGGTAT
* *
at location 945
最后是我的完全匹配:
scf_1104442823584 (1448)
------------------------
MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGACTCGAATGGAAT
* *
at location 177
MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCAAATGGAAT
*
at location 203
MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCAAATGGAATCGAATGGAAT
* *
at location 350
MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCGAATGGAAA
* *
at location 523
MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCAAATGGAATCGAATGGAAT
* *
at location 822
MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCATCGAATGGAATCTAATGGAAT
<exact match>
at location 848
MATCH: ATCATCGAATGGAATCTAATGGAAT
FOUND: ATCGTCGAATGGAGTCTAATGGAAT
* *
at location 969
因此,虽然这没有创造任何速度记录,但我完成了工作,并且还找到了一些 2 匹配项,以防他们可能感兴趣。
为了比较,这里是一个基于 RE 的版本,它只查找 1 不匹配的匹配项:
import re
seqStr = "ATCATCGAATGGAATCTAATGGAAT"
searchSeqREStr = seqStr + '|' + \
'|'.join(seqStr[:i]+"[ACTGN]".replace(c,'') +seqStr[i+1:]
for i,c in enumerate(seqStr))
searchSeqRE = re.compile(searchSeqREStr)
for g in genedata:
print "%s (%d)" % (g.id, g.genelen)
print "-"*24
for match in searchSeqRE.finditer(g.gene):
print "MATCH:", seqStr
print "FOUND:", match.group(0)
print "at location", match.start()
print
print
(起初,我尝试搜索原始 FASTA 文件源本身,但很困惑为什么与 pyparsing 版本相比匹配项如此之少。然后我意识到有些匹配项必须越过换行符,因为 fasta 文件输出被包装在 n人物。)
因此,在第一次 pyparsing 通过提取要匹配的基因序列之后,这个基于 RE 的搜索器又花了大约 1-1/2 分钟来扫描所有未经过文本包装的序列,以找到所有相同的 1-mismatch 条目pyparsing 解决方案做到了。