python - 在 python 中的 DNA 字符串中引入突变

Question

例如，给定一个 DNA 字符串AGC。我正在尝试生成所有可能的 uniq 字符串，允许给定字符串中最多 #n （用户定义的数字）不匹配。

我可以通过以下方式对一个不匹配执行此操作，但无法实现递归解决方案以基于#n 不匹配、DNA 字符串和突变集（AGCTN）生成所有可能的组合

temp_dict = {}
sequence = 'AGC'

    for x in xrange(len(sequence)):
        prefix = sequence[:x]
        suffix = sequence[x+1:]
        temp_dict.update([ (prefix+base+suffix,1) for base in 'ACGTN'])
    print temp_dict

一个例子：

对于给定的示例字符串：ACG，以下是 13 个 uniq 序列，最多允许一个不匹配

{'ACC': 1, 'ATG': 1, 'AAG': 1, 'ANG': 1, 'ACG': 1, 'GCG': 1, 'AGG': 1, 
'ACA': 1, 'ACN': 1, 'ACT': 1, 'TCG': 1, 'CCG': 1, 'NCG': 1}

我想概括这一点，以便程序可以采用 100 个字符长的 DNA 字符串并返回允许用户定义的 #mismatches 的 uniq 字符串列表/字典

谢谢！-阿比

score 6 · Accepted Answer

假设我了解您，我认为您可以使用该itertools模块。基本思想是使用选择不匹配的位置，combinations然后使用构造所有令人满意的列表product：

import itertools

def mismatch(word, letters, num_mismatches):
    for locs in itertools.combinations(range(len(word)), num_mismatches):
        this_word = [[char] for char in word]
        for loc in locs:
            orig_char = word[loc]
            this_word[loc] = [l for l in letters if l != orig_char]
        for poss in itertools.product(*this_word):
            yield ''.join(poss)

对于您的示例案例：

>>> mismatch("ACG", "ACGTN", 0)
<generator object mismatch at 0x1004bfaa0>
>>> list(mismatch("ACG", "ACGTN", 0))
['ACG']
>>> list(mismatch("ACG", "ACGTN", 1))
['CCG', 'GCG', 'TCG', 'NCG', 'AAG', 'AGG', 'ATG', 'ANG', 'ACA', 'ACC', 'ACT', 'ACN']

score 3 · Accepted Answer

我相信接受的答案只会给出N不匹配，而不是最多 N。我认为对接受的答案稍作修改应该可以纠正这个问题：

from itertools import combinations,product

def mismatch(word, i = 2):
  for d in range(i+1):
    for locs in combinations(range(len(word)), d):
      thisWord = [[char] for char in word]
      for loc in locs:
        origChar = word[loc]
        thisWord[loc] = [l for l in "ACGT" if l != origChar]
      for poss in product(*thisWord):
        yield "".join(poss)    

kMerList = list(mismatch("AAAA",3))

print kMerList

我对编程完全陌生，所以如果我错了，请纠正我。

python - 在 python 中的 DNA 字符串中引入突变

2 回答 2

Related

Reference