I have two sequence files. Say, ham1.txt:

AAACCCTTTGGG
AGGTACTTTTTT
TCTCTTTTTTTT

and so on.

ham2.txt:

AAACCCTTTGGG
GAGAGGGAGGGC
AGGTACTTTTTT
CTCTTAATTTCC
TCTCTTTTTTTT
GTTTTTAAAAAA

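I want to match the sequences in ham1.txt against the sequences in ham2.txt, based on which pair has the smallest Hamming distance. My Python code prints the Hamming distance for every pair, but I only want the best-fitting pair. Here is my code: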

def hamming_distance(s1, s2):
    #Return the Hamming distance between equal-length sequences
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

with open('ham1.txt', 'r') as file1:
    for s1 in file1:
        s1 = s1.strip()  # drop the trailing newline
        with open('ham2.txt', 'r') as file2:
            for s2 in file2:
                s2 = s2.strip()
                dist = hamming_distance(s1, s2)
                print(s1, s2, dist)

Can you suggest an edit? Thanks.

3 Answers

You should take a look at itertools.product:

In [7]:

L1 = ['AAACCCTTTGGG',
      'AGGTACTTTTTT',
      'TCTCTTTTTTTT']
L2 = ['AAACCCTTTGGG',
      'GAGAGGGAGGGC',
      'AGGTACTTTTTT',
      'CTCTTAATTTCC',
      'TCTCTTTTTTTT',
      'GTTTTTAAAAAA']
def hamming_distance(s1, s2):
    #Return the Hamming distance between equal-length sequences
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))
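import itertools
# Pair every sequence in L1 with every sequence in L2, storing the distance first
# so that sorting puts the closest pair at the front
res = [[hamming_distance(*item), item[0], item[1]] for item in itertools.product(L1, L2)]
sorted(res)[0]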
Out[7]:
[0, 'AAACCCTTTGGG', 'AAACCCTTTGGG']
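Note that this returns the single closest pair overall. If what you want is the closest L2 entry for each L1 entry, a minimal sketch could look like the following. The helper name closest_matches is my own invention for illustration, and ties are broken by picking the alphabetically smaller candidate, which is an assumption rather than anything the question specifies.

```python
def hamming_distance(s1, s2):
    # Return the Hamming distance between equal-length sequences
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))


def closest_matches(queries, candidates):
    # Hypothetical helper: map each query to the (distance, candidate) tuple
    # with the smallest distance; ties go to the smaller candidate string
    return {q: min((hamming_distance(q, c), c) for c in candidates)
            for q in queries}


L1 = ['AAACCCTTTGGG',
      'AGGTACTTTTTT',
      'TCTCTTTTTTTT']
L2 = ['AAACCCTTTGGG',
      'GAGAGGGAGGGC',
      'AGGTACTTTTTT',
      'CTCTTAATTTCC',
      'TCTCTTTTTTTT',
      'GTTTTTAAAAAA']

best = closest_matches(L1, L2)
for query, (dist, match) in best.items():
    print(query, match, dist)
```

Because every L1 sequence also appears in L2 here, each query simply matches itself at distance 0.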
answered 2014-10-16T00:22:23.817

I would use functools.reduce:

from functools import reduce


def hamming_distance(s1, s2):
    #Return the Hamming distance between equal-length sequences
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

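if __name__ == '__main__':
    # File names follow the question (ham1.txt / ham2.txt)
    with open('ham1.txt') as f:
        f1 = f.read().splitlines()

    with open('ham2.txt') as f:
        f2 = f.read().splitlines()

    # For each line of the first file, fold over the second file and keep
    # whichever candidate is closer to that line
    for line in f1:
        print(line, reduce(lambda x, y: x if hamming_distance(line, y) > hamming_distance(line, x) else y, f2))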

Output:

AAACCCTTTGGG AAACCCTTTGGG
AGGTACTTTTTT AGGTACTTTTTT
TCTCTTTTTTTT TCTCTTTTTTTT
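The same per-line selection can also be sketched with the built-in min and a key function. The helper name best_match is hypothetical, and the sequences are inlined here instead of being read from files so that the snippet runs on its own. One difference worth knowing: on a tie, min keeps the first candidate, whereas the reduce lambda above keeps the last one.

```python
def hamming_distance(s1, s2):
    # Return the Hamming distance between equal-length sequences
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))


def best_match(line, candidates):
    # Hypothetical helper: return the candidate closest to `line`;
    # min() returns the first candidate among equally close ones
    return min(candidates, key=lambda c: hamming_distance(line, c))


# Inlined stand-ins for the contents of the two files
f1 = ['AAACCCTTTGGG', 'AGGTACTTTTTT', 'TCTCTTTTTTTT']
f2 = ['AAACCCTTTGGG', 'GAGAGGGAGGGC', 'AGGTACTTTTTT',
      'CTCTTAATTTCC', 'TCTCTTTTTTTT', 'GTTTTTAAAAAA']

for line in f1:
    print(line, best_match(line, f2))
```

This also computes each distance only once per candidate, while the lambda recomputes two distances at every step of the fold.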
answered 2014-10-16T00:37:37.353

I generated the following list:

0 AAACCCTTTGGG AAACCCTTTGGG
0 AGGTACTTTTTT AGGTACTTTTTT
0 TCTCTTTTTTTT TCTCTTTTTTTT
6 AGGTACTTTTTT TCTCTTTTTTTT
6 TCTCTTTTTTTT AGGTACTTTTTT
7 AAACCCTTTGGG AGGTACTTTTTT
7 AGGTACTTTTTT AAACCCTTTGGG
8 AAACCCTTTGGG TCTCTTTTTTTT
8 AGGTACTTTTTT CTCTTAATTTCC
8 TCTCTTTTTTTT AAACCCTTTGGG
8 TCTCTTTTTTTT CTCTTAATTTCC
9 AAACCCTTTGGG GAGAGGGAGGGC
9 TCTCTTTTTTTT GTTTTTAAAAAA
10 AAACCCTTTGGG CTCTTAATTTCC
11 AGGTACTTTTTT GAGAGGGAGGGC
11 AGGTACTTTTTT GTTTTTAAAAAA
12 AAACCCTTTGGG GTTTTTAAAAAA
12 TCTCTTTTTTTT GAGAGGGAGGGC

I guess this is what you need, right?

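To achieve this, I used a couple of libraries. First, I convert the data stream/string into a list of values, then take every possible combination of ham1 and ham2 to build a new list that also contains the Hamming distance, and finally sort that list.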

Does this help you? If not, just ask and I'll help you ;)

The code I used is below.

from distance import hamming  # third-party "distance" package
from itertools import product

ham1="""
AAACCCTTTGGG
AGGTACTTTTTT
TCTCTTTTTTTT
"""

ham2="""
AAACCCTTTGGG
GAGAGGGAGGGC
AGGTACTTTTTT
CTCTTAATTTCC
TCTCTTTTTTTT
GTTTTTAAAAAA
"""

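# Drop the empty strings produced by the leading and trailing newlines
ham1data = [line for line in ham1.splitlines() if line]
ham2data = [line for line in ham2.splitlines() if line]

# One (distance, h1, h2) tuple per pair, so sorting orders by distance first
res = [(hamming(h1, h2), h1, h2) for h1, h2 in product(ham1data, ham2data)]

for v, h1, h2 in sorted(res):
    print(v, h1, h2)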
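If you only want the closest pair(s) rather than the full sorted list, here is a rough sketch. To keep it dependency-free I swapped distance.hamming for a small hand-written hamming function, and the names smallest and best_pairs are made up for this example.

```python
from itertools import product


def hamming(s1, s2):
    # Stand-in for distance.hamming (an assumption made to avoid the
    # third-party dependency): count positions where the strings differ
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))


ham1data = ['AAACCCTTTGGG', 'AGGTACTTTTTT', 'TCTCTTTTTTTT']
ham2data = ['AAACCCTTTGGG', 'GAGAGGGAGGGC', 'AGGTACTTTTTT',
            'CTCTTAATTTCC', 'TCTCTTTTTTTT', 'GTTTTTAAAAAA']

res = [(hamming(h1, h2), h1, h2) for h1, h2 in product(ham1data, ham2data)]

# Find the smallest distance, then keep every pair that reaches it
smallest = min(res)[0]
best_pairs = [(h1, h2) for v, h1, h2 in res if v == smallest]

for h1, h2 in best_pairs:
    print(smallest, h1, h2)
```

Keeping all pairs that share the minimum avoids silently dropping ties, which matters here because three pairs have distance 0.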
answered 2014-10-16T00:36:32.453