我正在尝试根据前 5 个和后 5 个字符对齐匹配序列。因此,例如:
AAATGCEGAIRPVOGJKKK
KKKTGKAFKEJWKLJFFFF
FFFKEJFWKFJWEKFJIGK
将对齐并连接以创建:
AAATGCEGAIRPVOGJKKKTGKAFKEJWKLJFFFFKEJFWKFJWEKFJIGK
请注意,映射区域不应重复。我实际上有超过 3 行,并且它们没有排序,因此我试图创建一个循环以将所有它们对齐在一起。我不确定解决这个问题的最佳方法。
我正在尝试根据前 5 个和后 5 个字符对齐匹配序列。因此,例如:
AAATGCEGAIRPVOGJKKK
KKKTGKAFKEJWKLJFFFF
FFFKEJFWKFJWEKFJIGK
将对齐并连接以创建:
AAATGCEGAIRPVOGJKKKTGKAFKEJWKLJFFFFKEJFWKFJWEKFJIGK
请注意,映射区域不应重复。我实际上有超过 3 行,并且它们没有排序,因此我试图创建一个循环以将所有它们对齐在一起。我不确定解决这个问题的最佳方法。
我认为你可以通过制作一个前缀字典来解决这个问题。一旦你有了它,从一个任意序列开始,发现它的后缀在前缀列表中。然后转到具有该前缀的序列,构建一个链。
这是一些代码:
def make_circular_overlapping_sequence(sequences, min_overlap=1, max_overlap=5):
# start by mapping prefixes to full sequences
prefixes = {}
for seq in sequences:
for length in range(min_overlap, max_overlap+1):
prefixes[seq[:length]] = seq
# pick arbitrary a start sequence
start = current = sequences[0]
# build a chain of sequences with overlapping suffixes and prefixes
chain = [start]
while True:
# try longest suffixes first
for length in range(max_overlap, min_overlap-1, -1):
suffix = current[-1-length:]
if suffix in prefixes:
current = prefixes[suffix]
if current == start: # looped around, so we're done
return "".join(chain)
chain.append(current[length+1:]) # don't duplicate the prefix
break
else: # for loop ended without breaking
raise ValueError("No match found for sequence {!r}"
.format(current))
测试输出:
>>> sequences = '''AAATGCEGAIRPVOGJKKK
KKKTGKAFKEJWKLJFFFF
FFFKEJFWKFJWEKFJIGK
GKXYZ1234AAAT'''.split()
>>> make_circular_overlapping_sequence(sequences)
'AAATGCEGAIRPVOGJKKKTGKAFKEJWKLJFFFFKEJFWKFJWEKFJIGKXYZ1234AAAT'
笔记:
"...ABC"
,后面可能跟一个"ABC..."
or "BC..."
),则可能无法正常工作。目前它总是选择最长的重叠。["A...B", "B..C", "C..B"]
),它可能会永远运行。return "".join(chain)[:-length]