-2

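Is it possible to iterate over a list multiple times?

Basically, I have a list of strings and I am looking for the longest superstring. Every string in the list overlaps its neighbor by at least half its length, and they are all the same size. I want to check whether the superstring I am building starts or ends with each sequence in the list. When I find a match, I want to add that element to my superstring, remove it from the list, and then loop over the list again and again until it is empty.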

sequences=['ATTAGACCTG','CCTGCCGGAA','AGACCTGCCG','GCCGGAATAC']
halfway= len(sequences[0])//2     # integer division, so the result can be used in range()
genome=sequences[0]     # this is the string that will be added onto throughout the loop
sequences.remove(sequences[0]) 


for j in range(len(sequences)):
    for sequence in sequences:
        front=[]
        back=[]
        for i in range(halfway,len(sequence)):

            if genome.endswith(sequence[:i]):
                genome=genome+sequence[i:] 
                sequences.remove(sequence)

            elif genome.startswith(sequence[-i:]):
                genome=sequence[:-i]+genome  # prepend only the part before the overlap
                sequences.remove(sequence)
'''
            elif not genome.startswith(sequence[-i:]) or not genome.endswith(sequence[:i]):

                sequences.remove(sequence)      # this doesnt seem to work want to get rid of 
                                                #sequences that are in the middle of the string and 
                                                 #already accounted for 
'''

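This works when I leave out the final elif statement, and it gives me the correct answer, ATTAGACCTGCCGGAATAC. However, when I do this with a much larger list of strings, I am still left with strings in the list, which I expected to be empty. Also, is the last loop even necessary if I am only looking for strings to add onto the front and back of the superstring (genome in my code)?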


2 Answers

0

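This is how I ended up solving it. I realized that all you need to do is find out which string is the start of the superstring. Since we know the sequences overlap by 1/2 or more, I looked for the sequence whose first half is not contained in any other sequence. From there I loop over the list a number of times equal to the length of the list, looking for a sequence whose beginning matches the end of the genome. When I find one, I add the sequence to the genome (the superstring), remove it from the list, and keep iterating. With a list of 50 sequences of length 1000, this code takes about .806441 (presumably seconds) to run.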

sequences = ['ATTAGACCTG', 'CCTGCCGGAA', 'AGACCTGCCG', 'GCCGGAATAC']  # sample data from the question

def moveFirstSeq(seqList): # move the first sequence in genome to the end of list
    d = {}
    for seq in seqList:
        count = 0
        for seq1 in seqList:
            if seq == seq1:
                continue  # skip comparing a sequence with itself
            if seq[0:len(seq) // 2] not in seq1:
                count += 1
        d[seq] = count  # number of other sequences that do not contain the first half of seq

    # the starting sequence is the one whose first half is missing from the most other sequences
    first_sequence = max(d, key=d.get)
    seqList.remove(first_sequence)
    seqList.append(first_sequence)
    return seqList


seq = moveFirstSeq(sequences)
genome = seq.pop(-1)   # added first sequence to genome and removed from list

for j in range(len(sequences)):   # looping over the list amount of times equal to the length of the sequence list
    for sequence in sequences[:]:   # iterate over a copy so that remove() does not skip elements
        for i in range(len(sequence) // 2, len(sequence)):
            if genome.endswith(sequence[:i]):
                genome = genome + sequence[i:]  # adding onto the superstring and
                sequences.remove(sequence)      # removing it from the sequence list
                break  # stop once this sequence has been merged

print(genome, seq)
answered 2018-02-14T18:49:53.607
0

Try this:

sequences=['ATTAGACCTG','CCTGCCGGAA','AGACCTGCCG','GCCGGAATAC']
sequences.reverse()
genome = sequences.pop(-1)     # this is the string that will be added onto throughout the loop

unrelated = []

while(sequences):
    sequence = sequences.pop(-1)
    if sequence in genome: continue
    found=False
    for i in range(3,len(sequence)):
        if genome.endswith(sequence[:i]):
            genome=genome+sequence[i:]
            found = True
            break
        elif genome.startswith(sequence[-i:]):
            genome=sequence[:-i]+genome  # prepend only the part before the overlap
            found = True
            break
    if not found:
        unrelated.append(sequence)

print(genome)
#ATTAGACCTGCCGGAATAC
print(sequences)
#[]
print(unrelated)
#[]

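I don't know whether you are guaranteed not to have multiple unrelated sequences in the same batch, so I allowed for unrelated ones. If that isn't needed, feel free to remove it.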

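Python handles removal from the front of a list very inefficiently, so I reversed the list and pull from the back. Depending on the full data (as with your example data), the reversal may not be needed.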
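If you would rather keep the original order than reverse the list, one option is `collections.deque`, whose `popleft()` removes from the front in constant time. The following is only a sketch under that assumption: `assemble_left_to_right` and `min_overlap` are names I made up for illustration, and the prepend step uses `sequence[:-i]` (the part before the overlap).

```python
from collections import deque


def assemble_left_to_right(seqs, min_overlap=3):
    """Hypothetical helper: same matching idea, but pulls from the front of a deque."""
    pending = deque(seqs)
    genome = pending.popleft()  # the first sequence seeds the superstring
    unrelated = []
    while pending:
        sequence = pending.popleft()  # O(1), unlike list.pop(0)
        if sequence in genome:
            continue  # already covered by the superstring
        for i in range(min_overlap, len(sequence)):
            if genome.endswith(sequence[:i]):
                genome += sequence[i:]  # append the non-overlapping tail
                break
            if genome.startswith(sequence[-i:]):
                genome = sequence[:-i] + genome  # prepend the non-overlapping head
                break
        else:
            unrelated.append(sequence)  # no overlap found in this pass
    return genome, unrelated


print(assemble_left_to_right(
    ['ATTAGACCTG', 'CCTGCCGGAA', 'AGACCTGCCG', 'GCCGGAATAC']))
```

For a handful of short sequences the difference is negligible; it only starts to matter once the list gets long.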

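I pop from the sequences list for as long as sequences are available, which avoids removing elements from a list while iterating over it. Then I check whether the sequence is already in the final genome. If it isn't, I run the endswith/startswith checks. If a match is found, I splice it into the genome, set the found flag, and break out of the for loop.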
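To illustrate why removing from a list while iterating over it leaves elements behind, here is a minimal sketch. The function names are made up for illustration; the behavior itself is standard Python: the `for` loop advances by index, so each `remove()` shifts the next element into a slot the loop has already passed.

```python
def remove_while_iterating(items):
    """Hypothetical demo: remove every visited element during a for loop."""
    items = list(items)  # work on a copy of the input
    for item in items:
        items.remove(item)  # shifts later elements left, so the next one is skipped
    return items


def drain_with_pop(items):
    """Hypothetical demo: consume the list with pop() instead of iterating."""
    pending = list(items)
    seen = []
    while pending:
        seen.append(pending.pop())  # takes from the back, nothing is skipped
    return seen, pending


print(remove_while_iterating(['a', 'b', 'c', 'd']))  # ['b', 'd'] are left behind
print(drain_with_pop(['a', 'b', 'c', 'd']))          # every element seen, list empty
```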

If the sequence is not already contained and no partial match is found, it is put into unrelated.
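One caveat: a sequence can land in unrelated simply because the piece that bridges it to the genome has not been merged yet. A possible way around that, sketched below under my own assumptions, is to retry the leftovers until a full pass makes no progress. `merge`, `assemble_with_retry` and `min_overlap` are hypothetical names, and the prepend step uses `sequence[:-i]` (the part before the overlap).

```python
def merge(genome, sequence, min_overlap):
    """Hypothetical helper: return the extended genome, or None if nothing overlaps."""
    for i in range(min_overlap, len(sequence)):
        if genome.endswith(sequence[:i]):
            return genome + sequence[i:]   # extend at the back
        if genome.startswith(sequence[-i:]):
            return sequence[:-i] + genome  # extend at the front
    return None


def assemble_with_retry(seqs, min_overlap=3):
    """Hypothetical helper: repeat passes until a pass merges or drops nothing."""
    genome = seqs[0]
    pending = list(seqs[1:])
    while pending:
        leftover = []
        for sequence in pending:
            if sequence in genome:
                continue  # already covered by the superstring
            merged = merge(genome, sequence, min_overlap)
            if merged is None:
                leftover.append(sequence)  # try again in the next pass
            else:
                genome = merged
        if len(leftover) == len(pending):
            break  # no progress: whatever remains really is unrelated
        pending = leftover
    return genome, pending


# 'GCCGGAATAC' does not fit on the first pass, but does once 'CCTGCCGGAA' is merged.
print(assemble_with_retry(
    ['ATTAGACCTG', 'GCCGGAATAC', 'CCTGCCGGAA', 'AGACCTGCCG']))
```

Whatever is still in the second returned list after the loop stops could not be attached in any order that this greedy approach tried.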

answered 2018-02-13T23:11:24.487