0

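I'm working on the CS50 DNA problem and I'm struggling with the counter. I can't figure out how to code it so that it correctly counts the highest number of times the sequence I'm looking for repeats back to back, with no other sequence in between. For example, if I'm looking for the sequence AAT and the text is AATDHDHDTKSDHAATAAT, the highest count should be two, because the last two sequences are both AAT and there is nothing between them.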

Here is my code:

text="TCTAGTCTAGTCTAGTCTAGTCTAGACTTGTCGCTGACTCCGAGAAGATCCTAACATTAACCAATTCCCCCTAGTCTGAGGCACGGTTACCGATCGGGTTAATGGATCTCTCACCGTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAAACGTGTAACTGTAATAATCCGCCCGAAAAAACTGATCTTAGGGTTGCGGCATCTGCACGTGACAGTGTGCTACTGTTAGATAGAGGGATCAAACGAGGTTGCAAGGATTATATCTCTCCGTGCTCGATAAGACACAGCCGGTTGCGGGCTGCTTCCTCTGGATCCAATGCAGCCGTACGTACACCGTAGAGCAAATTTAGTGGTAAAGGAACTTGCTCAAACACTACGGCTTCGGGCTACTGTTGGCGCCGGTTGGGGATCCCATTCAACGCTGGCCCTTTCGCTATGGTTCGGTGATTTTACACCGAAGCGAACCTTGAACCGTGGATTTCGGGTGTCCTCCGTTTTTAGGTACTGCGTGCAGACATGGGCACCTGCCATAGTGCGATCAGCCAGAATCCATTGTATGGGAGTTGGACTCGTTTGAATTTACCGGAAACCTCATGCTTGGTCTGTAGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGAAACTGGGCGACTTGAAGTCGGCTTGCGTATTAATAGCTCTGCAATGTAACTCGGCCCTTGGCGGCGGGCAGCTTAGTATTGAACCGCGACACACCATAGGTGCGGCAAATATTAAAAGTACGCTCGAACCGGAACCTGTCTCCATGACTGGACGACCAGCCCGGCGTCTTCTACGTAACACAGGGGGCTGTCGAGGTAGGGCGTAGGAACTTCGGGGTCACTACGCCGTAACAGCACCGAATATCATATCATCCAACTTGCTTGGTACATGCCCCGTTCTGTATCAAAAGTTTACGGCCCCGGACATACCTGCTGTCAGTTGAATACCTATGCGAGTCTGAAACACGAATAGTTCAGGCGTGCAAAGACACGCTAAGCACACGCCGCAGGCAGGGGGGGTATTTTATAAGTCGTTTTTTGGAAGGGTAATGTAAACTTATCCCATAATACCCTTTGGCTTCCCCTCACTCGTGCACTTCTCATAATGATACGTCAGGGTGATTGTAGATTCACGCGTCATCAGATTGTCCCTTTCTCGAGTCTTAGTATCTTTCCTAATCCGCTCGACTCTGCGCCATGATCGAATTCCTGACAGGCTACAAGAATAAACTGCCAGCATACTCCTTACCGATTGGCGCCTACTAATTATACGCACATGGGCATCTTCGACGTCTAAACATAGGCTCTTAGTATTCCGTAGGATGTTGAGCCGACAGGAAAGTCAAACGTCGTGGGTGACCGTAGCCTGACTCGCCCGACGCAGGATTCGCTCATATGTGTGAACGGATGCTTATGTAACTTCCTAATTGCAGCGAATGGCAGTTCCGTAGTGAAGGTTCGAAACGTACGGGGTCCGGCCATGGATTAGATCTTTCAGTGCGCTAAACTCTTAACCGCAGATACTTGGCGGACCATCTTCGTG
TTGCTACTATGGTATAGACCAGGCTGTCGAATCTACTTAACACAGGTGAACCCCCAGATCGGCTAGAGCCTTCGAGGCTAGACCTTTAACAATCTTTAGACACTTCCAAATCGCGGCCGGATATGTCTCGTTGGCAGCCGCAGACAAGAGAAGAGGGTCGGCAGTGTCTGCCACGCGTGACCTGTATGATCTTAGCCTTTAAGATCACACTACTGATCACAATCTATTATGATTGCCTTAGCTAACTGAGTGATGCACCCCCACAGGCTGAGAGAAATCTGTAGTTTGACGACACGCCGTCTGGCTAAAAATGTGAATCCGCCGATCCGAGACGGTGGAAGCTTGAGACCAAATGCGGGAAACCAATGACTTCATTACGGAACAAGACATAACGGCGTGAGTTGACGACTGGGATTAACCCTTTCCCGAGTCTGTACTTCTGCTACACAATGAGGATGCGAATTATCTAAGACCTTGTACTACCTAAACTAACCCTGAGGCGGGCATTGAATTCCGGCCATCTTCAGCCCAAAGAAAGACCAAATGTGAGGAAAATGAGGGATCGGTATAAGCTTTTCACGATCTCAAGGTTCACGGCCGCCAGGGCCGTAGTTGGGGCTTCATGCACATTGCCAACCCGGACATCGACAGTCGGTACCGCAGGGGTTCGAGGAATACTCCCAGCTGTGACACCTGGTCGTCGACTGGACCCAGCTGGTGGGCGGCATAGGTAGTTAATACTGAATTAAAGCCGGGAACGTCTCTCTAACTAGAAACCTTGTGATAGGATACACAGACCTAGTGCCCCGACGTTAGCATTTGAATTCATCTATCTTGGCGTCTTTTAGTAGGCCTGGGTCAACTCCGGCGTTGGCCAAAATAACCGATCTGCGTTATGTGGCCACGCATCGAGTGACAGGGTGCATACAAATTGATGGTCAAAGAGTTTAAACAAGACAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCCCCACGCTTCTACATAGCCACACTGGAGCTAGTCCTCGTGTTAAATTTTTCGCTTGTTGCACGGTTATCATCAGAAGTGCCACTGGTATTCCTCTGTAGCTCCCGTATGCCGAAGGTTGCGGCTTAGGTACTGCTTATACACGTCTCTCAAGTTTGTCAGCCGCGTGATCTTTCTGCGGGGATAGGTGATCGTCCCTCGCTCCGGACATTGCATTAAAATTACCTAGTTGATAGGGCGGCGGAGTTGCATACCGGCGTTCAATCGCGGCTCCAGACTGGTTTGAGCTACGCGTCTGCCAGCGTGAAAAAGCTGATTTGTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCCAGGTATTATCATTTGAATCGTATGTTTTCTGCCGTACGTCAACTGCGTCGTCGGGGACTGAAATGGTCTGCCTCCAGACCCTTACCTCCCGATAAGCCATGACTAAGTATGTGAAGGATCACCTGAATTGCTGAAAGTTAACGGTAAGATATCTGAAAGAGCTCATTAGATCCAACACTTATCTACTCAAAAATTCGTCATATTTCGGTGACTTGCTAGAAAGGCTCTTGCACAGTAAGGTTATAGAGAATGCTACCGTTGAAGCACCAGCCGTTGAAGCCCGCCTTTAACCACGCGATATATCCAATTAACCAAGGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGTCG
CCTTGTAATAATTACTTTGGCCCGGATTATAACGAAGGAACTCGCCATGAACTCGCAGCACGTTGTACTGGAACAATCTACTTTTTATAATATAGCGATAACTCCCAGCTTTTATGTGGGTGATATTGTCCTAGCTTTTTAAAGATACCCTCTGGCCCGGTCCAAGTAAGGTCCACATTGCCTGACGTAAGCGTACGGTCAACGGGTGCACCGGTTCCCGCTAAAGCTCGATCCTATTCTTTCAGTCGGGGGGAAATAAACTCGTATACTCTCCACCCACCCGTACGTCCCGGACTAGAATAACTACCGGGTATTTCCGGTTCGTAACACCACGCCATGACGTGTCAACATAAACGCTTCTTTTGAAAGGTGCACATGCAGATTGCACAAGCAGCAGGCACCGCCCTTATCCATATCCTGTTGAGGCCCTCGATCCTAGTGTTCCTTGTTATCAGGATATTTTCTCGCTGTACGTTATTGTCCTTTTCAAATTACAACTGACCGCTTCCTCACCCGCTAAACCCTACCTTACGCACAACCAAGGCCTTGTCCCGGATGAACCCGGCTGCTCCTATGGATAAGCAACCCAGCCCGGCAGTTTACTTCAGGTGTTATCGTCGACTGACACCCTCAGCTTTCTCCCATTACACAGCGAGTATTTTCCGCGTAGCAATGGCAGTGACTTTGAGCGCACACTCAGAAGCCGTTGGAATGGCACCGGGGACGGCCCGATTTAGCCCCGCACACCTCCTGGAATCTTAGATCGCACGGCGATCTCGGTTCAGGCACCAACCCCAAAGAGTGTTTTGAGTTTTTGGTATGGCTCGCCTCAATTATCGGTTTTCGCTGCTCTGTGCCTGTCAACTCGGCTAGCTGTCGTGTTTTGTCGATCAGTGCGTGGACACTCTCGGTCGATGGTCGTGGATGGGACTGTAGTAAGTTTCACCGAAGCAGGAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAACTTCGCTTCATATAACGTAGCCATAGTGCTGTCTGCCATCAATAAGTCTTGCTCAGTGGTGCATACGTCGGGGAGGTTTGTTCCGCCTGGTCAGAACGAGTCTAGGGCGAGCCTATAGGCCAGTCGAGAGCCAAGATTCTATGAAATTAATACGACTACTGGGTGAGAGGTCATACAATTCCCGTGGAATCTGTACCTAAGATATTTCCAGATAGGGATGGCTACTGGTTAAGTTGACAGTGTCTAGATACGTGAGAGCACCTGAGAGGACGCCACGAGTCGGAGCGTGGGCGATCACCCTTCTGAGTCATAAGTCATGTCTATATATCCCTCACTAAAAAGGGCACACGACTATACATGCTTGAGCTTTACGGTCTGGCATGTGGAATGCCCGGAGCAACCCAGTCTTACCATCCTTTACGTACATTTACCGACCCGGCAGTGGCCGGCGCGGAAACCCAGGAGAACGTCGGTCATGATACGCGCCCTCCGCCGAAAGCGTGCTCACACCTCAGGATATCAGCGCTATTACCGGACGTCCCGCGTCCACCATCTAATAATTCAGGTGCTCCTAATAAGTGGGCTGGAGAGCGAGGATTGATATACGTTGAGGAGCTCCGACGGCCCTCTCGTGCGTTTGATGTAGATTGCGTTACCGACGGAGCACGCGTTTGTCAATTTCTGTCTAGGGACGTTTATGTCCTCAATACGAATACCAGGCCTATTTTAGTGTACAAATCACTTAGCAGTCGGAATTGGAAACCTGATGGAAGCGT"
counter=0
length=len(text)
search="AGATC"
tmp=0
for i in range(length):
    if text[i:i + len(search)] == search:
        tmp += 1
        if tmp > counter:
            counter = tmp
    if text[i:i + len(search)] != search:
        tmp = 0


print("done")
print(counter)

2 Answers

1

Try this:

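import re

sequence = "AATDHDHDTKSDHAATAAT"
# each match is one maximal run of back-to-back "AAT" copies
matches = re.findall(r'(?:AAT)+', sequence)
# default='' avoids a ValueError when "AAT" never occurs
largest = max(matches, key=len, default='')
print(len(largest) // len('AAT'))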

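Basically, this finds every substring of your string that consists of one or more back-to-back copies of AAT, and then you pick the longest one. The number of repetitions is the length of that longest substring divided by the length of AAT.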
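To reuse the same idea for any search string, the pattern can be built from a variable. This is only a minimal sketch: the helper name longest_run is hypothetical (it is not part of the answer above), and it assumes the search string is not empty. re.escape keeps any special characters in the search string literal.

```python
import re

def longest_run(sequence, unit):
    """Return the largest number of back-to-back copies of unit in sequence."""
    # (?:...)+ matches one or more consecutive copies of the escaped unit
    pattern = "(?:" + re.escape(unit) + ")+"
    runs = re.findall(pattern, sequence)
    # default="" covers the case where unit never occurs
    longest = max(runs, key=len, default="")
    return len(longest) // len(unit)

print(longest_run("AATDHDHDTKSDHAATAAT", "AAT"))  # 2
```

The run lengths come back as strings, so dividing by len(unit) converts the longest one into a repetition count.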

Answered on 2020-11-15T06:14:06.197
0

First of all, the regex solution is the Pythonic way to solve this problem. However, if you want to fix your code...

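The problem with your code is that your indexing does not take into account that you have just found a match, so it can never recognize consecutive occurrences.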

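Consider the case where you are at the start of a triple match, AATAATAAT. You reach the first A, recognize AAT, and increment tmp. You move on to the next loop iteration, where i now points to the second A. You see that there is no AAT here (the slice is ATA, straddling the first two occurrences), so you have recorded a single instance and you reset tmp to zero. The run is therefore never counted as longer than one.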
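One way to see this is to print the slice examined at each index. The sketch below replays the logic of the question's loop on the short example AATAATAAT; the variable window is added here purely for illustration.

```python
text = "AATAATAAT"
search = "AAT"

tmp = 0
counter = 0
for i in range(len(text)):
    window = text[i:i + len(search)]
    if window == search:
        tmp += 1
        counter = max(counter, tmp)
    else:
        tmp = 0  # the index right after a match never matches, so the run is lost
    print(i, window, tmp)

print(counter)  # 1, although AAT repeats three times in a row
```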

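Instead, you have to skip to the end of the first match and look for a second match starting there. Since your index no longer advances smoothly in steps of 1, you need a while loop.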

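Please learn to use meaningful variable names whenever a variable carries any meaning. i is fine if all it does is manage your loop. As soon as you use it for anything else, give it a real name. Likewise, tmp and counter really need to be replaced.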

snip_size = len(search)
pos = 0      # position in the genetic sequence
rep = 0      # number of consecutive repetitions
max_rep = 0  # longest repetition sequence found

while pos < length:
    if text[pos:pos + snip_size] == search:
        rep += 1
        # update inside the match branch so a run that ends the text is counted
        max_rep = max(max_rep, rep)
        pos += snip_size
    else:
        rep = 0
        pos += 1

print(max_rep, "repetitions found")

Output:

15 repetitions found
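If you want to reuse the loop, it can be wrapped in a function. The sketch below is one possible variant rather than the skip-ahead approach described in this answer: the name max_consecutive_repeats is hypothetical, and it counts the run that starts at every position instead of jumping past each match. That is slower, but it also copes with search strings that overlap with themselves (for example ABA in ABABAABA).

```python
def max_consecutive_repeats(text, search):
    """Return the largest number of consecutive copies of search in text."""
    if not search:
        return 0  # assumption: an empty search string counts as no repeats
    size = len(search)
    best = 0
    for start in range(len(text)):
        rep = 0
        pos = start
        # count back-to-back copies beginning at this start position
        while text[pos:pos + size] == search:
            rep += 1
            pos += size
        best = max(best, rep)
    return best

print(max_consecutive_repeats("AATDHDHDTKSDHAATAAT", "AAT"))  # 2
```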
Answered on 2020-11-15T06:33:52.283