I think my code is far too inefficient. I suspect it has something to do with using strings, but I'm not sure. Here is the code:
genome = FASTAdata[1]
genomeLength = len(genome)
# Hash table holding all the k-mers we will come across
kmers = dict()
# We go through all the possible k-mers by index
for outer in range(genomeLength - 1):
    # Slice ends for lengths 2..20, clamped so a slice never runs past the end of the genome
    for inner in range(outer + 2, min(outer + 20, genomeLength) + 1):
        substring = genome[outer:inner]
        if substring in kmers:  # already on record: increase its count of appearances by 1
            kmers[substring] += 1
        else:
            kmers[substring] = 1  # otherwise record that it's here once
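This searches for all substrings of length up to 20. The code seems to run for a very long time and never terminates, so something must be wrong here. Does using [:] on strings cause a huge overhead? If so, what can I use instead?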
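In case it helps to reproduce the behaviour, here is a minimal, self-contained sketch of the same counting loop that runs on a toy string instead of the real file. The function name `count_kmers` and its `min_len`/`max_len` parameters are placeholders of mine, and I am assuming substring lengths 2 through 20. It still slices the string, so it is not meant as a fix.

```python
from collections import Counter


def count_kmers(genome, min_len=2, max_len=20):
    """Count every substring of length min_len..max_len (hypothetical helper)."""
    counts = Counter()
    n = len(genome)
    for start in range(n - min_len + 1):
        # Clamp the longest slice so it never runs past the end of the string.
        longest = min(max_len, n - start)
        for length in range(min_len, longest + 1):
            counts[genome[start:start + length]] += 1
    return counts


# Toy input standing in for the real genome string.
toy = count_kmers("ACGTACGT", max_len=3)
print(toy["AC"], toy["ACG"], toy["GTA"])
```

Using `Counter` only removes the explicit `if`/`else`; the number of slices and dictionary updates is the same as in my original loop.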
For clarity, the file in question is nearly 200 MB, so it is very large.
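To give a sense of scale, here is a rough back-of-the-envelope sketch of how many slices and dictionary updates the loops would perform. It assumes the genome is about 200 million characters (roughly one byte per base, ignoring FASTA headers and newlines) and substring lengths 2 through 20. The helper name `total_slices` is hypothetical.

```python
def total_slices(genome_length, min_len=2, max_len=20):
    """Number of substrings of length min_len..max_len in a string (hypothetical helper)."""
    # A string of length n contains n - k + 1 substrings of length k.
    return sum(
        max(genome_length - k + 1, 0)
        for k in range(min_len, max_len + 1)
    )


# Assumption: ~200 MB of sequence is about 200 million characters.
approx_length = 200_000_000
print(f"{total_slices(approx_length):,} slices")  # 3,799,999,810 slices
```

So the loop body runs close to 3.8 billion times, and if most of those substrings are distinct the dictionary would also have to hold a very large number of string keys in memory.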