您可能会考虑使用trie、DAWG或数据库。有几个相同的 Python 实现。
以下是一些相对时间供您考虑一组与列表:
import timeit
import random
with open('/usr/share/dict/words','r') as di: # UNIX 250k unique word list
all_words_set={line.strip() for line in di}
all_words_list=list(all_words_set) # slightly faster if this list is sorted...
test_list=[random.choice(all_words_list) for i in range(10000)]
test_set=set(test_list)
def set_f():
count = 0
for word in test_set:
if word in all_words_set:
count+=1
return count
def list_f():
count = 0
for word in test_list:
if word in all_words_list:
count+=1
return count
def mix_f():
# use list for source, set for membership testing
count = 0
for word in test_list:
if word in all_words_set:
count+=1
return count
print "list:", timeit.Timer(list_f).timeit(1),"secs"
print "set:", timeit.Timer(set_f).timeit(1),"secs"
print "mixed:", timeit.Timer(mix_f).timeit(1),"secs"
印刷:
list: 47.4126560688 secs
set: 0.00277495384216 secs
mixed: 0.00166988372803 secs
即,将一组 10000 个单词与一组 250,000 个单词匹配比在相同 250,000 个单词的列表中匹配包含相同 10000 个单词的列表快 17,085 X。使用源列表和成员测试集比单独使用未排序列表快 28,392 X。
对于成员资格测试,列表是 O(n),集合和字典是 O(1) 进行查找。
结论:对 6 亿行文本使用更好的数据结构!