python - Python: How to efficiently check if an item is in a list?

Question

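I have a list of strings (word-like), and as I parse a text I need to check whether each word belongs to the group of words in my current list.

However, my input is very large (about 600 million lines), and according to the Python documentation, checking whether an element belongs to a list is an O(n) operation.

My code looks like this: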

words_in_line = []
for word in line:
    if word in my_list:
        words_in_line.append(word)

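Since this takes far too long (days, actually), I want to improve the part that takes most of the time. I looked at Python's collections module, and more specifically at deque. However, it only gives O(1) access to the head and the tail of the list, not to the middle.

Does anyone have an idea of how to do this in a better way?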

score 19 · Accepted Answer

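You might consider a trie, a DAWG, or a database. Python implementations exist for each of these.

Here are some relative timings to consider for a set versus a list: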

import timeit
import random

with open('/usr/share/dict/words','r') as di:  # UNIX 250k unique word list 
    all_words_set={line.strip() for line in di}

all_words_list=list(all_words_set)    # slightly faster if this list is sorted...      

test_list=[random.choice(all_words_list) for i in range(10000)] 
test_set=set(test_list)

def set_f():
    count = 0
    for word in test_set:
        if word in all_words_set: 
           count+=1
    return count

def list_f():
    count = 0
    for word in test_list:
        if word in all_words_list: 
           count+=1
    return count

def mix_f():
    # use list for source, set for membership testing
    count = 0
    for word in test_list:
        if word in all_words_set: 
           count+=1
    return count    

# Python 3 print calls; each function is timed for a single run
print("list:", timeit.Timer(list_f).timeit(1), "secs")
print("set:", timeit.Timer(set_f).timeit(1), "secs")
print("mixed:", timeit.Timer(mix_f).timeit(1), "secs")

Prints:

list: 47.4126560688 secs
set: 0.00277495384216 secs
mixed: 0.00166988372803 secs

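That is, matching a set of 10,000 words against a set of 250,000 words is 17,085 times faster than matching a list of the same 10,000 words against a list of the same 250,000 words. Using a list for the source and a set for membership testing is 28,392 times faster than using an unsorted list alone.

For membership testing, a list is O(n), while sets and dicts are O(1) on average for lookups.

Conclusion: use a better data structure for 600 million lines of text!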
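Applied to the loop in the question, the change could be as small as building a set once and testing against it. This is a minimal sketch: `my_list` and the sample line are made-up stand-ins for the asker's data, and `my_set` is a name introduced here for illustration.

```python
# Hypothetical stand-ins for the asker's word list and one parsed line.
my_list = ["apple", "banana", "cherry"]
line = "apple pie with banana and kiwi".split()

# Build the set once, outside the parsing loop (an O(n) one-time cost).
my_set = set(my_list)

words_in_line = []
for word in line:
    # Average-case O(1) hash lookup instead of an O(n) list scan.
    if word in my_set:
        words_in_line.append(word)

print(words_in_line)  # ['apple', 'banana']
```

The set should be created before the 600 million lines are read; rebuilding it per line would throw away the gain.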

score 1 · Accepted Answer

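It's not clear to me why you chose a list in the first place, but here are some alternatives:

Using a set() is probably a good idea. It is very fast, though unordered, and sometimes that's exactly what's needed.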

If you need things ordered and also want arbitrary lookups, you could use a tree of some sort: http://stromberg.dnsalias.org/~strombrg/python-tree-and-heap-comparison/
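When the word list does not change during parsing, a sorted list with the standard-library `bisect` module gives ordered data and O(log n) lookups without a tree. Note that this is a swapped-in technique, not one of the tree implementations compared at the link above; `contains_sorted` and the sample words are hypothetical.

```python
import bisect

def contains_sorted(sorted_words, word):
    """Binary-search membership test on an already sorted list."""
    i = bisect.bisect_left(sorted_words, word)
    return i < len(sorted_words) and sorted_words[i] == word

words = sorted(["pear", "apple", "fig", "banana"])

print(contains_sorted(words, "fig"))   # True
print(contains_sorted(words, "kiwi"))  # False

# Ordering also allows range queries: every word w with "b" <= w < "g".
lo = bisect.bisect_left(words, "b")
hi = bisect.bisect_left(words, "g")
print(words[lo:hi])  # ['banana', 'fig']
```

Inserting into a sorted list is O(n), so a real tree is the better fit if words are added frequently while parsing.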

If set membership testing with a small number of false positives here or there is acceptable, you might look into a bloom filter: http://stromberg.dnsalias.org/~strombrg/drs-bloom-filter/
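To show the idea, here is a toy bloom filter built only on `hashlib`. It is a sketch under assumptions, not the API of the library linked above: the class name, the bit-array size, and the number of hashes are all arbitrary choices made for illustration.

```python
import hashlib

class BloomFilter:
    """Toy bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed byte.
        data = item.encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "probably present"; False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bloom = BloomFilter()
for word in ["apple", "banana", "cherry"]:
    bloom.add(word)

print("banana" in bloom)  # True: added items are always reported present
```

The appeal for very large vocabularies is memory: the filter stores a fixed number of bits regardless of word length, at the cost of rare false positives.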

Depending on what you're doing, a trie might also work very well.
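A trie can be sketched with nested dicts in a few lines. This is an illustrative version only, not a tuned library; the `Trie` class and its method names are assumptions introduced here.

```python
class Trie:
    """Minimal trie built from nested dicts, one level per character."""

    _END = object()  # sentinel key marking the end of a complete word

    def __init__(self, words=()):
        self.root = {}
        for word in words:
            self.add(word)

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[self._END] = True

    def _walk(self, text):
        # Follow text character by character; return None if the path breaks.
        node = self.root
        for ch in text:
            node = node.get(ch)
            if node is None:
                return None
        return node

    def __contains__(self, word):
        node = self._walk(word)
        return node is not None and self._END in node

    def has_prefix(self, prefix):
        return self._walk(prefix) is not None

trie = Trie(["car", "cart", "dog"])
print("car" in trie)          # True
print("ca" in trie)           # False: a prefix, not a stored word
print(trie.has_prefix("ca"))  # True
```

For plain exact-match membership a set is usually simpler; a trie earns its keep when prefix queries or shared-prefix storage matter.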

score 0 · Accepted Answer

This uses a list comprehension:

words_in_line = [word for word in line if word in my_list]

This will be more efficient than the code you posted, though it's hard to know how much more for your huge data set.

score 0 · Accepted Answer

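There are two improvements you can make here.

1. Back your word list with a hash table. This gives you O(1) performance when you check whether a word is in your word list. There are a number of ways to do this; the most fitting in this scenario is to convert your list to a set.
2. Use a more appropriate structure for your collection of matching words.
- If you need to store all of the matches in memory at the same time, use a deque, since its append performance is superior to that of lists.
- If you don't need all of the matches in memory at once, consider using a generator. A generator iterates over the matched values according to the logic you specify, but it only stores part of the resulting list in memory at a time. It may offer better performance if you are experiencing I/O bottlenecks.

Below is an example implementation based on my suggestions (opting for a generator, since I can't imagine you need all of those words at once).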

from itertools import chain
d = set(['a','b','c']) # Load our dictionary
f = open('c:\\input.txt','r')
# Build a generator to get the words in the file
all_words_generator = chain.from_iterable(line.split() for line in f)
# Build a generator to filter out the non-dictionary words
matching_words_generator = (word for word in all_words_generator if word in d)
for matched_word in matching_words_generator:
    # Do something with matched_word
    print(matched_word)
# We're reading the file during the above loop, so don't close it too early
f.close()

input.txt

a b dog cat
c dog poop
maybe b cat
dog

Output

a
b
c
b
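For the other option in this answer, keeping every match in memory, a `collections.deque` version might look like the sketch below. The contents of input.txt are inlined as a hypothetical `lines` list so the example runs without a file; `vocabulary` and `matches` are names introduced here.

```python
from collections import deque

vocabulary = {"a", "b", "c"}  # the dictionary, stored as a set

# Hypothetical stand-in for the lines of input.txt.
lines = ["a b dog cat", "c dog poop", "maybe b cat", "dog"]

matches = deque()
for line in lines:
    for word in line.split():
        if word in vocabulary:
            matches.append(word)  # O(1) append at the right end

print(list(matches))  # ['a', 'b', 'c', 'b']
```

Unlike the generator version, this holds every match at once, so it only makes sense when the full collection is needed afterwards.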
