performance - 检查一个巨大列表中的流媒体项目

Question

这是我们试图解决的问题。我们正在处理大量项目的大量流数据。我们还有一个预定义的项目列表。我们需要检查流中的每个项目是否属于我的预定义列表，该列表非常庞大（大约 400 万个项目）。查找/检查操作应尽可能高效

如果这里的人们可以帮助我指出我可以阅读的论文/算法以正确的方式解决这个问题，那就太好了。

谢谢，

score 0 · Accepted Answer

在找到解决方案之前，您可能希望缩小一些假设范围

您的预定义列表是否适合主内存？
访问模式会是什么样子？大多数流式传输的项目是相同的类型，还是所有类型都被平等地表示？
您的物品是否很小（整数/短字符串）？如果没有，是否每个都附有唯一标识符？
预定义列表是静态的还是会改变？多频繁？

一般来说，您需要维护对象的哈希表（或表示这些对象的唯一键）并在每个对象通过您的流进入时查找它。哈希表提供快速查找，如果您的数据集是静态的，它们非常适合您描述的用例。但是，在某些情况下，其他解决方案可能会表现得更好，上述问题应该揭示是否是这种情况。

为了进一步阅读，我将向您介绍 Wikipedia 上的Data Structures和Big-O notation 文章

编辑：我编写了这个快速程序来测量 python 中的哈希查找性能：

#!/usr/bin/python

import random
import string
import time

# return a random username, all lowercase, between n and m characters long
def generate_username(min_length = 5, max_length = 10):
    n = random.randint(min_length, max_length)
    return ''.join(random.sample(string.ascii_lowercase, n))

# Build a hash table of 4mil usernames (the 'predefined items')
users = set()
for i in xrange(4000000):
    users.add(generate_username())

# Build a 'stream' of usernames to check against the hash table
stream = []
for i in xrange(10000000):
    stream.append(generate_username())

# Measure performance of hash lookups for 10mil usernames
start = time.clock()
for name in stream:
    if name in users:
        pass #print "%s is present" % name
end = time.clock()

print "%fs (%f - %f)" % (end - start, start, end)

结果：

3.660000s (238.100000 - 241.760000)

因此，在 Python 中，您可以在 4 秒内检查 1000 万个用户名，相当于流式传输 >17MB/s。你真的需要多快？:)

performance - 检查一个巨大列表中的流媒体项目

1 回答 1

Related

Reference