python - 在文本中搜索一长串子字符串

Question

问题：给定一组约 250000 个整数用户 ID，以及大约 1 TB 的 JSON 格式的单行记录，将用户 ID 匹配的记录加载到数据库。

只有大约 1% 的记录与 250000 个用户 ID 匹配。我尝试使用字符串匹配来确定用户 ID 是否在原始 JSON 中，而不是 JSON 解码每条记录，这需要很长时间；如果匹配，则解码 JSON 并检查记录然后插入。

问题在于将一个原始 JSON 字符串与包含约 250k 字符串条目的集合进行匹配很慢。

这是到目前为止的代码：

// get the list of integer user IDs
cur.execute('select distinct user_id from users')

// load them as text into a set
users = set([])
for result in cur.fetchall():
    users.add(str(result[0]))

// start working on f, the one-json-record-per-line text file
for line in f:
    scanned += 1
    if any(user in line for user in users):
        print "got one!"
        // decode json
        // check for correct decoded user ID match
        // do insert

我正在以正确的方式接近这个吗？匹配这些字符串的更快方法是什么？目前，在查找这么多用户 ID 时，这在 3ghz 机器上每秒管理约 2 个条目（不太好）。当用户 ID 列表非常短时，它管理约 200000 个条目/秒。

score 3 · Accepted Answer

Aho-Corasick似乎是为此目的而建造的。甚至还有一个方便的 Python 模块（easy_install ahocorasick）。

import ahocorasick

# build a match structure
print 'init empty tree'
tree = ahocorasick.KeywordTree()

cur.execute('select distinct user_id from users')

print 'add usernames to tree'
for result in cur.fetchall():
   tree.add(str(result[0]))

print 'build fsa'
tree.make()

for line in f:
     scanned += 1
     if tree.search(line) != None:
         print "got one!"

这接近每秒约 450 个条目。

score 0 · Accepted Answer

尝试反转匹配算法：

for digit_sequence in re.findall('[0-9]+', line):
    if digit_sequence in users:
        ...

score 0 · Accepted Answer

我是 C++ 自由职业者，我的客户通常是一些初创公司，它们有一些缓慢的 python/java/.net/etc 代码，他们希望它运行得更快。我通常可以让它快 100 倍。就在最近，我有类似的问题任务：在 TB 的文本数据中实现 500 万个子字符串的搜索。

我测试了几种算法。对于 Aho–Corasick，我使用了开源http://sourceforge.net/projects/multifast/。这不是最快的算法。最快的是我的算法，它是我从混合哈希表和从 Rabin–Karp 搜索算法中提取的一些想法中炮制出来的。这个简单的算法比 AC 快了 5 倍，使用的内存也少了 5 倍。平均子字符串长度为 32 个字节。因此，AC 可能不是最快的算法。

python - 在文本中搜索一长串子字符串

3 回答 3

Related

Reference