python - SQLite with Python is unreasonably slow compared to anydbm

Question

My goal is to build a database of MD5 hashes and then check another list of hashes to see whether they exist in that database.

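I have a working solution that uses anydbm. If you are not familiar with it, it works exactly like a dictionary in Python, except that you do not have to load the whole thing into memory at once. Creating the database is quite slow (about two and a half hours for 100 million entries), and retrieval is moderately fast (1 to 2 seconds for 100,000 entries). Checking whether a hash exists is as simple as if hash in dbm.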
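For readers who have not used it, the dictionary-style workflow described above might look roughly like the sketch below. It assumes Python 3, where anydbm was renamed to dbm, and it picks the dbm.dumb backend only because that one ships with every Python build. The temporary path, the 100 sample hashes, and the variable names are made up for illustration.

```python
import dbm.dumb
import hashlib
import os
import tempfile

# Hypothetical location for the database files (illustration only).
path = os.path.join(tempfile.mkdtemp(), "hashes")

# Build the database: each MD5 hex digest is a key, the value is a dummy "1".
with dbm.dumb.open(path, "c") as db:
    for i in range(100):
        digest = hashlib.md5(str(i).encode()).hexdigest()
        db[digest] = "1"

# Membership test, the equivalent of "if hash in dbm" from the question.
with dbm.dumb.open(path, "r") as db:
    known = hashlib.md5(b"42").hexdigest()
    found_known = known in db
    found_unknown = "not-a-real-hash" in db
```

The key lookup goes through the backend's own index, which is why membership checks stay fast without loading every entry into memory.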

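To improve performance, I tried to build a working solution with sqlite. Creation was extremely fast: all 100 million entries were created in a few minutes. But retrieving a single entry takes more than 15 seconds. That is unreasonable!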

I am not an SQL professional, so I am wondering whether I am just using overly complicated commands.

They are as follows:

Table creation: c.execute('''create table keys(id integer not null primary key autoincrement, hash text, alert text) ''')

Adding entries (in a loop): c.execute('''insert into keys(hash, alert) values (?,?) ''', (hash, "1"))

Retrieval (also in a loop):

c.execute('''select * from keys where hash = ? ''', (hash,))
hits = c.fetchall() 
numhits += len(hits)

score 4 · Accepted Answer

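You have to create an index on the hash column of your table, whether or not the hashes are unique. Without an index, every lookup is a linear scan through all of the records.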

Looking at the documentation, creating the index looks as simple as issuing CREATE INDEX hash ON keys (hash) on the database.
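A minimal sketch of that fix with Python's sqlite3 module follows. It reuses the keys table from the question. The in-memory database, the 1000 sample hashes, the index name keys_hash_idx, and the count_hits helper are assumptions made for illustration. EXPLAIN QUERY PLAN is used only to check that the lookup goes through the index rather than scanning the table.

```python
import hashlib
import sqlite3

# In-memory database for illustration; the real one would live in a file.
conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute(
    "create table keys("
    "id integer not null primary key autoincrement, hash text, alert text)"
)

# Sample data standing in for the 100 million hashes in the question.
hashes = [hashlib.md5(str(i).encode()).hexdigest() for i in range(1000)]
c.executemany(
    "insert into keys(hash, alert) values (?, ?)",
    [(h, "1") for h in hashes],
)

# The fix: index the column used in the WHERE clause.
# keys_hash_idx is a hypothetical name; any valid index name works.
c.execute("create index if not exists keys_hash_idx on keys(hash)")
conn.commit()

# Ask SQLite how it plans to run the lookup.
plan = c.execute(
    "explain query plan select * from keys where hash = ?", (hashes[0],)
).fetchall()
plan_text = " ".join(str(row[-1]) for row in plan)


def count_hits(cursor, candidates):
    """Count how many candidate hashes are present in the keys table."""
    numhits = 0
    for h in candidates:
        cursor.execute("select 1 from keys where hash = ? limit 1", (h,))
        if cursor.fetchone() is not None:
            numhits += 1
    return numhits


numhits = count_hits(c, hashes[:10] + ["not-a-real-hash"])
```

Unlike the loop in the question, count_hits counts each candidate at most once even if the table contains duplicate hashes. Building the index after the bulk insert, as done here, is usually cheaper than maintaining it during every insert.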
