python - 大字典的搁置太慢，我可以做些什么来提高性能？

Question

我正在使用 python 存储一个表，我需要持久性。

本质上，我将表格存储为数字的字典字符串。整个用搁板存放

self.DB=shelve.open("%s%sMoleculeLibrary.shelve"%(directory,os.sep),writeback=True)

我习惯writeback了，True因为我发现如果我不这样做，系统往往会不稳定。

计算完成后，系统需要关闭数据库，并将其存储回来。现在数据库（表）大约是 540MB，而且需要很长时间。表增长到大约 500MB 后，时间爆炸式增长。但我需要一张更大的桌子。事实上，我需要其中的两个。

我可能使用了错误的持久性形式。我可以做些什么来提高性能？

score 14 · Accepted Answer

对于存储大型string : number键值对字典，我建议使用 JSON 原生存储解决方案，例如MongoDB。它有一个很棒的 Python API，Pymongo。MongoDB 本身是轻量级的，而且速度非常快，而且 json 对象本身就是 Python 中的字典。这意味着您可以使用您的string密钥作为对象 ID，从而实现压缩存储和快速查找。

作为代码简单程度的示例，请参见以下内容：

d = {'string1' : 1, 'string2' : 2, 'string3' : 3}
from pymongo import Connection
conn = Connection()
db = conn['example-database']
collection = db['example-collection']
for string, num in d.items():
    collection.save({'_id' : string, 'value' : num})
# testing
newD = {}
for obj in collection.find():
    newD[obj['_id']] = obj['value']
print newD
# output is: {u'string2': 2, u'string3': 3, u'string1': 1}

您只需要从 unicode 转换回来，这很简单。

score 11 · Accepted Answer

Based on my experience, I would recommend using SQLite3, which comes with Python. It works well with larger databases and key numbers. Millions of keys and gigabytes of data is not a problem. Shelve is totally wasted at that point. Also having separate db-process isn't beneficial, it just requires more context swaps. In my tests I found out that SQLite3 was the preferred option to use, when handling larger data sets locally. Running local database engine like mongo, mysql or postgresql doesn't provide any additional value and also were slower.

score 3 · Accepted Answer

我认为您的问题是由于您使用writeback=True. 文档说（重点是我的）：

由于 Python 语义，架子无法知道何时修改了可变的持久字典条目。默认情况下，修改的对象仅在分配到架子时才写入（参见示例）。如果可选的 writeback 参数设置为 True，则所有访问的条目也会缓存在内存中，并在 sync() 和 close() 上写回；这可以更方便地改变持久字典中的可变条目，但是，如果访问了许多条目，它会消耗大量内存用于缓存，并且由于所有访问的条目都被写回，它会使关闭操作非常慢（无法确定哪些访问的条目是可变的，也无法确定哪些条目实际上是突变的）。

您可以避免使用writeback=True并确保数据只写入一次（您必须注意后续修改将丢失）。

如果您认为这不是正确的存储选项（不知道数据的结构很难说），我建议使用 sqlite3，它集成在 python 中（因此非常便携）并且具有非常好的性能。它比简单的键值存储要复杂一些。

请参阅其他答案以获取替代方案。

score 1 · Accepted Answer

大多少？访问模式是什么？您需要对其进行哪些类型的计算？

请记住，如果您无论如何都无法将表保存在内存中，那么您将遇到一些性能限制。

你可能想看看去 SQLAlchemy，或者直接使用类似的东西bsddb，但这两者都会牺牲代码的简单性。但是，使用 SQL，您可以根据工作负载将一些工作卸载到数据库层。

python - 大字典的搁置太慢，我可以做些什么来提高性能？

4 回答 4

Related

Reference