python - efficient way to hold and process a big dict in memory in python

Question

As a quick test showed, a Python dict of int => int (with distinct values) holding 30 million items can easily eat more than 2 GB of memory on my Mac. Since I only work with int-to-int dicts, is there a better solution than a Python dict?

Some requirements I have are:

1. more memory efficient at holding tens of millions of int-to-int items
2. basic dict methods such as fetching a value by key and iterating over all items
3. easy serialisation to string / binary would be a plus

Update: 4. easy to get a subset by given keys, like d.fromkeys([...])

Thanks.

score 9 · Accepted Answer

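There are at least two possibilities:

arrays

You could try using two arrays, one for the keys and one for the values, so that index(key) == index(value).

Updated 2017-01-05: use 4-byte integers in the array.

An array uses much less memory. On a 64-bit FreeBSD machine with Python compiled with clang, an array of 30 million integers uses about 117 MiB.

These are the Python commands I used: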

Python 2.7.13 (default, Dec 28 2016, 20:51:25) 
[GCC 4.2.1 Compatible FreeBSD Clang 3.8.0 (tags/RELEASE_380/final 262564)] on freebsd11
Type "help", "copyright", "credits" or "license" for more information.
>>> from array import array
>>> a = array('i', xrange(30000000))
>>> a.itemsize
4

After importing array, ps reports:

USER     PID %CPU %MEM   VSZ  RSS TT  STAT STARTED    TIME COMMAND
 rsmith 81023  0.0  0.2  35480   8100  0  I+   20:35     0:00.03 python (python2.7)

After creating the array:

USER     PID %CPU %MEM    VSZ    RSS TT  STAT STARTED    TIME COMMAND
rsmith 81023 29.0  3.1 168600 128776  0  S+   20:35     0:04.52 python (python2.7)

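The resident set size (RSS) is reported in units of 1 KiB, so (128776 - 8100)/1024 = 117 MiB.

With a list comprehension you can easily get a list of the indices where the key meets a certain condition. You can then use the indices in that list to access the corresponding values...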
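As a rough illustration of the two-array idea, here is a minimal sketch. The sample data and the names keys, values, indices and selected are made up for the example, and it is written for Python 3, where the array method is tobytes() (Python 2 called it tostring()).

```python
from array import array

# Hypothetical sample data: keys[i] maps to values[i].
keys = array('i', [10, 20, 30, 40, 50])
values = array('i', [100, 200, 300, 400, 500])

# Indices of all keys that satisfy a condition (here: key > 25).
indices = [i for i, k in enumerate(keys) if k > 25]

# Use those indices to fetch the matching values.
selected = [values[i] for i in indices]
print(indices)   # [2, 3, 4]
print(selected)  # [300, 400, 500]

# Arrays convert to raw bytes and back, which covers simple serialisation.
blob = keys.tobytes()
restored = array('i')
restored.frombytes(blob)
```

Note that the list comprehension scans every key, so this suits filtering by a condition rather than fast lookup of a single key.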

numpy

If you have numpy available, it is faster, has many more features, and uses slightly less RAM:

Python 2.7.5 (default, Jun 10 2013, 19:54:11) 
[GCC 4.2.1 Compatible FreeBSD Clang 3.1 ((branches/release_31 156863))] on freebsd9
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> a = np.arange(0, 30000000, dtype=np.int32)

From ps: 6700 KiB after starting Python, 17400 KiB after importing numpy, and 134824 KiB after creating the array. That is about 114 MiB for the array.

Furthermore, numpy supports record arrays;

Python 2.7.5 (default, Jun 10 2013, 19:54:11) 
[GCC 4.2.1 Compatible FreeBSD Clang 3.1 ((branches/release_31 156863))] on freebsd9
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> a = np.zeros((10,), dtype=('i4,i4'))
>>> a
array([(0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0),
       (0, 0), (0, 0)], 
      dtype=[('f0', '<i4'), ('f1', '<i4')])
>>> a.dtype.names
('f0', 'f1')
>>> a.dtype.names = ('key', 'value')
>>> a
array([(0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0), (0, 0),
       (0, 0), (0, 0)], 
      dtype=[('key', '<i4'), ('value', '<i4')])
>>> a[3] = (12, 5429)
>>> a
array([(0, 0), (0, 0), (0, 0), (12, 5429), (0, 0), (0, 0), (0, 0), (0, 0),
       (0, 0), (0, 0)], 
      dtype=[('key', '<i4'), ('value', '<i4')])
>>> a[3]['key']
12

Here you can access the keys and values separately;

>>> a['key']
array([ 0,  0,  0, 12,  0,  0,  0,  0,  0,  0], dtype=int32)
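Building on the record array, the "subset by given keys" requirement from the question could be handled with a boolean mask. This is only a sketch: the sample data and the wanted list are invented for the example, and it assumes a numpy version that provides np.isin (1.13 or later).

```python
import numpy as np

# Hypothetical sample data in a record array with 'key' and 'value' fields.
a = np.zeros(5, dtype=[('key', 'i4'), ('value', 'i4')])
a['key'] = [12, 7, 3, 25, 18]
a['value'] = [5429, 70, 30, 250, 180]

# Boolean mask marking the records whose key is one of the wanted keys.
wanted = [3, 18, 99]  # 99 is not present, so it simply matches nothing
mask = np.isin(a['key'], wanted)

# Indexing with the mask gives the matching records in their original order.
subset = a[mask]
print(subset['key'].tolist())    # [3, 18]
print(subset['value'].tolist())  # [30, 180]
```

The result is itself a record array, so the same key/value field access works on the subset.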

score 3 · Accepted Answer

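A Judy-array based solution seems to be the option I should look into. I'm still looking for a good implementation that can be used from Python. Will update later.

Update,

Finally, I'm experimenting with a Judy array wrapper at http://code.google.com/p/py-judy/. There doesn't seem to be any documentation there, so I found its methods simply by calling dir(...) on its package and objects, and it works.

In the same experiment, judy.JudyIntObjectMap eats ~986MB, about 1/3 of what the standard dict uses. The package also provides JudyIntSet, which in some special cases will save much more memory because, unlike JudyIntObjectMap, it does not need to reference any real Python objects as values.

(As tested further below, the JudyArray itself uses only a few MB to a few tens of MB; most of the ~986MB is actually used by the value objects in Python's memory space.)

Here is some code, in case it helps you,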

>>> import judy
>>> dir(judy)
['JudyIntObjectMap', 'JudyIntSet', '__doc__', '__file__', '__name__', '__package__']
>>> a=judy.JudyIntObjectMap()
>>> dir(a)
['__class__', '__contains__', '__delattr__', '__delitem__', '__doc__', '__format__', '__getattribute__', '__getitem__', '__hash__', '__init__', '__iter__', '__len__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', '__value_sizeof__', 'by_index', 'clear', 'get', 'iteritems', 'iterkeys', 'itervalues', 'pop']
>>> a[100]=1
>>> a[100]="str"
>>> a["str"]="str"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'non-integer keys not supported'
>>> for i in xrange(30000000):
...     a[i]=i+30000000   #finally eats ~986MB memory
...

Update,

OK, I tested a JudyIntSet holding 30M ints.

>>> a=judy.JudyIntSet()
>>> a.add(1111111111111111111111111)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: we only support integers in the range [0, 2**64-1]

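In total it uses only 5.7MB to store the 30M sequential ints [0,30000000), which may be due to JudyArray's automatic compression. Usage of over 709MB is because I used range(...) instead of the more appropriate xrange(...) to generate the data.

So the size of the core JudyArray holding 30M ints is simply negligible.

If anyone knows of a more complete Judy Array wrapper implementation, please let me know, since this wrapper only wraps JudyIntObjectMap and JudyIntSet. For an int-int dict, JudyIntObjectMap still requires real Python objects as values. If we only do counter_add and set on the values, it would be a good idea to store the int values in C space rather than as Python objects. Hopefully somebody is interested in creating or pointing out one :)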

score 2 · Accepted Answer

Adding another answer in case all you want is an easy-to-use, dict-like counter.

The high-performance Counter object from the Python standard library
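For reference, basic use of collections.Counter looks roughly like this (the input list is made up for the example). Keep in mind that Counter is a dict subclass, so it offers convenience rather than the memory savings the question asks for.

```python
from collections import Counter

counts = Counter()
for key in [3, 7, 3, 3, 7, 42]:  # hypothetical stream of int keys
    counts[key] += 1

print(counts[3])              # 3
print(counts[1000])           # 0, missing keys count as zero
print(counts.most_common(1))  # [(3, 3)]
```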

score 1 · Accepted Answer

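It would be easier to suggest a good solution if we knew a bit more about how it will be used. You say you want to fetch values by key and iterate over all of them, but say nothing about whether you need to insert or delete data.

One very efficient way to store the data is with the array module. If you don't need to insert or delete data, you could simply have two arrays. The "key" array would be sorted, so you can do a binary search for the right key. Then you just pick the value from the same position in the other array.

You could easily encapsulate this in a class that behaves like a dict. I don't know whether there is a ready-made solution for this somewhere, but it should not be terribly difficult to implement. That should help you avoid having lots of memory-hungry Python objects.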
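A minimal sketch of such a class, using the standard array and bisect modules, might look like the following. The class name SortedIntMap and its methods are hypothetical, and it assumes unique keys, values that fit the array's 'i' type code, and no inserts or deletes after construction.

```python
from array import array
from bisect import bisect_left


class SortedIntMap:
    """Read-only int -> int mapping backed by two parallel arrays."""

    def __init__(self, pairs):
        # Sort by key once so lookups can use binary search.
        # Note: sorted() builds a temporary list of tuples; for tens of
        # millions of items you would want to fill the arrays directly.
        ordered = sorted(pairs)
        self._keys = array('i', (k for k, _ in ordered))
        self._values = array('i', (v for _, v in ordered))

    def _find(self, key):
        # Return the position of key in the sorted key array, or -1.
        i = bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return i
        return -1

    def __getitem__(self, key):
        i = self._find(key)
        if i < 0:
            raise KeyError(key)
        return self._values[i]

    def __contains__(self, key):
        return self._find(key) >= 0

    def __len__(self):
        return len(self._keys)

    def items(self):
        # Iterate over (key, value) pairs in key order.
        return zip(self._keys, self._values)


m = SortedIntMap([(30, 3), (10, 1), (20, 2)])
print(m[20])            # 2
print(15 in m)          # False
print(list(m.items()))  # [(10, 1), (20, 2), (30, 3)]
```

Lookups cost O(log n) instead of a dict's O(1) on average, which is the price paid for storing only the raw integers.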

But you may have other requirements that make this kind of solution impractical or impossible.
