
I am doing machine learning with very high-dimensional vectors and was considering using numpy to reduce the amount of memory used. I ran a quick test to see how much memory numpy (1)(3) would save:

Standard list

import random
random.seed(0)
vector = [random.random() for i in xrange(2**27)]

numpy array

import numpy
import random
random.seed(0)
vector = numpy.fromiter((random.random() for i in xrange(2**27)), dtype=float)

Memory usage (2)

Numpy array: 1054 MB
Standard list: 2594 MB

Just as I expected.

By allocating one contiguous block of memory filled with native floats, numpy consumes only about half the memory that the standard list is using.
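The arithmetic works out: 2**27 float64 values × 8 bytes = exactly 1024 MiB, which matches the 1054 MB reported for the numpy array (the remainder is interpreter overhead). The per-element costs can be sketched with a much smaller vector; this is a rough illustration in Python 3 syntax, not a reproduction of the htop measurement:

```python
import sys
import numpy

n = 1000
lst = [float(i) for i in range(n)]
arr = numpy.arange(n, dtype=float)

# numpy stores raw 8-byte doubles contiguously:
print(arr.itemsize)   # 8 bytes per element
print(arr.nbytes)     # 8000 bytes of data in total

# The list stores one pointer per element in its own buffer,
# plus a separate Python float object on the heap for each value.
list_total = sys.getsizeof(lst) + n * sys.getsizeof(1.5)
print(list_total)     # considerably more than arr.nbytes
```

Note that sys.getsizeof only counts the list's pointer buffer, so the float objects must be added separately; the exact numbers depend on the platform's pointer width.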

Since I know my data is very sparse, I ran the same test with sparse data.

Standard list

import random
random.seed(0)
vector = [random.random() if random.random() < 0.00001 else 0.0 for i in xrange(2 ** 27)]

numpy array

import numpy
import random
random.seed(0)
vector = numpy.fromiter((random.random() if random.random() < 0.00001 else 0.0 for i in xrange(2 ** 27)), dtype=float)

Memory usage (2)

Numpy array: 1054 MB
Standard list: 529 MB

Now, all of a sudden, the Python list uses half the amount of memory the numpy array uses! Why?

One thing I could think of is that Python dynamically switches to a dict representation when it detects that a list contains very sparse data. Checking for this could add a lot of extra run-time overhead, so I don't really think that this is what's going on.

Notes

  1. I started a fresh new python shell for every test.
  2. Memory measured with htop.
  3. Run on 32-bit Debian.

1 Answer


A Python list is just an array of references (pointers) to Python objects. In CPython (the usual Python implementation) a list gets slightly over-allocated to make expansion more efficient, but it never gets converted to a dict. See the source code for further details: List object implementation
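Both of those properties are easy to observe in CPython; a small sketch in Python 3 syntax:

```python
import sys

lst = []
sizes = []
for i in range(20):
    lst.append(0.0)
    sizes.append(sys.getsizeof(lst))

# The size plateaus and then jumps: CPython grows the pointer
# array in chunks (over-allocation) rather than on every append.
print(sizes)

# The object stays a plain list no matter what it contains;
# it is never silently converted to a dict.
print(type(lst) is list)  # True
```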

In the sparse version of the list, you have a lot of pointers to a single shared float 0.0 object: the literal 0.0 in the comprehension is a compile-time constant, so every zero element references the same object. Those pointers take up 32 bits = 4 bytes each on your 32-bit system, but your numpy floats are 64 bits = 8 bytes each, since dtype=float maps to float64.
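This also explains the measured numbers: 2**27 pointers × 4 bytes = 512 MiB, close to the 529 MB reported for the sparse list, while the numpy array still needs 2**27 × 8 bytes = 1024 MiB regardless of sparsity. The object sharing itself can be checked directly in CPython (Python 3 syntax):

```python
# The literal 0.0 in a comprehension is a compile-time constant,
# so in CPython every zero element points at the same float object.
vec = [0.0 for i in range(1000)]
print(len({id(x) for x in vec}))  # 1: a thousand pointers, one object
```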

FWIW, to make the sparse list / array tests more accurate you should call random.seed(some_const) with the same seed in both versions so that you get the same number of zeroes in both the Python list and the numpy array.
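As a sketch of why the seed matters: with the same seed, both runs draw the same random sequence and therefore place nonzeros at the same positions (sparse_mask is a hypothetical helper, Python 3 syntax):

```python
import random

def sparse_mask(seed, n=100000, p=0.001):
    # Positions that would receive a nonzero value under this seed.
    random.seed(seed)
    return [random.random() < p for _ in range(n)]

# Same seed -> identical sparsity pattern in both test versions.
print(sparse_mask(0) == sparse_mask(0))  # True
```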

Answered 2015-04-14T13:23:48.213