python - When does pandas (pandas.pydata.org) throw a MemoryError on df.sortlevel(k)?

Question

I have a fairly large dataset of shape (2678271, 52) with a 5-level index; together they consume 6.5% of the machine's memory. When I call

df.sortlevel(k)

I get the following error:



MemoryError                               Traceback (most recent call last)
 in ()
----> 1 df = df.sortlevel(4)

/usr/local/lib/python2.7/dist-packages/pandas-0.9.1-py2.7-linux-x86_64.egg/pandas/core/frame.pyc in sortlevel(self, level, axis, ascending)
   2978             raise Exception('can only sort by level with a hierarchical index')
   2979 
-> 2980         new_axis, indexer = the_axis.sortlevel(level, ascending=ascending)
   2981 
   2982         if self._data.is_mixed_dtype():

/usr/local/lib/python2.7/dist-packages/pandas-0.9.1-py2.7-linux-x86_64.egg/pandas/core/index.pyc in sortlevel(self, level, ascending)
   1856         indexer = _indexer_from_factorized((primary,) + tuple(labels),
   1857                                            (primshp,) + tuple(shape),
-> 1858                                            compress=False)
   1859         if not ascending:
   1860             indexer = indexer[::-1]

/usr/local/lib/python2.7/dist-packages/pandas-0.9.1-py2.7-linux-x86_64.egg/pandas/core/groupby.pyc in _indexer_from_factorized(labels, shape, compress)
   2124         max_group = np.prod(shape)
   2125 
-> 2126     indexer, _ = lib.groupsort_indexer(comp_ids.astype(np.int64), max_group)
   2127 
   2128     return indexer

/usr/local/lib/python2.7/dist-packages/pandas-0.9.1-py2.7-linux-x86_64.egg/pandas/lib.so in pandas.lib.groupsort_indexer (pandas/src/tseries.c:55052)()

MemoryError:

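Is there a hard-coded condition that raises this error? Or is it possible that the operation eats up all of the remaining memory, even though the data itself uses only 6.5% of memory (according to htop)?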

score 2 · Accepted Answer

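Could you move this to GitHub? I would need to look at the code, but there are a number of edge cases I have not tested with really deeply nested hierarchical indexes. So this may be a legitimate bug.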

Edit: This has been fixed in v0.10.1.
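
For anyone still on an affected version, the traceback hints at one plausible cause. `_indexer_from_factorized` is called with `compress=False`, sets `max_group = np.prod(shape)` (the product of the level sizes), and passes that to `lib.groupsort_indexer`. If that routine allocates a buffer with one int64 slot per possible group, which is an assumption about the 0.9.1 internals rather than something the traceback proves, then the memory needed depends on the product of the level cardinalities and not on the number of rows. A rough sketch of that estimate follows; `estimate_groupsort_bytes` is a hypothetical helper and the level sizes are made up, not taken from the question.

```python
def estimate_groupsort_bytes(level_sizes, itemsize=8):
    """Rough size in bytes of a buffer holding one slot per possible key combination.

    Assumes (not confirmed by the traceback) that the sort allocates
    max_group + 1 int64 counters, where max_group is the product of the
    level sizes.
    """
    max_group = 1
    for n in level_sizes:
        # Plain Python ints cannot overflow, unlike a fixed-width int64 product.
        max_group *= int(n)
    return (max_group + 1) * itemsize


# Hypothetical cardinalities for a 5-level index.
level_sizes = [200, 150, 100, 50, 20]
needed = estimate_groupsort_bytes(level_sizes)
print("approx. %.1f GiB" % (needed / 2.0 ** 30))
```

On a real frame the sizes could be read with `[len(level) for level in df.index.levels]`. If the estimate is far larger than the available RAM, that would be consistent with a MemoryError even though the data itself occupies only a small share of memory.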
