python - MemoryError with defaultdict(int)

Question

I'm using a defaultdict(int) to record the number of occurrences of words in a set of books.

Python is consuming 1.5 GB of memory when I get the memory exception:

  File "C:\Python32\lib\collections.py", line 540, in update
    _count_elements(self, iterable)
MemoryError

My counter's size is over 8,000,000.

I have at least 20,000,000 unique words to count. What can I do to avoid the memory exception?

score 1 · Accepted Answer

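Even if you have a 64-bit system with plenty of memory, I don't think keeping track of that many words in a dict is a workable idea. You should use a database.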

/* If we added a key, we can safely resize.  Otherwise just return!
 * If fill >= 2/3 size, adjust size.  Normally, this doubles or
 * quadruples the size, but it's also possible for the dict to shrink
 * (if ma_fill is much larger than ma_used, meaning a lot of dict
 * keys have been deleted).
 *
 * Quadrupling the size improves average dictionary sparseness
 * (reducing collisions) at the cost of some memory and iteration
 * speed (which loops over every possible entry).  It also halves
 * the number of expensive resize operations in a growing dictionary.
 *
 * Very large dictionaries (over 50K items) use doubling instead.
 * This may help applications with severe memory constraints.
 */
if (!(mp->ma_used > n_used && mp->ma_fill*3 >= (mp->ma_mask+1)*2))
    return 0;
return dictresize(mp, (mp->ma_used > 50000 ? 2 : 4) * mp->ma_used);

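As the code shows, if you insert too many items the dict has to grow, not only to make room for the items it already contains but also to provide free slots for new ones. Once more than 2/3 of the table is filled, the dict is resized to roughly twice the number of entries in use (or four times when it holds 50,000 items or fewer). Personally, I use dicts to hold fewer than a few hundred thousand items. Even with fewer than a million items, one consumed several gigabytes and almost froze my 8 GB Win7 machine.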
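To see the growth steps on your own interpreter, here is a minimal sketch that records each point where the dict's reported size changes. The helper name `resize_points` is made up for illustration. The exact thresholds and byte counts depend on the Python version and platform (the C excerpt above is from the 3.2-era implementation), and `sys.getsizeof` covers only the dict's own table, not the key and value objects it references.

```python
import sys


def resize_points(n_items):
    """Return (item_count, size_in_bytes) each time the dict's reported size changes.

    Hypothetical helper for illustration only.
    """
    d = {}
    last = sys.getsizeof(d)
    points = [(0, last)]
    for i in range(n_items):
        d[i] = 0
        size = sys.getsizeof(d)
        if size != last:
            # The internal table was reallocated at this insertion.
            points.append((len(d), size))
            last = size
    return points


if __name__ == "__main__":
    for count, size in resize_points(100_000):
        print(f"{count:>7} items -> {size:>9} bytes")
```

The table grows in a small number of large jumps rather than one slot at a time, so a dict that has just resized can hold much more memory than its current contents strictly need.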

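If you are just counting items, you could:

split the words into chunks
count the words in each chunk
update the database

With a reasonable chunk size, issuing a few database queries per chunk would be much better IMHO (assuming database access is going to be the bottleneck).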
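The three steps above could be sketched roughly as follows, using the standard-library `sqlite3` module as the database. The table name `word_counts`, the helper names, and the default chunk size are assumptions made for illustration, not something the answer specifies. Any database would do.

```python
import sqlite3
from collections import Counter
from itertools import islice


def iter_chunks(words, chunk_size):
    """Yield successive lists of at most chunk_size words from any iterable."""
    it = iter(words)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def count_words(words, conn, chunk_size=100_000):
    """Count words chunk by chunk, keeping the running totals in SQLite."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS word_counts ("
        "word TEXT PRIMARY KEY, n INTEGER NOT NULL)"
    )
    for chunk in iter_chunks(words, chunk_size):
        # In-memory counting is limited to one chunk at a time.
        counts = Counter(chunk)
        with conn:  # one transaction per chunk
            # Make sure every word in this chunk has a row...
            conn.executemany(
                "INSERT OR IGNORE INTO word_counts (word, n) VALUES (?, 0)",
                ((w,) for w in counts),
            )
            # ...then add this chunk's counts to the stored totals.
            conn.executemany(
                "UPDATE word_counts SET n = n + ? WHERE word = ?",
                ((n, w) for w, n in counts.items()),
            )


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # use a file path for real data
    count_words("the cat and the hat and the bat".split(), conn, chunk_size=3)
    for word, n in conn.execute("SELECT word, n FROM word_counts ORDER BY word"):
        print(word, n)
```

Because `words` can be any iterable, it can be a generator that reads the books lazily, so memory use is bounded by the chunk size rather than by the total number of unique words.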
