Even if you have a 64-bit system with a lot of memory, I don't think using a dict is a good idea here. You should use a database.
/* If we added a key, we can safely resize.  Otherwise just return!
 * If fill >= 2/3 size, adjust size.  Normally, this doubles or
 * quadruples the size, but it's also possible for the dict to shrink
 * (if ma_fill is much larger than ma_used, meaning a lot of dict
 * keys have been deleted).
 *
 * Quadrupling the size improves average dictionary sparseness
 * (reducing collisions) at the cost of some memory and iteration
 * speed (which loops over every possible entry).  It also halves
 * the number of expensive resize operations in a growing dictionary.
 *
 * Very large dictionaries (over 50K items) use doubling instead.
 * This may help applications with severe memory constraints.
 */
if (!(mp->ma_used > n_used && mp->ma_fill*3 >= (mp->ma_mask+1)*2))
    return 0;
return dictresize(mp, (mp->ma_used > 50000 ? 2 : 4) * mp->ma_used);
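As you can see from the code, if you insert too many items the dict has to grow, not only to hold the items it already contains but also to keep free slots for new ones. Once more than 2/3 of the dict is filled, its size is doubled (or quadrupled if it holds fewer than 50,000 items). Personally, I only use dicts for fewer than a few hundred thousand items. Even with fewer than a million items, a dict consumed several gigabytes and almost froze my 8 GB Windows 7 machine.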
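The growth steps can be observed from Python itself. The following is a minimal sketch, assuming CPython, where `sys.getsizeof` reports the dict's own table allocation (not the keys and values it refers to). The helper name `resize_points` is hypothetical, and the exact sizes and growth factor depend on the interpreter version, so they may differ from the older source quoted above.

```python
import sys


def resize_points(n_items):
    """Return (item_count, size_in_bytes) pairs, one for each point
    at which the dict's reported size changes while it is being filled."""
    d = {}
    points = [(0, sys.getsizeof(d))]
    for i in range(n_items):
        d[i] = None
        size = sys.getsizeof(d)  # size of the dict's table only, not its contents
        if size != points[-1][1]:
            points.append((len(d), size))
    return points


if __name__ == "__main__":
    for count, size in resize_points(100_000):
        print(f"{count:>7} items -> {size:>9} bytes")
```

Each printed line marks a resize. The real memory cost is higher than these numbers, because the key and value objects are stored separately from the table.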
If you are only counting items, you can:
split the words into chunks
count the words in each chunk
update the database
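With a reasonable chunk size, performing a few database queries per chunk would be much better, assuming database access is the bottleneck.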
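The chunked approach could be sketched as follows. This is only an illustration under stated assumptions: it uses SQLite through the standard `sqlite3` module as the database, and the table name `word_counts`, the helper names `chunks` and `count_words`, and the default chunk size are all hypothetical choices, not something from the original answer.

```python
import sqlite3
from collections import Counter
from itertools import islice


def chunks(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def count_words(words, conn, chunk_size=100_000):
    """Count words chunk by chunk, merging each chunk's counts into the database."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS word_counts ("
        "word TEXT PRIMARY KEY, n INTEGER NOT NULL)"
    )
    for chunk in chunks(words, chunk_size):
        counts = Counter(chunk)  # only one chunk's counts are held in memory
        # Make sure every word in this chunk has a row, then add the chunk's counts.
        conn.executemany(
            "INSERT OR IGNORE INTO word_counts (word, n) VALUES (?, 0)",
            ((word,) for word in counts),
        )
        conn.executemany(
            "UPDATE word_counts SET n = n + ? WHERE word = ?",
            ((n, word) for word, n in counts.items()),
        )
        conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # use a file path for real data
    count_words("to be or not to be".split(), conn, chunk_size=2)
    for word, n in conn.execute("SELECT word, n FROM word_counts ORDER BY word"):
        print(word, n)
```

Only the per-chunk `Counter` lives in memory, so peak memory is bounded by the number of distinct words in one chunk rather than in the whole input. A larger chunk size means fewer database round trips at the cost of more memory.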