
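I am trying to run some analysis on a large dictionary that I build by reading a file from disk. The read itself results in a stable memory footprint. I then have a method that performs some calculations on data I copy out of that dictionary into a temporary dictionary. I do this so that all the copying and use of the data stays within the method's scope and, I hoped, disappears when the method call ends.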

Sadly, I am doing something wrong. customerdict is defined as follows (at the top of the .py file):

customerdict = collections.defaultdict(dict)

The object has the format {customerid: {id: 0 or 1}}.

There is also a similarly defined dictionary called allids.

I have a method that computes the sim_pearson distance (modified code from the book Programming Collective Intelligence), shown below.

from math import sqrt

def sim_pearson(custID1, custID2):
    si = []

    smallcustdict = {}
    smallcustdict[custID1] = customerdict[custID1]
    smallcustdict[custID2] = customerdict[custID2]

    #a loop to round out the remaining allids object to fill in 0 values
    for customerID, catalog in smallcustdict.iteritems():
        for id in allids:
            if id not in catalog:
                smallcustdict[customerID][id] = 0.0

    #get the list of mutually rated items
    for id in smallcustdict[custID1]:
        if id in smallcustdict[custID2]:
            si.append(id) # = 1

    #return 0 if there are no matches
    if len(si) == 0: return 0

    #add up all the preferences
    sum1 = sum([smallcustdict[custID1][id] for id in si])
    sum2 = sum([smallcustdict[custID2][id] for id in si])

    #sum up the squares
    sum1sq = sum([pow(smallcustdict[custID1][id],2) for id in si])
    sum2sq = sum([pow(smallcustdict[custID2][id],2) for id in si])

    #sum up the products
    psum = sum([smallcustdict[custID1][id] * smallcustdict[custID2][id] for id in si])

    #calc Pearson score
    num = psum - (sum1*sum2/len(si))
    den = sqrt((sum1sq - pow(sum1,2)/len(si)) * (sum2sq - pow(sum2,2)/len(si)))

    del smallcustdict
    del si
    del sum1
    del sum2
    del sum1sq
    del sum2sq
    del psum

    if den == 0:
        return 0

    return num/den

Every pass through the sim_pearson method grows the memory footprint of python.exe without bound. I tried explicitly deleting the local variables with "del".

Looking at Task Manager, memory climbs in increments of 6-10 MB. After the initial customerdict is set up, the footprint is 137 MB.

Any ideas why I am running out of memory doing it this way?


2 Answers


I think the problem is here:

smallcustdict[custID1] = customerdict[custID1]
smallcustdict[custID2] = customerdict[custID2]

#a loop to round out the remaining allids object to fill in 0 values
for customerID, catalog in smallcustdict.iteritems():
    for id in allids:
        if id not in catalog:
            smallcustdict[customerID][id] = 0.0

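The dictionaries from customerdict are referenced by smallcustdict, not copied, so whatever you add to them persists in customerdict. This is the only place I can see where you do anything that outlives the function's scope, so I suspect this is the problem.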
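
To illustrate the aliasing, here is a minimal sketch written for Python 3. The sample data, the `all_ids` list, and the helper `fill_zeros` are made up for illustration and are not part of the question's code.

```python
import collections

# Hypothetical sample data, shaped like the question's customerdict.
customerdict = collections.defaultdict(dict)
customerdict["c1"] = {"a": 1}
all_ids = ["a", "b", "c"]

def fill_zeros(cust_id):
    small = {}
    small[cust_id] = customerdict[cust_id]  # stores a reference, not a copy
    for item_id in all_ids:
        if item_id not in small[cust_id]:
            small[cust_id][item_id] = 0.0   # this write lands in customerdict too
    # True when both names point at the very same dict object
    return small[cust_id] is customerdict[cust_id]

same_object = fill_zeros("c1")
entries_after_call = len(customerdict["c1"])
```

After the call, the inner dictionary for "c1" has grown from one entry to one entry per id in `all_ids`, and it stays that size after the function returns. Repeated across many customer pairs, this would match the steady growth described in the question.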

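Note that you are making a lot of work for yourself in many places by not using list comprehensions, by doing the same thing repeatedly, and by not writing generic code. A better version might look like this: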

import collections
import functools
import operator
from math import sqrt

customerdict = collections.defaultdict(dict)

def sim_pearson(custID1, custID2):

    #Declaring as a dict literal is nicer.
    smallcustdict = {
        custID1: customerdict[custID1],
        custID2: customerdict[custID2],
    }

    # Unchanged, as I'm not sure what the intent is here.
    for customerID, catalog in smallcustdict.iteritems():
        for id in allids:
            if id not in catalog:
                smallcustdict[customerID][id] = 0.0

    #dict views are set-like, so the easier way to do what you want is the intersection of the two.
    si = smallcustdict[custID1].viewkeys() & smallcustdict[custID2].viewkeys()

    #if not is a cleaner way of checking for no values.
    if not si:
        return 0

    #Made more generic to avoid repetition and wastefully looping repeatedly.
    parts = [list(part) for part in zip(*((value[id] for value in smallcustdict.values()) for id in si))]

    sums = [sum(part) for part in parts]
    sumsqs = [sum(pow(i, 2) for i in part) for part in parts]
    psum = sum(functools.reduce(operator.mul, part) for part in zip(*parts))

    sum1, sum2 = sums
    sum1sq, sum2sq = sumsqs

    #Unchanged.
    num = psum - (sum1*sum2/len(si))
    den = sqrt((sum1sq - pow(sum1,2)/len(si)) * (sum2sq - pow(sum2,2)/len(si)))

    #Again using if not.
    if not den:
        return 0
    else:
        return num/den

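Note that this is completely untested, as the code you provided is not a complete example. However, it should be easy enough to use as a basis for improvement.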
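
If the zero-filling loop exists only so that missing ids count as 0.0, one possible alternative is to read the values with `dict.get` and never write to the dictionaries at all. The sketch below is for Python 3 and is not the code from the question or the version above: the function name `sim_pearson_readonly` and its parameters are hypothetical, and it assumes every id a customer has also appears in `all_ids`.

```python
from math import sqrt

def sim_pearson_readonly(prefs1, prefs2, all_ids):
    """Pearson correlation over all_ids, treating missing ids as 0.0."""
    # .get() only reads, so neither input dictionary grows.
    xs = [float(prefs1.get(i, 0.0)) for i in all_ids]
    ys = [float(prefs2.get(i, 0.0)) for i in all_ids]
    n = len(xs)
    if n == 0:
        return 0

    sum1, sum2 = sum(xs), sum(ys)
    sum1sq = sum(x * x for x in xs)
    sum2sq = sum(y * y for y in ys)
    psum = sum(x * y for x, y in zip(xs, ys))

    num = psum - sum1 * sum2 / n
    den = sqrt((sum1sq - sum1 ** 2 / n) * (sum2sq - sum2 ** 2 / n))
    return num / den if den else 0

# Hypothetical usage with two small preference dictionaries.
prefs_a = {"a": 1, "b": 1}
prefs_b = {"c": 1, "d": 1}
ids = ["a", "b", "c", "d"]
score_same = sim_pearson_readonly(prefs_a, prefs_a, ids)
score_opposite = sim_pearson_readonly(prefs_a, prefs_b, ids)
```

The two temporary lists are local to the call, so they can be reclaimed once it returns, and the stored preference dictionaries keep their original size.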

answered 2012-11-10T01:07:57.047

Try changing the following two lines:

smallcustdict[custID1] = customerdict[custID1]
smallcustdict[custID2] = customerdict[custID2]

to:

smallcustdict[custID1] = customerdict[custID1].copy()
smallcustdict[custID2] = customerdict[custID2].copy()

That way, the changes you make to the two dictionaries will not persist in customerdict when the sim_pearson() function returns.
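
A small sketch of the effect of `.copy()`, written for Python 3. The sample data, `all_ids`, and the helper `fill_zeros_on_copy` are assumptions made for illustration only.

```python
# Hypothetical sample data, shaped like the question's customerdict.
customerdict = {"c1": {"a": 1}}
all_ids = ["a", "b", "c"]

def fill_zeros_on_copy(cust_id):
    # .copy() builds a new dict, so the writes below never reach customerdict.
    catalog = customerdict[cust_id].copy()
    for item_id in all_ids:
        catalog.setdefault(item_id, 0.0)  # adds 0.0 only for missing ids
    return catalog

filled = fill_zeros_on_copy("c1")
```

`dict.copy()` is a shallow copy, which should be enough here because the values are plain numbers. The copy still costs temporary memory proportional to the number of ids on each call, but that memory can be released once the function returns.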

answered 2012-11-10T03:06:33.907