I have a table that looks like this:

id  value
AGA 0.211
AGA 0.433
AGA 0.123
AGH 0.002
DHI 0.063
DHI 0.193
DHI 0.004
KHI 0.543
KHI 0.064
HID 0.234

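Each id can have several values. I want to count how many entries each id has, and compute the sum and the average of the values for each id, so the result would look like this: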

id      cnt   sum   av
AGA     3     0.76  0.25
AGH     1     0.002 0.002
DHI     3     0.26  0.087
KHI     2     0.607 0.304
HID     1     0.234 0.234 

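I thought it would be best to first build a dictionary that counts each entry, but then I got stuck. I'm not sure whether it would be better to make each dictionary value an array (holding cnt, sum and av) and then use the range of cnt for the calculation, but I can't work out how! This is how far I got: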

idDict = {}
for line in file:
    line = line.rstrip()
    f = line.split()
    id = f[0]
    idDict[id] = idDict.get(id, 0) + 1

But once I have built the dictionary with cnt here, I don't know how to go through each id to do the sum and av calculations :(

2 Answers

Here is one way to get started, using defaultdict:

from collections import defaultdict

mylist=[('AGA' ,0.211), ('AGA' ,0.433), ('AGA' ,0.123), ('AGH' ,0.002), 
        ('DHI', 0.063), ('DHI' ,0.193), ('DHI' ,0.004), ('KHI' ,0.543),
        ('KHI' ,0.064), ('HID' ,0.234)]

mydict = defaultdict(list)
for key, val in mylist:
    mydict[key].append(val)

summary = {}
for key, val in mydict.items():
    summary[key] = len(val), sum(val), sum(val)/len(val)

print(summary)
# Output (key order and float formatting depend on the Python version):
{'KHI': (2, 0.60699999999999998, 0.30349999999999999), 
 'HID': (1, 0.23400000000000001, 0.23400000000000001), 
 'AGA': (3, 0.76700000000000002, 0.25566666666666665), 
 'DHI': (3, 0.26000000000000001, 0.08666666666666667), 
 'AGH': (1, 0.002, 0.002)}
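
If the per-id lists are not needed afterwards, a variation on the above is to keep only a running count and sum for each id instead of every value. The following is a minimal sketch under that assumption; the function name `summarize` and the variable `pairs` are made up for illustration and are not part of the answer above.

```python
from collections import defaultdict

def summarize(pairs):
    """Return {id: (count, total, average)} for an iterable of (id, value) pairs."""
    # totals[key] holds [count, running sum], so only two numbers are kept per id.
    totals = defaultdict(lambda: [0, 0.0])
    for key, val in pairs:
        totals[key][0] += 1
        totals[key][1] += val
    # The average is derived at the end from the two running numbers.
    return {key: (cnt, total, total / cnt) for key, (cnt, total) in totals.items()}

pairs = [('AGA', 0.211), ('AGA', 0.433), ('AGA', 0.123), ('AGH', 0.002),
         ('DHI', 0.063), ('DHI', 0.193), ('DHI', 0.004), ('KHI', 0.543),
         ('KHI', 0.064), ('HID', 0.234)]

summary = summarize(pairs)
for key, (cnt, total, avg) in summary.items():
    # Round only for display; the stored values keep full precision.
    print("%s %d %.3f %.3f" % (key, cnt, total, avg))
```

This gives the same (cnt, sum, av) tuples as the list-based version, and it is close to what the question was reaching for with "an array holding cnt, sum and av" as the dictionary value.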
answered 2012-06-08T15:26:05.570

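Since the data in your table appears to be sorted, there is really no need to put everything into a dictionary first, although doing so might make things clearer. But I guess your table could get large, so storing everything a second time would be a resource killer...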

def sum_up(key, values):
    # Summarize all values collected for one id.
    counted = len(values)
    summed = sum(values)
    avrg = summed / counted
    # print, insert or do whatever is needed with the results:
    print(key, counted, summed, avrg)

last_id = None
current = []
# file: an open file object (or any iterable of "id value" lines, no header)
for line in file:
    (key, value) = line.split()
    if last_id != key:
        if last_id is not None:
            # evaluate last id
            sum_up(last_id, current)
            current = []
        # remember id
        last_id = key
    # append to the current id's entries; convert to float so sum() works
    current.append(float(value))

# do the last id, if there is any:
if current:
    sum_up(last_id, current)

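I haven't tested that code, but you should get the idea. It looks a bit more complicated, but once you have >100k lines or so, you should notice the difference compared with loading everything into memory first and then processing it.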
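
The same streaming idea can also be sketched with `itertools.groupby` from the standard library, which groups consecutive items that share a key. As above, this assumes that all lines for one id are adjacent and that there is no header line; the helper name `summarize_sorted` and the `sample` lines are illustrative only.

```python
from itertools import groupby

def summarize_sorted(lines):
    """Yield (id, count, total, average) for lines whose ids are grouped together."""
    # Split each non-empty line into [id, value]; the generator avoids
    # holding the whole input in memory.
    rows = (line.split() for line in lines if line.strip())
    for key, group in groupby(rows, key=lambda row: row[0]):
        count = 0
        total = 0.0
        for _, value in group:
            count += 1
            total += float(value)
        yield key, count, total, total / count

sample = ["AGA 0.211", "AGA 0.433", "AGA 0.123",
          "AGH 0.002", "KHI 0.543", "KHI 0.064"]

results = list(summarize_sorted(sample))
for key, count, total, avg in results:
    print("%s %d %.3f %.3f" % (key, count, total, avg))
```

An open file object can be passed in place of `sample`. If the ids are not grouped, `groupby` would report the same id more than once, so the dictionary approach is the safer choice for unsorted input.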

answered 2012-06-08T15:39:14.287