I have data in the following form:

key, time_bin, count
abc, 1, 200
abc, 2,230
abc1,1,300
abc1,2,180
abc2,1, 300
abc2,2, 800

Each key has the same number of time bins.

I want to find, for each time bin, the top n keys by count.

For the example above, say I want the top 2 keys for each time bin. The answer is:

1=> [{"abc1":300},{"abc2":300}]
2=> [{"abc2":800},{"abc":230}]

What is a good way to solve this?

1 Answer

Use collections.Counter together with collections.defaultdict:

from collections import Counter, defaultdict
import csv

# One Counter per time bin, mapping key -> total count
counts = defaultdict(Counter)

with open(somefilename, newline='') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header
    for row in reader:
        # int() tolerates the stray spaces around the numeric fields
        key, time_bin, count = row[0], int(row[1]), int(row[2])
        counts[time_bin][key] += count

for time_bin in sorted(counts):
    print('{}=> {}'.format(time_bin, counts[time_bin].most_common(2)))

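The Counter.most_common() method is especially useful here; given a number n, it returns the n entries with the highest counts from a Counter, and here there is one Counter per time bin.

The output format almost matches your example:

1=> [('abc1', 300), ('abc2', 300)]
2=> [('abc2', 800), ('abc', 230)]

The difference is that .most_common() returns a list of tuples, not dictionaries.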
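If the single-entry dictionaries from the question are really wanted, the tuples can be converted after the fact. The following is a minimal sketch, not part of the original answer: the helper name top_n_per_bin and the in-memory SAMPLE string are made up for illustration, and it assumes Python 3 and that a list of one-item dicts per time bin is the intended shape. It also passes skipinitialspace=True to csv.reader so stray spaces after the commas are dropped.

```python
from collections import Counter, defaultdict
import csv
import io

# Sample rows from the question, held in memory so the sketch is self-contained.
SAMPLE = """key, time_bin, count
abc, 1, 200
abc, 2,230
abc1,1,300
abc1,2,180
abc2,1, 300
abc2,2, 800
"""


def top_n_per_bin(lines, n):
    """Return {time_bin: [{key: count}, ...]} with the n largest counts per bin.

    Hypothetical helper for illustration; `lines` is any iterable of CSV lines.
    """
    counts = defaultdict(Counter)
    reader = csv.reader(lines, skipinitialspace=True)
    next(reader)  # skip the header
    for key, time_bin, count in reader:
        counts[int(time_bin)][key] += int(count)
    # Turn each (key, count) tuple from most_common() into a one-item dict.
    return {
        time_bin: [{key: count} for key, count in counter.most_common(n)]
        for time_bin, counter in counts.items()
    }


result = top_n_per_bin(io.StringIO(SAMPLE), 2)
for time_bin in sorted(result):
    print('{}=> {}'.format(time_bin, result[time_bin]))
```

Keys with equal counts keep the order in which they were first seen, so abc1 is listed before abc2 for time bin 1.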

Answered 2013-04-11T21:21:51.007