8

I have a list of values. I want to count, inside a loop, the number of elements in each class (i.e. 1, 2, 3, 4, 5).

mylist = [1,1,1,1,1,1,2,3,2,2,2,2,3,3,4,5,5,5,5]
mydict = dict()
for index in mylist:
    mydict[index] = +1
mydict
Out[344]: {1: 1, 2: 1, 3: 1, 4: 1, 5: 1}

I would like to get this result:

Out[344]: {1: 6, 2: 5, 3: 3, 4: 1, 5: 4}
4

5 Answers

14

For your smaller example, with a limited diversity of elements, you can use a set and a dict comprehension:

>>> mylist = [1,1,1,1,1,1,2,3,2,2,2,2,3,3,4,5,5,5,5]
>>> {k:mylist.count(k) for k in set(mylist)}
{1: 6, 2: 5, 3: 3, 4: 1, 5: 4}

To break it down, set(mylist) uniquifies the list and makes it more compact:

>>> set(mylist)
set([1, 2, 3, 4, 5])

Then the dictionary comprehension steps through the unique values and sets the count from the list.
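
As a rough illustration of what that comprehension does, it can be unrolled into an explicit loop. This is only a sketch; the name `counts` is an assumption and does not appear in the original answer.

```python
mylist = [1,1,1,1,1,1,2,3,2,2,2,2,3,3,4,5,5,5,5]

counts = {}
for k in set(mylist):            # one iteration per distinct value
    counts[k] = mylist.count(k)  # each count() call scans the whole list

print(counts == {1: 6, 2: 5, 3: 3, 4: 1, 5: 4})  # True
```

The cost is one full pass over the list per distinct value, which is cheap here because there are only five distinct values.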

On this small list it is also significantly faster than Counter, and faster than setdefault:

from __future__ import print_function
from collections import Counter
from collections import defaultdict
import random

mylist=[1,1,1,1,1,1,2,3,2,2,2,2,3,3,4,5,5,5,5]*10

def s1(mylist):
    return {k:mylist.count(k) for k in set(mylist)}

def s2(mylist):
    return Counter(mylist)

def s3(mylist):
    mydict=dict()
    for index in mylist:
        mydict[index] = mydict.setdefault(index, 0) + 1
    return mydict   

def s4(mylist):
    mydict={}.fromkeys(mylist,0)
    for k in mydict:
        mydict[k]=mylist.count(k)    
    return mydict    

def s5(mylist):
    mydict={}
    for k in mylist:
        mydict[k]=mydict.get(k,0)+1
    return mydict     

def s6(mylist):
    mydict=defaultdict(int)
    for i in mylist:
        mydict[i] += 1
    return mydict       

def s7(mylist):
    mydict={}.fromkeys(mylist,0)
    for e in mylist:
        mydict[e]+=1    
    return mydict    

if __name__ == '__main__':   
    import timeit 
    n=1000000
    print(timeit.timeit("s1(mylist)", setup="from __main__ import s1, mylist",number=n))
    print(timeit.timeit("s2(mylist)", setup="from __main__ import s2, mylist, Counter",number=n))
    print(timeit.timeit("s3(mylist)", setup="from __main__ import s3, mylist",number=n))
    print(timeit.timeit("s4(mylist)", setup="from __main__ import s4, mylist",number=n))
    print(timeit.timeit("s5(mylist)", setup="from __main__ import s5, mylist",number=n))
    print(timeit.timeit("s6(mylist)", setup="from __main__ import s6, mylist, defaultdict",number=n))
    print(timeit.timeit("s7(mylist)", setup="from __main__ import s7, mylist",number=n))

On my machine that prints (Python 3):

18.123854104997008          # set and dict comprehension 
78.54796334600542           # Counter 
33.98185228800867           # setdefault 
19.0563529439969            # fromkeys / count 
34.54294775899325           # dict.get 
21.134678319009254          # defaultdict 
22.760544238000875          # fromkeys / loop

For larger lists, such as 10 million integers with more diverse elements (about 1,500 distinct random ints), use defaultdict or fromkeys in a loop:

from __future__ import print_function
from collections import Counter
from collections import defaultdict
import random

mylist = [random.randint(0,1500) for _ in range(10000000)]

def s1(mylist):
    return {k:mylist.count(k) for k in set(mylist)}

def s2(mylist):
    return Counter(mylist)

def s3(mylist):
    mydict=dict()
    for index in mylist:
        mydict[index] = mydict.setdefault(index, 0) + 1
    return mydict   

def s4(mylist):
    mydict={}.fromkeys(mylist,0)
    for k in mydict:
        mydict[k]=mylist.count(k)    
    return mydict    

def s5(mylist):
    mydict={}
    for k in mylist:
        mydict[k]=mydict.get(k,0)+1
    return mydict     

def s6(mylist):
    mydict=defaultdict(int)
    for i in mylist:
        mydict[i] += 1
    return mydict       

def s7(mylist):
    mydict={}.fromkeys(mylist,0)
    for e in mylist:
        mydict[e]+=1    
    return mydict    

if __name__ == '__main__':   
    import timeit 
    n=1
    print(timeit.timeit("s1(mylist)", setup="from __main__ import s1, mylist",number=n))
    print(timeit.timeit("s2(mylist)", setup="from __main__ import s2, mylist, Counter",number=n))
    print(timeit.timeit("s3(mylist)", setup="from __main__ import s3, mylist",number=n))
    print(timeit.timeit("s4(mylist)", setup="from __main__ import s4, mylist",number=n))
    print(timeit.timeit("s5(mylist)", setup="from __main__ import s5, mylist",number=n))
    print(timeit.timeit("s6(mylist)", setup="from __main__ import s6, mylist, defaultdict",number=n))
    print(timeit.timeit("s7(mylist)", setup="from __main__ import s7, mylist",number=n))

Prints:

2825.2697427899984              # set and dict comprehension 
42.607481333994656              # Counter 
22.77713537499949               # setdefault 
2853.11187016801                # fromkeys / count 
23.241977066005347              # dict.get 
15.023175164998975              # defaultdict 
18.28165417900891               # fromkeys / loop

You can see that the solutions that rely on count, which must pass through the large list once per distinct value, suffer catastrophically in comparison to the single-pass solutions.
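
A scaled-down sketch of the same comparison, small enough to run in about a second, may make the difference easier to reproduce. The function names and the sample size of 20,000 are assumptions chosen for illustration; they are not the settings used for the timings above.

```python
import random
import timeit
from collections import defaultdict

def count_with_list_count(values):
    # One full scan of `values` per distinct key:
    # roughly len(values) * len(set(values)) steps in total.
    return {k: values.count(k) for k in set(values)}

def count_with_defaultdict(values):
    # A single pass over `values`, whatever the number of distinct keys.
    counts = defaultdict(int)
    for v in values:
        counts[v] += 1
    return counts

random.seed(0)  # fixed seed so reruns use the same data
sample = [random.randint(0, 1500) for _ in range(20000)]

# Both approaches must agree on the result; only the running time differs.
assert count_with_list_count(sample) == count_with_defaultdict(sample)

for func in (count_with_list_count, count_with_defaultdict):
    elapsed = timeit.timeit(lambda: func(sample), number=1)
    print(func.__name__, elapsed)
```

The gap between the two should widen as either the list length or the number of distinct values grows, since the count-based version multiplies the two.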

answered 2013-08-20T23:28:03.433
6

Try collections.Counter:

   >>> from collections import Counter
   >>> Counter([1,1,1,1,1,1,2,3,2,2,2,2,3,3,4,5,5,5,5])
   Counter({1: 6, 2: 5, 5: 4, 3: 3, 4: 1})

In your code, you can basically replace mydict with a Counter and write

mydict[index] += 1

instead of

mydict[index] = +1
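
Put together, the question's loop with that substitution might look like the sketch below. It works because a Counter reports 0 for a missing key, so `+= 1` is safe the first time a value is seen.

```python
from collections import Counter

mylist = [1,1,1,1,1,1,2,3,2,2,2,2,3,3,4,5,5,5,5]

mydict = Counter()       # missing keys read as 0
for index in mylist:
    mydict[index] += 1   # augmented assignment, not "= +1"

print(dict(mydict))  # {1: 6, 2: 5, 3: 3, 4: 1, 5: 4}
```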
answered 2013-08-20T19:30:24.113
4

To correct your code:

mydict[index] = +1

should be:

mydict[index] = mydict.setdefault(index, 0) + 1
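
The reason the original line always stores 1 is that `= +1` is a plain assignment of the expression `+1` (unary plus), not the augmented assignment `+= 1`. The sketch below runs both versions side by side; the names `broken` and `fixed` are assumptions used only for this illustration.

```python
mylist = [1,1,1,1,1,1,2,3,2,2,2,2,3,3,4,5,5,5,5]

broken = {}
for index in mylist:
    broken[index] = +1   # same as broken[index] = 1; overwrites every time

fixed = {}
for index in mylist:
    # setdefault inserts 0 the first time a key is seen, then returns the stored value
    fixed[index] = fixed.setdefault(index, 0) + 1

print(broken)  # {1: 1, 2: 1, 3: 1, 4: 1, 5: 1}
print(fixed)   # {1: 6, 2: 5, 3: 3, 4: 1, 5: 4}
```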
answered 2013-08-20T19:30:34.623
4

A variant of the setdefault method is collections.defaultdict. It is a bit faster.

from collections import defaultdict

def foo(mylist):
    d = defaultdict(int)  # missing keys start at 0
    for i in mylist:
        d[i] += 1
    return d

itertools.groupby offers another option. Its speed is roughly the same as Counter (at least on 2.7).

import itertools
{x[0]:len(list(x[1])) for x in itertools.groupby(sorted(mylist))}
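
The sorted() call matters: groupby only merges adjacent equal items. The sketch below shows what happens without it; the variable names are assumptions introduced for this illustration.

```python
import itertools

mylist = [1,1,1,1,1,1,2,3,2,2,2,2,3,3,4,5,5,5,5]

# Without sorting, every run of equal neighbours is its own group,
# and a later run for the same key overwrites the earlier one in the dict.
unsorted_counts = {key: len(list(group))
                   for key, group in itertools.groupby(mylist)}

# With sorting, all equal items are adjacent, so each key appears once.
sorted_counts = {key: len(list(group))
                 for key, group in itertools.groupby(sorted(mylist))}

print(unsorted_counts)  # {1: 6, 2: 4, 3: 2, 4: 1, 5: 4}  (wrong for 2 and 3)
print(sorted_counts)    # {1: 6, 2: 5, 3: 3, 4: 1, 5: 4}
```

Sorting costs O(n log n), so this route is not a single-pass count in the way the dict-based loops are.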

However, timings on this small test list may not carry over to the 32 GB of data that the OP mentioned in a comment.


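I ran several of these options on the word count case from "python top N word count, why multiprocess slower then single process".

There the OP used Counter and tried to speed it up with multiprocessing. With a 1.2 MB text file, counting with defaultdict was fast, taking only 0.2 seconds. Sorting the output to get the top 40 words took as long as the counting itself.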
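
Since extracting the top words cost as much as counting them, one option worth trying is heapq.nlargest, which avoids sorting every entry when only a few are needed. This is a sketch under stated assumptions: the sample `text` stands in for the file contents, `top_n` is 3 rather than 40, and it is not the code that was timed above.

```python
import heapq
from collections import defaultdict

# Stand-in for the contents of the text file (assumption for illustration)
text = "the quick brown fox jumps over the lazy dog the fox"

word_counts = defaultdict(int)
for word in text.split():
    word_counts[word] += 1

top_n = 3  # the linked case asked for the top 40
# nlargest keeps only top_n candidates instead of sorting all entries
top_words = heapq.nlargest(top_n, word_counts.items(), key=lambda item: item[1])
print(top_words)
```

Whether this beats a full sort depends on how many distinct words there are relative to `top_n`, so it is worth measuring on the real data.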

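Counter is a bit slower on 3.2, and much slower on 2.7. That is because 3.2 has a compiled version (a .so file).

A counter built on mylist.count, however, grinds to a halt on a large list: it took nearly 200 seconds. It has to search that large list many times, once to collect the keys and then once more for each key when counting.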

answered 2013-08-21T05:17:42.143
1

Your code assigns 1 as the value for every key. Replace mydict[index] = +1 with mydict[index] = mylist.count(index).

This should work:

mylist = [1,1,1,1,1,1,2,3,2,2,2,2,3,3,4,5,5,5,5]
mydict = dict()
for index in mylist:
    mydict[index] = mylist.count(index)
mydict
answered 2013-08-20T20:49:45.080