Here is a speed comparison between a pure Python solution and a numpy solution.
In [167]: alist
Out[167]:
[[['s1', 1, 2], ['s2', 3, 4], ['s3', 5, 6]],
 [['s1', 2, 3], ['s2', 4, 5], ['s3', 6, 7]],
 [['s1', 3, 4], ['s2', 5, 6], ['s3', 7, 8]]]
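Using defaultdict from collections:
In [169]: def foo1(alist):
     ...:     # assumes `from collections import defaultdict` and `import numpy as np` were run earlier
     ...:     dd = defaultdict(list)
     ...:     for row in alist:
     ...:         for col in row:
     ...:             dd[col[0]].append(col[1:])
     ...:     return [[k, np.mean(v,0)] for k,v in dd.items()]
     ...: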
In [170]: foo1(alist)
Out[170]: [['s1', array([2., 3.])], ['s2', array([4., 5.])], ['s3', array([6., 7.])]]
The output list isn't ideal, but it is good enough for testing. It also isn't pure Python, since I use np.mean for each key.
A numpy solution using a 3d object dtype array (which preserves the strings):
In [171]: def foo2(alist):
     ...:     arr = np.array(alist, object)
     ...:     lbl = arr[0,:,0][:,None]
     ...:     res = arr[:,:,1:].mean(axis=0)
     ...:     return np.concatenate((lbl,res),axis=1)
     ...:
In [172]: foo2(alist)
Out[172]:
array([['s1', 2.0, 3.0],
       ['s2', 4.0, 5.0],
       ['s3', 6.0, 7.0]], dtype=object)
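To make the indexing in foo2 easier to follow, here is a step-by-step sketch of the same operations with the intermediate shapes written out. The axis names in the comments (experiment, label, field) are my own labels, not something the data defines. Note that the labels come from the first experiment only, so this assumes every experiment lists the labels in the same order.

```python
import numpy as np

alist = [
    [['s1', 1, 2], ['s2', 3, 4], ['s3', 5, 6]],
    [['s1', 2, 3], ['s2', 4, 5], ['s3', 6, 7]],
    [['s1', 3, 4], ['s2', 5, 6], ['s3', 7, 8]],
]

# Object dtype keeps the strings and numbers side by side in one array.
arr = np.array(alist, dtype=object)        # shape (3, 3, 3): (experiment, label, field)

# Labels are read from the first experiment, then turned into a column.
lbl = arr[0, :, 0][:, None]                # shape (3, 1)

# Drop the label field and average the numbers across experiments.
res = arr[:, :, 1:].mean(axis=0)           # shape (3, 2)

# Glue the label column back onto the averaged values.
out = np.concatenate((lbl, res), axis=1)   # shape (3, 3), dtype=object

print(arr.shape, lbl.shape, res.shape, out.shape)
```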
Some timings:
In [173]: timeit foo1(alist)
98.2 µs ± 256 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [174]: timeit foo2(alist)
42.1 µs ± 1.02 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
For a large list:
In [175]: blist=alist*10000
In [176]: timeit foo1(blist)
71.9 ms ± 350 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [177]: timeit foo2(blist)
46.8 ms ± 489 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
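So numpy's speed advantage is roughly 2x on the small list and roughly 1.5x on the large one. Nice, but not a make-or-break advantage.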
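If you want to repeat this comparison outside IPython, something like the following sketch with the standard timeit module should work. The best_time helper and its number/repeat values are my own choices for illustration, not part of the session above, and the absolute timings will differ from machine to machine.

```python
import timeit
from collections import defaultdict

import numpy as np

alist = [
    [['s1', 1, 2], ['s2', 3, 4], ['s3', 5, 6]],
    [['s1', 2, 3], ['s2', 4, 5], ['s3', 6, 7]],
    [['s1', 3, 4], ['s2', 5, 6], ['s3', 7, 8]],
]


def foo1(alist):
    # Group the numeric fields by label, then average each group with numpy.
    dd = defaultdict(list)
    for row in alist:
        for col in row:
            dd[col[0]].append(col[1:])
    return [[k, np.mean(v, 0)] for k, v in dd.items()]


def foo2(alist):
    # Average across the first axis of a 3d object array.
    arr = np.array(alist, object)
    lbl = arr[0, :, 0][:, None]
    res = arr[:, :, 1:].mean(axis=0)
    return np.concatenate((lbl, res), axis=1)


def best_time(func, data, number=3, repeat=3):
    """Hypothetical helper: best per-call time in seconds over a few repeats."""
    runs = timeit.repeat(lambda: func(data), number=number, repeat=repeat)
    return min(runs) / number


blist = alist * 10000
for name, func in (("foo1", foo1), ("foo2", foo2)):
    small = best_time(func, alist)
    large = best_time(func, blist)
    print(f"{name}: small {small:.2e} s, large {large:.2e} s")
```

Taking the minimum over repeats is a common way to reduce noise from other processes; IPython's timeit reports mean and standard deviation instead, so the numbers are not directly comparable.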
===
I modified the defaultdict function to use its own mean function:
In [188]: def foo11(alist):
     ...:     nexp = len(alist)
     ...:     def mean(v):
     ...:         return [sum(i)/nexp for i in zip(*v)]
     ...:     dd = defaultdict(list)
     ...:     for row in alist:
     ...:         for col in row:
     ...:             dd[col[0]].append(col[1:])
     ...:     return [[k, *mean(v)] for k,v in dd.items()]
     ...:
In [189]: foo11(alist)
Out[189]: [['s1', 2.0, 3.0], ['s2', 4.0, 5.0], ['s3', 6.0, 7.0]]
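The inner mean helper depends on zip(*v) transposing the per-label rows into columns. Here is a small trace of that step for the 's1' values from the example. The variable names cols, means and general are mine, used only for illustration.

```python
# Values collected for label 's1', one [a, b] pair per experiment.
v = [[1, 2], [2, 3], [3, 4]]
nexp = 3  # number of experiments, i.e. len(alist) in foo11

# zip(*v) transposes rows into columns: one tuple per numeric field.
cols = list(zip(*v))
print(cols)    # [(1, 2, 3), (2, 3, 4)]

# Average each column by dividing its sum by the number of experiments.
means = [sum(c) / nexp for c in cols]
print(means)   # [2.0, 3.0]

# Dividing by len(v) gives the same result here, and would also cope with
# a label that is missing from some experiments (an assumption beyond the
# original example, where every label appears once per experiment).
general = [sum(c) / len(v) for c in cols]
print(general == means)
```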
In [190]: timeit foo11(alist)
9.43 µs ± 13 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [191]: timeit foo11(blist)
51.9 ms ± 206 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
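This is much faster for the small example (9.43 µs versus 42.1 µs for foo2), and roughly the same speed as foo2 for the large one (51.9 ms versus 46.8 ms).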