1

I have a loop in which I get a list of lists:

for i in range(num_exp):
  li = func()

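where li is a list of lists of the form [["s1", 1, 2], ["s2", 2, 3], ["s3", 3, 4]] (the first item is a string, the other 2 items are numbers)

I want to average each numeric value of li across the loop iterations. So for num_exp = 3 and li values of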

[["s1", 1, 2], ["s2", 3, 4], ["s3", 5, 6]]
[["s1", 2, 3], ["s2", 4, 5], ["s3", 6, 7]]
[["s1", 3, 4], ["s2", 5, 6], ["s3", 7, 8]]

I would get

[["s1", 6/3, 9/3], ["s2", 12/3, 15/3], ["s3", 18/3, 21/3]]

I want to do this with numpy. In plain Python, I do the following

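dic = {}
# accumulate per-label sums (this part runs for each li returned by func())
for l in li:
    if l[0] not in dic:
        dic[l[0]] = l[1:]
    else:
        dic[l[0]][0] += l[1]
        dic[l[0]][1] += l[2]

# divide the accumulated sums by the number of experiments
fl = []
for m in dic:
    fl.append([m, dic[m][0]/num_exp, dic[m][1]/num_exp])

But it seems rather inefficient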

4

5 Answers

5

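Create an np.array from the list of lists li, specifying dtype='object', and use swapaxes to put rows with the same string s into the same group. Slice the last 2 elements along axis 2 (the right-most axis), sum over axis 1, and divide by num_exp. Finally, column_stack the result with the unique string values.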

import numpy as np

num_exp = 3
li = [[["s1", 1, 2], ["s2", 3, 4], ["s3", 5, 6]],
      [["s1", 2, 3], ["s2", 4, 5], ["s3", 6, 7]],
      [["s1", 3, 4], ["s2", 5, 6], ["s3", 7, 8]]]

arr = np.array(li, dtype='object').swapaxes(0, 1)

Out[372]:
array([[['s1', 1, 2],
        ['s1', 2, 3],
        ['s1', 3, 4]],

       [['s2', 3, 4],
        ['s2', 4, 5],
        ['s2', 5, 6]],

       [['s3', 5, 6],
        ['s3', 6, 7],
        ['s3', 7, 8]]], dtype=object)

arr1 = arr[...,[1,2]].sum(axis=1) / num_exp

Out[380]:
array([[2.0, 3.0],
       [4.0, 5.0],
       [6.0, 7.0]], dtype=object)

s = arr[:,0, 0]
result = np.column_stack([s, arr1])

Out[389]:
array([['s1', 2.0, 3.0],
       ['s2', 4.0, 5.0],
       ['s3', 6.0, 7.0]], dtype=object)
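
For reference, the three steps above can be bundled into one function. This is only a sketch: the name average_runs is made up for illustration, and it assumes every run lists the same labels in the same order, as in the example data.

```python
import numpy as np

def average_runs(runs):
    """Average the numeric columns across runs, keeping the string labels.

    Assumes (not checked here) that every run has the same labels
    in the same order.
    """
    # After the swap the shape is (num_labels, num_exp, 3)
    arr = np.array(runs, dtype=object).swapaxes(0, 1)
    num_exp = arr.shape[1]
    # Sum the numeric columns over the experiment axis, then divide
    means = arr[..., 1:].sum(axis=1) / num_exp
    # One label per group, taken from the first run
    labels = arr[:, 0, 0]
    return np.column_stack([labels, means])

li = [[["s1", 1, 2], ["s2", 3, 4], ["s3", 5, 6]],
      [["s1", 2, 3], ["s2", 4, 5], ["s3", 6, 7]],
      [["s1", 3, 4], ["s2", 5, 6], ["s3", 7, 8]]]

result = average_runs(li)
print(result.tolist())
```

Taking num_exp from the array shape rather than from a separate variable keeps the divisor in sync with the data that was actually collected.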
answered 2019-08-13T01:29:51.167
2

Here is a speed comparison of the pure Python solution with the numpy solution.

In [167]: alist                                                                                              
Out[167]: 
[[['s1', 1, 2], ['s2', 3, 4], ['s3', 5, 6]],
 [['s1', 2, 3], ['s2', 4, 5], ['s3', 6, 7]],
 [['s1', 3, 4], ['s2', 5, 6], ['s3', 7, 8]]]

Using defaultdict from collections:

In [169]: def foo1(alist): 
     ...:     dd = defaultdict(list) 
     ...:     for row in alist: 
     ...:         for col in row: 
     ...:             dd[col[0]].append(col[1:]) 
     ...:     return [[k, np.mean(v,0)] for k,v in dd.items()] 
     ...:                                                                                                    
In [170]: foo1(alist)                                                                                        
Out[170]: [['s1', array([2., 3.])], ['s2', array([4., 5.])], ['s3', array([6., 7.])]]

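This list is not perfect, but it is good enough for testing purposes. It is also not really pure Python, since I use np.mean for each key.

A numpy solution using a 3d object dtype array (preserving the strings):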

In [171]: def foo2(alist): 
     ...:     arr = np.array(alist, object) 
     ...:     lbl = arr[0,:,0][:,None]  
     ...:     res = arr[:,:,1:].mean(axis=0) 
     ...:     return np.concatenate((lbl,res),axis=1) 
     ...:                                                                                                    
In [172]: foo2(alist)                                                                                        
Out[172]: 
array([['s1', 2.0, 3.0],
       ['s2', 4.0, 5.0],
       ['s3', 6.0, 7.0]], dtype=object)

Some timings:

In [173]: timeit foo1(alist)                                                                                 
98.2 µs ± 256 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [174]: timeit foo2(alist)                                                                                 
42.1 µs ± 1.02 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

For a large list:

In [175]: blist=alist*10000                                                                                  
In [176]: timeit foo1(blist)                                                                                 
71.9 ms ± 350 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [177]: timeit foo2(blist)                                                                                 
46.8 ms ± 489 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

So numpy is roughly 2x faster on the small list and about 1.5x faster on the large one. Nice, but not a make-or-break advantage.

===

I modified the defaultdict function to use its own mean function:

In [188]: def foo11(alist): 
     ...:     nexp = len(alist) 
     ...:     def mean(v): 
     ...:        return [sum(i)/nexp for i in zip(*v)] 
     ...:     dd = defaultdict(list) 
     ...:     for row in alist: 
     ...:         for col in row: 
     ...:             dd[col[0]].append(col[1:]) 
     ...:     return [[k, *mean(v)] for k,v in dd.items()] 
     ...:      
     ...:                                                                                                    
In [189]: foo11(alist)                                                                                       
Out[189]: [['s1', 2.0, 3.0], ['s2', 4.0, 5.0], ['s3', 6.0, 7.0]]

In [190]: timeit foo11(alist)                                                                                
9.43 µs ± 13 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [191]: timeit foo11(blist)                                                                                
51.9 ms ± 206 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

This is much faster for the small example, and about the same speed as foo2 for the large example.
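
The timeit lines above are IPython magics. Outside IPython, the same comparison could be sketched with the standard timeit module, roughly as follows. The functions are restated from above so the script is self-contained; the repeat count of 1000 is an arbitrary choice, and the timings will differ from machine to machine.

```python
import timeit
from collections import defaultdict

import numpy as np

def foo11(alist):
    # Pure Python: group rows by label, then average column-wise
    nexp = len(alist)
    dd = defaultdict(list)
    for row in alist:
        for col in row:
            dd[col[0]].append(col[1:])
    return [[k, *(sum(i) / nexp for i in zip(*v))] for k, v in dd.items()]

def foo2(alist):
    # numpy: 3d object array, mean over the experiment axis
    arr = np.array(alist, object)
    lbl = arr[0, :, 0][:, None]
    res = arr[:, :, 1:].mean(axis=0)
    return np.concatenate((lbl, res), axis=1)

alist = [[['s1', 1, 2], ['s2', 3, 4], ['s3', 5, 6]],
         [['s1', 2, 3], ['s2', 4, 5], ['s3', 6, 7]],
         [['s1', 3, 4], ['s2', 5, 6], ['s3', 7, 8]]]

# Check that both approaches agree before comparing their speed
same = foo2(alist).tolist() == foo11(alist)

for f in (foo11, foo2):
    seconds = timeit.timeit(lambda: f(alist), number=1000)
    print(f"{f.__name__}: {seconds / 1000 * 1e6:.1f} µs per call")
```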

answered 2019-08-13T02:15:07.350
1

Strings are a nuisance when you are trying to do calculations, so strip them out, do the calculation, and then put them back.

data = []
for i in range(num_exp):
    li = func()
    # Goodbye strings
    data.append([elm[1:] for elm in li])

averages = np.mean(data, axis=0)
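
The snippet above stops before the "put them back" step. One possible way to finish it is sketched below, assuming every call returns the labels in the same order. The runs list is a stand-in for the successive func() results, not part of the original code.

```python
import numpy as np

# Stand-in for the values func() would return on each iteration
runs = [[["s1", 1, 2], ["s2", 3, 4], ["s3", 5, 6]],
        [["s1", 2, 3], ["s2", 4, 5], ["s3", 6, 7]],
        [["s1", 3, 4], ["s2", 5, 6], ["s3", 7, 8]]]

# Strip the strings, keeping only the numeric part of every row
data = [[elm[1:] for elm in li] for li in runs]

# Remember the labels once (assumed identical in every run)
labels = [elm[0] for elm in runs[0]]

# Element-wise mean over the experiment axis, shape (num_labels, 2)
averages = np.mean(data, axis=0)

# Put the strings back in front of each averaged row
result = [[label, *row] for label, row in zip(labels, averages.tolist())]
print(result)
```

Because data contains only numbers, np.mean works on a regular float array rather than an object array.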
answered 2019-08-13T00:54:08.660
1

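If you want to do everything in one line with numpy (this requires li to be a 3D numpy array of dtype object, e.g. li = np.array(li, dtype=object)):

[np.concatenate((li[0][x][0:1], li[:,x][:,1:].astype('float').mean(axis=0))) for x in range(li.shape[1])]

However, you may find that a Pandas DataFrame offers a more practical API for handling arrays of mixed data types.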

import pandas as pd
pd.DataFrame([[1,2,3,4,5,6],[2,3,4,5,6,7],[3,4,5,6,7,8]],columns=['s1','s1','s2','s2','s3','s3']).mean()
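
As a variation on the same idea, the rows could also be kept in their original [label, number, number] form and averaged with groupby. This is only a sketch: the column names "label", "a" and "b" are invented for illustration, and runs stands in for the collected func() results.

```python
import pandas as pd

# Stand-in for the values func() would return on each iteration
runs = [[["s1", 1, 2], ["s2", 3, 4], ["s3", 5, 6]],
        [["s1", 2, 3], ["s2", 4, 5], ["s3", 6, 7]],
        [["s1", 3, 4], ["s2", 5, 6], ["s3", 7, 8]]]

# One DataFrame row per [label, value, value] entry, across all runs
rows = [row for li in runs for row in li]
df = pd.DataFrame(rows, columns=["label", "a", "b"])

# Average the numeric columns per label, keeping first-seen label order
means = df.groupby("label", sort=False).mean()

# Convert back to the list-of-lists layout used in the question
result = [[label, *vals]
          for label, vals in zip(means.index, means.to_numpy().tolist())]
print(result)
```

Grouping by label does not depend on every run listing the labels in the same order.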

answered 2019-08-13T02:00:31.020
0

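Assuming you already have the function func() and num_exp = 3, you should first run func in a for loop as many times as needed and accumulate the results by string key. Since we already know how many times func will run, we can divide each returned value by that number. I assume li has your key-and-numbers structure afterwards.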

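import numpy as np

result_dict = dict()
for i in range(num_exp):
    li = func(i)

    for l in li:
        sums = result_dict.get(l[0], np.zeros(len(l) - 1))
        # l[1:] is a plain list, so convert it to an array before dividing
        result_dict[l[0]] = (np.array(l[1:]) / num_exp) + sums

result_dict then looks like this (the values are numpy arrays): {'s1': [2.0, 3.0], 's2': [4.0, 5.0], 's3': [6.0, 7.0]}

Now we just need to convert the dictionary into the structure you want, and we are done.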

result = [[key, *arr] for (key, arr) in result_dict.items()]

This produces result as [['s1', 2.0, 3.0], ['s2', 4.0, 5.0], ['s3', 6.0, 7.0]]
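
To try this approach end to end without the real func, a stub can be swapped in. In the sketch below, func is a hypothetical stand-in that replays the example data from the question, and it takes no argument.

```python
import numpy as np

# Hypothetical stand-in for func(): replays the example data, one run per call
_runs = iter([[["s1", 1, 2], ["s2", 3, 4], ["s3", 5, 6]],
              [["s1", 2, 3], ["s2", 4, 5], ["s3", 6, 7]],
              [["s1", 3, 4], ["s2", 5, 6], ["s3", 7, 8]]])

def func():
    return next(_runs)

num_exp = 3
result_dict = {}
for _ in range(num_exp):
    for label, *values in func():
        # Running total of value / num_exp for each label
        sums = result_dict.get(label, np.zeros(len(values)))
        result_dict[label] = np.array(values) / num_exp + sums

# Convert the dictionary into the [label, mean, mean] layout
result = [[key, *arr.tolist()] for key, arr in result_dict.items()]
print(result)
```

Dividing each term before adding can leave tiny floating-point differences compared with summing first and dividing once, so the printed values may be only approximately 2.0, 3.0 and so on.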

answered 2019-08-13T00:17:05.837