python - How to aggregate a NumPy record array (sum, min, max, etc.)?

Question

Consider a simple record array structure:

import numpy as np
ijv_dtype = [
    ('I', 'i'),
    ('J', 'i'),
    ('v', 'd'),
]
ijv = np.array([
    (0, 0, 3.3),
    (0, 1, 1.1),
    (0, 1, 4.4),
    (1, 1, 2.2),
    ], ijv_dtype)
print(ijv)  # [(0, 0, 3.3) (0, 1, 1.1) (0, 1, 4.4) (1, 1, 2.2)]

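I would like to aggregate certain statistics (sum, min, max, etc.) of v by grouping on unique combinations of I and J. Thinking in SQL terms, the expected result is: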

select i, j, sum(v) as v from ijv group by i, j;
 i | j |  v
---+---+-----
 0 | 0 | 3.3
 0 | 1 | 5.5
 1 | 1 | 2.2

(Order is not important.)

The best I can come up with in NumPy is ugly, and I'm not sure I've ordered the result correctly (although it seems to work here):

# Get unique groups, index and inverse
u_ij, idx_ij, inv_ij = np.unique(ijv[['I', 'J']], return_index=True, return_inverse=True)
# Assemble aggregate
a_ijv = np.zeros(len(u_ij), ijv_dtype)
a_ijv['I'] = u_ij['I']
a_ijv['J'] = u_ij['J']
a_ijv['v'] = [ijv['v'][inv_ij == i].sum() for i in range(len(u_ij))]
print(a_ijv)  # [(0, 0, 3.3) (0, 1, 5.5) (1, 1, 2.2)]

I'd like to think there is a better way to do this! I'm using NumPy 1.4.1.

score 1 · Accepted Answer

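NumPy is a bit too low-level for a task like this. If you must use pure NumPy, I think your solution is fine, but if you don't mind using something with a higher level of abstraction, try pandas: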

import pandas as pd

df = pd.DataFrame({
    'I': (0, 0, 0, 1),
    'J': (0, 1, 1, 1),
    'v': (3.3, 1.1, 4.4, 2.2)})

print(df)
print(df.groupby(['I', 'J']).sum())

Output:

   I  J    v
0  0  0  3.3
1  0  1  1.1
2  0  1  4.4
3  1  1  2.2
       v
I J     
0 0  3.3
  1  5.5
1 1  2.2
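Since the question asks about several statistics (sum, min, max), here is a minimal sketch of computing them together with the same groupby. The variable name stats is only illustrative, and the reset_index() call is an optional step to get a flat layout closer to the SQL result.

```python
import pandas as pd

df = pd.DataFrame({
    'I': (0, 0, 0, 1),
    'J': (0, 1, 1, 1),
    'v': (3.3, 1.1, 4.4, 2.2)})

# Compute several statistics of v for each unique (I, J) pair at once.
stats = df.groupby(['I', 'J'])['v'].agg(['sum', 'min', 'max'])

# reset_index() turns the (I, J) index levels back into ordinary columns,
# giving one row per group with columns I, J, sum, min, max.
print(stats.reset_index())
```

If the result is needed back in NumPy, stats.reset_index().to_records(index=False) should give a record array again, assuming a reasonably recent pandas.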

score 1 · Accepted Answer

This isn't a huge improvement over what you already have, but it at least gets rid of the for loop.

# Starting with your original setup

# Get the unique ij values and the mapping from ungrouped to grouped.
u_ij, inv_ij = np.unique(ijv[['I', 'J']], return_inverse=True)

# Create a float totals array with one slot per unique (I, J) group.
# You could do the fancy ijv_dtype thing if you wanted.
totals = np.zeros(len(u_ij))

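# Here's the magic bit. You can think of it as
# totals[inv_ij] += ijv["v"]
# except that the line above applies only one update per repeated index,
# so groups with several rows would not be summed.
# np.add.at performs unbuffered accumulation (requires NumPy 1.8 or later).
np.add.at(totals, inv_ij, ijv["v"])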

print(totals)
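The same unbuffered pattern can probably be extended to the other statistics in the question, since every binary ufunc has an .at method. The following is a self-contained sketch, not a definitive implementation: the names sums, mins and maxs are illustrative, and it assumes NumPy 1.8 or later (for ufunc.at and np.full), which is newer than the 1.4.1 mentioned in the question.

```python
import numpy as np

ijv = np.array(
    [(0, 0, 3.3), (0, 1, 1.1), (0, 1, 4.4), (1, 1, 2.2)],
    dtype=[('I', 'i'), ('J', 'i'), ('v', 'd')])

# u_ij holds the sorted unique (I, J) pairs; inv_ij maps every row of ijv
# to the position of its group in u_ij.
u_ij, inv_ij = np.unique(ijv[['I', 'J']], return_inverse=True)

n = len(u_ij)
sums = np.zeros(n)
mins = np.full(n, np.inf)    # identity element for minimum
maxs = np.full(n, -np.inf)   # identity element for maximum

# Unbuffered in-place reductions: rows sharing a group index all contribute.
np.add.at(sums, inv_ij, ijv['v'])
np.minimum.at(mins, inv_ij, ijv['v'])
np.maximum.at(maxs, inv_ij, ijv['v'])

for key, s, lo, hi in zip(u_ij, sums, mins, maxs):
    print(key['I'], key['J'], s, lo, hi)
```

Because np.unique returns the unique keys in sorted order and inv_ij indexes into that same array, position k of each result array corresponds to u_ij[k], which addresses the ordering concern in the question.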

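The fact that you are using NumPy's multi-field dtype suggests you should be using pandas. It generally leads to less awkward code when you are trying to keep the i's, j's and v's together.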
