python - Inclusion test for groups of a structured array in NumPy

Question

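Given a structured ndarray of tuples and a list of reference data, I'm looking for an efficient way to produce an ndarray of membership masks (numpy.isin), grouped by the first element of each tuple. See the example below.

initial_list is an ndarray loaded with np.loadtxt: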

initial_list = np.loadtxt("data.txt",dtype={'names': ("item", "value"),'formats': ['U13', 'i8']},delimiter='    ', skiprows=1)
# initial_list = [(x,2) (x,51) (x,3) (y,11) (x,5) (z,44) (y,3) (z,2)]

reference_data = [2,3,5,11,44,51,70]

Expected output:

[[1,1,1,0,0,1,0]  #x
 [0,1,0,1,0,0,0]  #y
 [1,0,0,0,1,0,0]] #z

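I know I can achieve this with plain Python iteration. Is there an efficient way to do it with built-in NumPy? Something similar to the groupby functionality of a pandas DataFrame. My goal is a future Jaccard index calculation.

Python iteration approach: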

item_dict = {}
result = []

for item in initial_list:
    if item[0] not in item_dict:
        item_dict[item[0]] = [item[1]]
    else:
        item_dict[item[0]].append(item[1])
        item_dict[item[0]] = sorted(item_dict[item[0]])
print(item_dict) #{'x': [2, 3, 5, 51], 'y': [3, 11], 'z': [2, 44]}

for item in item_dict.keys():
    result.append([1 if x in item_dict[item]  else 0 for x in reference_data])
for row in result:
    print(row)

#result=
#[[1, 1, 1, 0, 0, 1, 0],
#[0, 1, 0, 1, 0, 0, 0],
#[1, 0, 0, 0, 1, 0, 0]]

Thanks a lot in advance.

score 0 · Accepted Answer

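NumPy does not currently provide groupby functionality (see this GitHub issue). As you already know, you could use pandas instead, which makes this kind of grouping operation easier. If you are interested in a native NumPy solution, I suggest the following approach: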

import numpy as np

a = np.array([('x', 2), ('x', 51), ('x', 3), ('y', 11), 
              ('x', 5), ('z', 44), ('y', 3), ('z', 2)],
             dtype=[('item', 'U13'), ('value', 'i8')])
reference_data = np.array([2, 3, 5, 11, 44, 51, 70])
group_keys = np.unique(a['item'])  # array(['x', 'y', 'z'], dtype='<U13')
result = []
for key in group_keys:
    values = a[a['item'] == key]['value']
    result.append(np.isin(reference_data, values).astype(int))  # np.int was removed in NumPy 1.24; use the builtin int
print(np.stack(result))
# array([[1, 1, 1, 0, 0, 1, 0],
#        [0, 1, 0, 1, 0, 0, 0],
#        [1, 0, 0, 0, 1, 0, 0]])

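Here I iterate over the unique keys (items), use boolean indexing to select the corresponding group of values, and then check which values of reference_data are in that group. As a last step, I combine the per-group rows into a single 2-D array with np.stack.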
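
If the number of groups is large, the Python loop over keys can be avoided. The following is only a sketch of one possible loop-free variant, not part of the approach above. It assumes that reference_data is sorted and contains no duplicates (true for the example data). The names group_idx, col_idx and found are introduced here for illustration.

```python
import numpy as np

a = np.array([('x', 2), ('x', 51), ('x', 3), ('y', 11),
              ('x', 5), ('z', 44), ('y', 3), ('z', 2)],
             dtype=[('item', 'U13'), ('value', 'i8')])
# Assumption: reference_data is sorted and has no duplicates.
reference_data = np.array([2, 3, 5, 11, 44, 51, 70])

# Row index of each record: position of its key among the sorted unique keys.
group_keys, group_idx = np.unique(a['item'], return_inverse=True)

# Candidate column index of each record: where its value would sit in reference_data.
col_idx = np.searchsorted(reference_data, a['value'])
# Clip so that values larger than every reference entry do not index out of bounds.
col_idx = np.minimum(col_idx, len(reference_data) - 1)
# Keep only records whose value actually occurs in reference_data.
found = reference_data[col_idx] == a['value']

# Set a 1 at (group, reference position) for every matching record.
result = np.zeros((len(group_keys), len(reference_data)), dtype=int)
result[group_idx[found], col_idx[found]] = 1
print(result)
```

The rows of result follow the order of group_keys, as in the loop version. Values that do not appear in reference_data are dropped by the found mask.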
