python - Efficient GROUP BY query on a numpy recarray

Question

I have a dataset of product purchase logs with 6 columns: purchase_date, user_address, user_id, product_id, brand_id, retailer_id. All of them contain integers, except user_address, which is a string.

I need to get the top 5 brands that sold the most items over the whole dataset, i.e. the brands with the most entries in the data.

In SQL, I believe it would look like the following (correct me if I am wrong):

SELECT brand_id, COUNT(*)
FROM data
GROUP BY brand_id

I tried doing this in python with a numpy recarray, as follows:

items_sold_per_brand = np.empty(len(data), dtype=[('brand_id', 'int'), ('count', 'int')])

brands = np.unique(data['brand_id'])    # array of unique brands
for i, brand in enumerate(np.nditer(brands)):     # For any unique brand
    items_sold_per_brand[i] = (brand, len(data[data['brand_id'] == brand]))    # get the number of rows with the current brand

top5 = np.sort(items_sold_per_brand, order='count')[-5:]    # sort the array over the count values
print(top5[::-1])    # print the last five entries

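It works, except that it takes about 15 seconds to run on a dataset of about 100000 rows with roughly 12000 different brands, which seems too long. The for loop is what takes the most time.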

Is there a more elegant and efficient way to do this using numpy's recarray querying methods?

Thanks for your help!

score 0 · Accepted Answer

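While the nominated duplicate, "numpy: most efficient frequency counts for unique values in an array", is relevant, I think it ignores some important issues in this code. bincount, the accepted answer there, probably isn't useful here. And you may need some help in applying the newer answer, which uses unique with return_counts.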

A test script:

import numpy as np
# simple data array
data=np.zeros((20,),dtype=[('brand_id',int),('id',int)])
data['brand_id']=np.random.randint(0,10,20)
data['id']=np.arange(20)

items_sold_per_brand = np.empty(len(data), dtype=[('brand_id', 'int'), ('count', 'int')])
brands = np.unique(data['brand_id']) 
print('brands',brands)
for i, brand in enumerate(np.nditer(brands)):
    items_sold_per_brand[i] = (brand, len(data[data['brand_id'] == brand]))
top5 = np.sort(items_sold_per_brand, order='count')[-5:]    
print('top5',top5[::-1]) 

# a bit of simplification
brandids = data['brand_id']
brands = np.unique(brandids)
# only need space for the unique ids 
items_sold_per_brand = np.zeros(len(brands), dtype=[('brand_id', 'int'), ('count', 'int')])
items_sold_per_brand['brand_id'] = brands
for i, brand in enumerate(brands):  # dont need nditer
    items_sold_per_brand['count'][i] = (brandids == brand).sum()
top5 = np.sort(items_sold_per_brand, order='count')[-5:]    
print('top5',top5[::-1])    

brands,counts = np.unique(data['brand_id'],return_counts=True)
print('counts',counts)

items_sold_per_brand = np.empty(len(brands), dtype=[('brand_id', 'int'), ('count', 'int')])
items_sold_per_brand['brand_id']=brands
items_sold_per_brand['count']=counts
tops = np.sort(items_sold_per_brand, order='count')[::-1]
print('tops',tops)

ii = np.bincount(data['brand_id'])
print('bin',ii)

produces

1030:~/mypy$ python3 stack38091849.py 
brands [0 2 3 4 5 6 7 9]
top5 [(99072, 1694566490) (681217, 1510016618) (1694566234, 1180958979)
 (147063168, 147007976) (-1225886932, 139383040)]
top5 [(7, 4) (2, 4) (0, 3) (9, 2) (6, 2)]
counts [3 4 2 1 2 2 4 2]
tops [(7, 4) (2, 4) (0, 3) (9, 2) (6, 2) (5, 2) (3, 2) (4, 1)]
bin [3 0 4 2 1 2 2 4 0 2]

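Initializing items_sold_per_brand with np.empty, at the potentially large size of data, leaves random counts in the rows that are never overwritten during the iteration (see the first top5 line in the output). Using np.zeros with the smaller size of brands takes care of that.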
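
As a rough illustration of that point, the loop can be dropped entirely by sizing the result to the unique brands and filling it from np.unique with return_counts=True. This is only a sketch: the sample data, the random seed and the helper name top_brands are assumptions made for the example, not part of the original question.

```python
import numpy as np

# Hypothetical sample data: 1000 rows with brand ids in [0, 50)
rng = np.random.default_rng(0)
data = np.zeros(1000, dtype=[('brand_id', int), ('id', int)])
data['brand_id'] = rng.integers(0, 50, size=len(data))
data['id'] = np.arange(len(data))

def top_brands(data, n=5):
    """Return the n most frequent brand_ids with their counts, largest first."""
    brands, counts = np.unique(data['brand_id'], return_counts=True)
    # One row per unique brand, so every row is filled and np.empty is safe here
    result = np.empty(len(brands), dtype=[('brand_id', int), ('count', int)])
    result['brand_id'] = brands
    result['count'] = counts
    # Sort by count only, then reverse to get the largest counts first
    order = np.argsort(counts, kind='stable')[::-1]
    return result[order[:n]]

top5 = top_brands(data)
print(top5)
```

Sorting the counts with argsort rather than np.sort(..., order='count') keeps the sort on a plain integer array; which brand comes first among equal counts depends on the sort order chosen.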

nditer is not needed for a simple iteration like this.

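bincount is fast, but it creates a bin for every potential value in the range of the data, so there may be bins with a count of 0 (brands 1 and 8 in the output above).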
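
If bincount is used anyway, the empty bins can be filtered out before ranking. The following is a minimal sketch under the assumption that the brand ids are small non-negative integers (np.bincount requires non-negative input and allocates max + 1 bins); the brand_ids array is made up for the example.

```python
import numpy as np

# Hypothetical brand ids; must be non-negative integers for np.bincount
brand_ids = np.array([7, 2, 0, 2, 7, 9, 7, 2, 0, 6])

counts = np.bincount(brand_ids)      # one bin per integer 0..max(brand_ids)
present = np.nonzero(counts)[0]      # brand ids that actually occur
# Rank only the non-empty bins, largest count first
order = present[np.argsort(counts[present], kind='stable')[::-1]]
top = [(int(b), int(counts[b])) for b in order[:5]]
print(top)
```

With ids that are large or sparse, most bins would be empty, so np.unique with return_counts=True is probably the better fit in that case.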
