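While the nominated duplicate, "numpy: most efficient frequency counts for unique values in an array", is relevant, I think it ignores some important issues in this code. The accepted answer there, np.bincount, is probably not useful here, and many of you will need some help applying the newer np.unique answer with return_counts.
A test script: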
import numpy as np
# simple data array
data=np.zeros((20,),dtype=[('brand_id',int),('id',int)])
data['brand_id']=np.random.randint(0,10,20)
data['id']=np.arange(20)
items_sold_per_brand = np.empty(len(data), dtype=[('brand_id', 'int'), ('count', 'int')])
brands = np.unique(data['brand_id'])
print('brands',brands)
for i, brand in enumerate(np.nditer(brands)):
    items_sold_per_brand[i] = (brand, len(data[data['brand_id'] == brand]))
top5 = np.sort(items_sold_per_brand, order='count')[-5:]
print('top5',top5[::-1])
# a bit of simplification
brandids = data['brand_id']
brands = np.unique(brandids)
# only need space for the unique ids
items_sold_per_brand = np.zeros(len(brands), dtype=[('brand_id', 'int'), ('count', 'int')])
items_sold_per_brand['brand_id'] = brands
for i, brand in enumerate(brands):  # don't need nditer
    items_sold_per_brand['count'][i] = (brandids == brand).sum()
top5 = np.sort(items_sold_per_brand, order='count')[-5:]
print('top5',top5[::-1])
# let np.unique return the counts directly
brands,counts = np.unique(data['brand_id'],return_counts=True)
print('counts',counts)
items_sold_per_brand = np.empty(len(brands), dtype=[('brand_id', 'int'), ('count', 'int')])
items_sold_per_brand['brand_id']=brands
items_sold_per_brand['count']=counts
tops = np.sort(items_sold_per_brand, order='count')[::-1]
print('tops',tops)
# bincount for comparison
ii = np.bincount(data['brand_id'])
print('bin',ii)
producing:
1030:~/mypy$ python3 stack38091849.py
brands [0 2 3 4 5 6 7 9]
top5 [(99072, 1694566490) (681217, 1510016618) (1694566234, 1180958979)
(147063168, 147007976) (-1225886932, 139383040)]
top5 [(7, 4) (2, 4) (0, 3) (9, 2) (6, 2)]
counts [3 4 2 1 2 2 4 2]
tops [(7, 4) (2, 4) (0, 3) (9, 2) (6, 2) (5, 2) (3, 2) (4, 1)]
bin [3 0 4 2 1 2 2 4 0 2]
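Initializing items_sold_per_brand with np.empty at the size of data potentially leaves random counts that are never overwritten during the iteration (the first top5 line in the output above). Using np.zeros with the smaller size of brands takes care of that.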
np.nditer is not needed for a simple iteration like this.
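As a small illustration of that point, the sketch below iterates over a 1-d array both ways and collects the values. The sample array is made up for the example; it is not the data from the script above.

```python
import numpy as np

# hypothetical sample of unique brand ids
brands = np.array([0, 2, 3, 7])

# np.nditer yields 0-d arrays, one per element
via_nditer = [int(b) for b in np.nditer(brands)]

# plain iteration over a 1-d array yields the same values as numpy scalars
direct = [int(b) for b in brands]

print(via_nditer == direct)
```

For a 1-d array like this, plain iteration visits the same elements in the same order, so the extra iterator adds nothing.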
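np.bincount is fast, but it creates a bin for every potential value in the range of data['brand_id'], so some bins may have a count of 0 (bins 1 and 8 in the output above).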
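If bincount is still wanted, one possible way to get rid of the empty bins is to keep only the nonzero ones. This is a minimal sketch with made-up sample ids, and it assumes the ids are non-negative integers, which np.bincount requires. The names present and bin_counts are introduced here for illustration only.

```python
import numpy as np

# hypothetical sample ids; ids 1, 3-6 and 8 never occur
brand_ids = np.array([0, 2, 2, 7, 7, 7, 9])

# one bin per integer from 0 to brand_ids.max(), including unused ids
bins = np.bincount(brand_ids)

# indices of the occupied bins are the ids that actually occur
present = np.nonzero(bins)[0]
bin_counts = bins[present]

# np.unique with return_counts gives the same pairs directly
brands, counts = np.unique(brand_ids, return_counts=True)

print(present, bin_counts)
print(brands, counts)
```

After filtering, the two approaches should agree. With a small, dense range of ids the extra bins cost little, but with sparse or large ids the bincount array grows with the largest id rather than with the number of distinct ids.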