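I am creating a pivot table with pandas. My data typically consists of many numeric values that can easily be aggregated with np.mean (e.g. question1), with one exception: the Net Promoter Score (NPS). Note the Total of 0.00 for both EU and NA in the pivot table below.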
   responseId  country region  nps  question1
0           1  Germany     EU   11        3.2
1           2  Germany     EU   10        5.0
2           3       US     NA    7        4.3
3           4       US     NA    5        4.8
4           5   France     EU    5        3.2
5           6   France     EU    5        5.0
6           7   France     EU   11        5.0

region        EU                      NA
country   France Germany Total      US   Total
nps       -33.33   100.0  0.00 -100.00    0.00
question1   4.40     4.1  4.25    4.55    4.55
For the NPS I use a custom aggfunc:
import numpy as np
import pandas as pd

def calculate_nps(column):
    detractors = [1, 2, 3, 4, 5, 6, 7]
    passives = [8, 9]  # not used in the formula, listed for completeness
    promoters = [10, 11]
    # Share of each score among the responses in this column
    counts = column.value_counts(normalize=True)
    percent_promoters = counts.reindex(promoters).sum()
    percent_detractors = counts.reindex(detractors).sum()
    return (percent_promoters - percent_detractors) * 100

aggfunc = {
    "nps": calculate_nps,
    "question1": np.mean
}

pd.pivot_table(
    data=df,
    columns=["region", "country"],
    values=["nps", "question1"],
    aggfunc=aggfunc,
    margins=True,
    margins_name="Total",
    sort=True,
)
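The per-country NPS values can be checked outside the pivot table. The following is a minimal sketch that rebuilds the sample frame from the table above and applies the same function per country. The groupby call is only a stand-in chosen for illustration; it is not meant to describe what pivot_table does internally.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "responseId": [1, 2, 3, 4, 5, 6, 7],
    "country": ["Germany", "Germany", "US", "US", "France", "France", "France"],
    "region": ["EU", "EU", "NA", "NA", "EU", "EU", "EU"],
    "nps": [11, 10, 7, 5, 5, 5, 11],
    "question1": [3.2, 5.0, 4.3, 4.8, 3.2, 5.0, 5.0],
})

def calculate_nps(column):
    detractors = [1, 2, 3, 4, 5, 6, 7]
    promoters = [10, 11]
    counts = column.value_counts(normalize=True)
    percent_promoters = counts.reindex(promoters).sum()
    percent_detractors = counts.reindex(detractors).sum()
    return (percent_promoters - percent_detractors) * 100

# Each group is a Series of raw integer scores, so the function behaves as intended.
per_country = df.groupby(["region", "country"])["nps"].agg(calculate_nps)
print(per_country.round(2))
```

This should give roughly -33.33 for France, 100.0 for Germany and -100.0 for US, matching the non-margin columns of the pivot table.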
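This aggfunc works fine for the regular columns but fails for the margins (the "Total" columns), because pandas passes it data that has already been aggregated. For regular fields, calculate_nps receives a column like this: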
4     5
5     5
6    11
Name: nps, dtype: int64
For the margins, however, the data looks like this:
region  country
EU      France     -33.333333
        Germany    100.000000
Name: nps, dtype: float64
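What happens with input of this shape can be reproduced in isolation. This is a small sketch that assumes the margin input is exactly the Series printed above; the variable names idx and aggregated are made up for the example.

```python
import pandas as pd

def calculate_nps(column):
    detractors = [1, 2, 3, 4, 5, 6, 7]
    promoters = [10, 11]
    counts = column.value_counts(normalize=True)
    percent_promoters = counts.reindex(promoters).sum()
    percent_detractors = counts.reindex(detractors).sum()
    return (percent_promoters - percent_detractors) * 100

# Hypothetical reconstruction of the already-aggregated margin input.
idx = pd.MultiIndex.from_tuples(
    [("EU", "France"), ("EU", "Germany")], names=["region", "country"]
)
aggregated = pd.Series([-33.333333, 100.0], index=idx, name="nps")

# The values are per-country scores, not raw answers, so none of them
# matches a promoter or detractor label. reindex() then yields only NaN,
# and sum() skips NaN, giving 0 for both shares.
result = calculate_nps(aggregated)
print(result)  # 0.0
```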
calculate_nps cannot handle data like this and returns 0. In this case column.mean() should be applied instead. I worked around it as follows (note the check if column.index.names != [None]):
def calculate_nps(column):
    # A named index means pandas passed already-aggregated margin data
    if column.index.names != [None]:
        return column.mean()
    detractors = [1, 2, 3, 4, 5, 6, 7]
    passives = [8, 9]  # not used in the formula, listed for completeness
    promoters = [10, 11]
    counts = column.value_counts(normalize=True)
    percent_promoters = counts.reindex(promoters).sum()
    percent_detractors = counts.reindex(detractors).sum()
    return (percent_promoters - percent_detractors) * 100
Now the pivot table is correct:
region        EU                      NA
country   France Germany Total      US    Total
nps       -33.33   100.0 33.33 -100.00  -100.00
question1   4.40     4.1  4.25    4.55     4.55
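The guarded function can also be exercised directly on both kinds of input. This is a sketch under the assumption that the two Series below mirror what pivot_table passes in, as printed earlier; raw and aggregated are illustrative names.

```python
import pandas as pd

def calculate_nps(column):
    # Named index: treat the input as already-aggregated margin data.
    if column.index.names != [None]:
        return column.mean()
    detractors = [1, 2, 3, 4, 5, 6, 7]
    promoters = [10, 11]
    counts = column.value_counts(normalize=True)
    percent_promoters = counts.reindex(promoters).sum()
    percent_detractors = counts.reindex(detractors).sum()
    return (percent_promoters - percent_detractors) * 100

# Raw scores for France, with the default unnamed index.
raw = pd.Series([5, 5, 11], index=[4, 5, 6], name="nps")

# Already-aggregated values for the EU margin, with a named MultiIndex.
idx = pd.MultiIndex.from_tuples(
    [("EU", "France"), ("EU", "Germany")], names=["region", "country"]
)
aggregated = pd.Series([-33.333333, 100.0], index=idx, name="nps")

raw_result = calculate_nps(raw)            # NPS branch, about -33.33
margin_result = calculate_nps(aggregated)  # mean branch, about 33.33
print(round(raw_result, 2), round(margin_result, 2))
```

Note that the EU Total of 33.33 is the mean of the two country scores, not the NPS of all EU responses pooled together.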
Question
Is there a proper or better way to determine what kind of data is being passed to the aggfunc? I am not sure my solution works in all scenarios.
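One scenario that seems worth testing is a frame whose own index has a name, for example after set_index("responseId"). This is an assumption about a possible setup, not something observed in the pivot table above. The sketch below only shows how the index-name test reacts to such a Series; looks_aggregated is a hypothetical helper that wraps the same comparison.

```python
import pandas as pd

def looks_aggregated(column):
    # Hypothetical helper: the same test used as the guard in calculate_nps.
    return column.index.names != [None]

# Raw scores with the default, unnamed index.
unnamed = pd.Series([5, 5, 11], index=[4, 5, 6], name="nps")

# The same raw scores, but the index carries a name.
named = pd.Series(
    [5, 5, 11], index=pd.Index([5, 6, 7], name="responseId"), name="nps"
)

print(looks_aggregated(unnamed))  # False
print(looks_aggregated(named))    # True, although these are raw scores
```

If raw groups reach the aggfunc with a named index like this, the guard would take the mean branch for them as well, so the check depends on the source frame keeping an unnamed index.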