python - pandas - Pivot tables with non-numeric values? (DataError: No numeric types to aggregate)

Question

I am trying to pivot a table that contains strings as results.

import pandas as pd

df1 = pd.DataFrame({'index': range(8),
                    'variable1': ["A", "A", "B", "B", "A", "B", "B", "A"],
                    'variable2': ["a", "b", "a", "b", "a", "b", "a", "b"],
                    'variable3': ["x", "x", "x", "y", "y", "y", "x", "y"],
                    'result': ["on", "off", "off", "on", "on", "off", "off", "on"]})

df1.pivot_table(values='result',rows='index',cols=['variable1','variable2','variable3'])

But I get: DataError: No numeric types to aggregate.

When I change the result values to numbers, this works as expected:

df2 = pd.DataFrame({'index': range(8),
                    'variable1': ["A", "A", "B", "B", "A", "B", "B", "A"],
                    'variable2': ["a", "b", "a", "b", "a", "b", "a", "b"],
                    'variable3': ["x", "x", "x", "y", "y", "y", "x", "y"],
                    'result': [1, 0, 0, 1, 1, 0, 0, 1]})

df2.pivot_table(values='result',rows='index',cols=['variable1','variable2','variable3'])

And I get what I need:

variable1   A               B    
variable2   a       b       a   b
variable3   x   y   x   y   x   y
index                            
0           1 NaN NaN NaN NaN NaN
1         NaN NaN   0 NaN NaN NaN
2         NaN NaN NaN NaN   0 NaN
3         NaN NaN NaN NaN NaN   1
4         NaN   1 NaN NaN NaN NaN
5         NaN NaN NaN NaN NaN   0
6         NaN NaN NaN NaN   0 NaN
7         NaN NaN NaN   1 NaN NaN

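I know I could map the strings to numeric values and then reverse the operation afterwards, but maybe there is a more elegant solution?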

score 26 · Accepted Answer

My original reply was based on pandas 0.14.1, and since then many things have changed in the pivot_table function (rows --> index, cols --> columns...).

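Additionally, the original lambda trick I posted no longer seems to work on pandas 0.18. You have to provide a reducing function (even if it is min, max or mean). But even that seemed improper, because we are not reducing the data set, just transforming it... so I looked harder at unstack...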
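For comparison, the reducing-function route can still be made to work on strings. The following is only a sketch, assuming a reasonably recent pandas where the keywords are `index` and `columns`; the choice of `aggfunc='first'` is my own illustration and not part of the original answer. Because each (index, variable1, variable2, variable3) combination occurs only once here, taking the first value does not discard anything.

```python
import pandas as pd

df1 = pd.DataFrame({
    'index': range(8),
    'variable1': ["A", "A", "B", "B", "A", "B", "B", "A"],
    'variable2': ["a", "b", "a", "b", "a", "b", "a", "b"],
    'variable3': ["x", "x", "x", "y", "y", "y", "x", "y"],
    'result': ["on", "off", "off", "on", "on", "off", "off", "on"],
})

# 'first' is a reducing function that is defined for strings, unlike the
# default 'mean', which is what triggers the "No numeric types" error.
pivoted = df1.pivot_table(
    values='result',
    index='index',
    columns=['variable1', 'variable2', 'variable3'],
    aggfunc='first',
)
print(pivoted)
```

If a combination could occur more than once, `first` would silently keep one value per cell, so this is only safe when the keys are known to be unique.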

import pandas as pd

df1 = pd.DataFrame({'index': range(8),
                    'variable1': ["A", "A", "B", "B", "A", "B", "B", "A"],
                    'variable2': ["a", "b", "a", "b", "a", "b", "a", "b"],
                    'variable3': ["x", "x", "x", "y", "y", "y", "x", "y"],
                    'result': ["on", "off", "off", "on", "on", "off", "off", "on"]})

# these are the columns to end up in the multi-index columns.
unstack_cols = ['variable1', 'variable2', 'variable3']

First, set an index on the data using the index column plus the columns you want to unstack, then call unstack with the level argument.

df1.set_index(['index'] + unstack_cols).unstack(level=unstack_cols)

The resulting DataFrame is shown below.
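To reproduce the result yourself, the pieces above can be put together into one runnable script. This is a minimal sketch of the same set_index/unstack approach; the variable name `wide` is mine, and the printed layout may differ slightly between pandas versions.

```python
import pandas as pd

df1 = pd.DataFrame({
    'index': range(8),
    'variable1': ["A", "A", "B", "B", "A", "B", "B", "A"],
    'variable2': ["a", "b", "a", "b", "a", "b", "a", "b"],
    'variable3': ["x", "x", "x", "y", "y", "y", "x", "y"],
    'result': ["on", "off", "off", "on", "on", "off", "off", "on"],
})

# Columns that should end up as levels of the column MultiIndex.
unstack_cols = ['variable1', 'variable2', 'variable3']

# Move 'index' and the three variables into the row index, then pivot the
# three variable levels out into the columns. No aggregation is involved.
wide = df1.set_index(['index'] + unstack_cols).unstack(level=unstack_cols)
print(wide)
```

The columns carry 'result' as an extra outer level; selecting `wide['result']` should leave only the three variable levels. Note that unstack raises an error if the index entries are not unique, which is the trade-off for not having to pick an aggregation function.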

score 2 · Answer

I think the best compromise is to replace on/off with True/False, which lets pandas "understand" the data better and behave in a smart, expected way.

df2 = df1.replace({'on': True, 'off': False})

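You basically acknowledged this in your question. My answer is that I don't think there is a better way, and you should replace 'on'/'off' anyway for whatever comes next.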

As Andy Hayden pointed out in the comments, you will get better performance if you replace on/off with 1/0.
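If the strings are still needed for display afterwards, the round trip mentioned in the question could look roughly like this. It is a sketch under my own assumptions: it uses the current `index`/`columns` keywords, and the names `encoded`, `pivoted` and `decoded` are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    'index': range(8),
    'variable1': ["A", "A", "B", "B", "A", "B", "B", "A"],
    'variable2': ["a", "b", "a", "b", "a", "b", "a", "b"],
    'variable3': ["x", "x", "x", "y", "y", "y", "x", "y"],
    'result': ["on", "off", "off", "on", "on", "off", "off", "on"],
})

# Encode the strings as 1/0 so the default numeric aggregation applies.
encoded = df1.assign(result=df1['result'].map({'on': 1, 'off': 0}))

pivoted = encoded.pivot_table(
    values='result',
    index='index',
    columns=['variable1', 'variable2', 'variable3'],
)

# Map the numbers back to strings column by column; cells that were NaN
# stay NaN because NaN is not a key in the mapping.
decoded = pivoted.apply(lambda col: col.map({1: 'on', 0: 'off'}))
print(decoded)
```

Decoding like this only makes sense when every cell comes from a single row, as it does here; with duplicates the default mean could produce values such as 0.5 that have no string equivalent.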
