I ran into the following problem with the ordering of row and column headers.
Here is how to reproduce it:
import numpy as np
import pandas as pd
from IPython.display import HTML  # used further down to render the tables
X = pd.DataFrame(dict(x=np.random.normal(size=100), y=np.random.normal(size=100)))
A = pd.qcut(X['x'], [0, 0.25, 0.5, 0.75, 1.0])  # create a factor
B = pd.qcut(X['y'], [0, 0.25, 0.5, 0.75, 1.0])  # create another factor
g = X.groupby([A, B])['x'].mean()  # do a two-way bucketing
print(g)
#this gives the following and so far so good
x                  y
[-2.315, -0.843]   [-2.58, -0.567]    -1.041167
                   (-0.567, 0.0321]   -1.722926
                   (0.0321, 0.724]    -1.245856
                   (0.724, 3.478]     -1.240876
(-0.843, -0.228]   [-2.58, -0.567]    -0.576264
                   (-0.567, 0.0321]   -0.501709
                   (0.0321, 0.724]    -0.522697
                   (0.724, 3.478]     -0.506259
(-0.228, 0.382]    [-2.58, -0.567]     0.175768
                   (-0.567, 0.0321]    0.214353
                   (0.0321, 0.724]     0.113650
                   (0.724, 3.478]     -0.013758
(0.382, 2.662]     [-2.58, -0.567]     0.983807
                   (-0.567, 0.0321]    1.214640
                   (0.0321, 0.724]     0.808608
                   (0.724, 3.478]      1.515334
Name: x, dtype: float64
#Now let's make a two way table and here is the problem:
HTML(g.unstack().to_html())
This displays:
y                  (-0.567, 0.0321]   (0.0321, 0.724]   (0.724, 3.478]   [-2.58, -0.567]
x
(-0.228, 0.382]            0.214353          0.113650        -0.013758          0.175768
(-0.843, -0.228]          -0.501709         -0.522697        -0.506259         -0.576264
(0.382, 2.662]             1.214640          0.808608         1.515334          0.983807
[-2.315, -0.843]          -1.722926         -1.245856        -1.240876         -1.041167
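Note that the headers are no longer in their original order. I'm wondering what a good way to fix this would be, so that interactive work stays easy.
To trace further where the problem comes from, run: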
g.unstack().columns
which gives me: Index([(-0.567, 0.0321], (0.0321, 0.724], (0.724, 3.478], [-2.58, -0.567]], dtype=object)
Now compare that with B.levels:
B.levels
Index([[-2.58, -0.567], (-0.567, 0.0321], (0.0321, 0.724], (0.724, 3.478]], dtype=object)
So the original order in the Factor has evidently been lost.
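One possible workaround, offered only as a sketch, is to reindex the unstacked table against the original bin order. The snippet below assumes a recent pandas, where the old `levels` attribute is exposed as `.cat.categories`, and it deliberately converts the bins to plain strings so that the lexicographic sorting described above can show up at all. The names `x_bin`, `y_bin`, `row_order` and `col_order` are made up for illustration and are not part of the original code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
X = pd.DataFrame({"x": rng.normal(size=100), "y": rng.normal(size=100)})

quantiles = [0, 0.25, 0.5, 0.75, 1.0]
A = pd.qcut(X["x"], quantiles)
B = pd.qcut(X["y"], quantiles)

# Bin labels in their original (ascending) order, as plain strings.
row_order = [str(c) for c in A.cat.categories]
col_order = [str(c) for c in B.cat.categories]

# Assumption: string keys stand in for the object-dtype labels in the
# question; strings are sorted lexicographically by groupby/unstack.
x_bin = pd.Series([str(v) for v in A], index=X.index, name="x_bin")
y_bin = pd.Series([str(v) for v in B], index=X.index, name="y_bin")

g = X.groupby([x_bin, y_bin])["x"].mean()
table = g.unstack()

# Put rows and columns back into the order of the original bins.
fixed = table.reindex(index=row_order, columns=col_order)
print(fixed)
```

Reindexing only rearranges existing labels, so the cell values are unchanged; a bin combination with no observations simply stays NaN.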
It gets worse. Let's build a multi-level cross table:
g2 = X.groupby([A,B]).agg('mean')
g3 = g2.stack().unstack(-2)
HTML(g3.to_html())
It displays as follows:
y                     (-0.567, 0.0321]   (0.0321, 0.724]   (0.724, 3.478]
x
(-0.228, 0.382]    x          0.214353          0.113650        -0.013758
                   y         -0.293465          0.321836         1.180369
(-0.843, -0.228]   x         -0.501709         -0.522697        -0.506259
                   y         -0.204811          0.324571         1.167005
(0.382, 2.662]     x          1.214640          0.808608         1.515334
                   y         -0.195446          0.161198         1.074532
[-2.315, -0.843]   x         -1.722926         -1.245856        -1.240876
                   y         -0.392896          0.335471         1.730513
Both the row labels and the column labels are ordered incorrectly.
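For the multi-level table, a similar hedged sketch is to build the desired row index explicitly and reindex against it. This again assumes a recent pandas with `.cat.categories`, uses string bin labels to imitate the object-dtype headers, and introduces the illustrative names `x_bin`, `y_bin`, `row_order`, `col_order` and `full_rows`, none of which come from the original code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
X = pd.DataFrame({"x": rng.normal(size=100), "y": rng.normal(size=100)})

quantiles = [0, 0.25, 0.5, 0.75, 1.0]
A = pd.qcut(X["x"], quantiles)
B = pd.qcut(X["y"], quantiles)

# Bin labels in their original (ascending) order, as plain strings.
row_order = [str(c) for c in A.cat.categories]
col_order = [str(c) for c in B.cat.categories]

# Assumption: string keys imitate the object-dtype labels in the question.
x_bin = pd.Series([str(v) for v in A], index=X.index, name="x_bin")
y_bin = pd.Series([str(v) for v in B], index=X.index, name="y_bin")

g2 = X.groupby([x_bin, y_bin]).agg("mean")
g3 = g2.stack().unstack(-2)  # rows: (x bin, variable), columns: y bin

# Desired row order: every x bin in its original order, each with x then y.
full_rows = pd.MultiIndex.from_product([row_order, ["x", "y"]])
fixed3 = g3.reindex(index=full_rows, columns=col_order)
print(fixed3)
```

Spelling out the full row index avoids relying on how level-wise reindexing orders a MultiIndex, at the cost of listing the inner labels by hand.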
Thanks.