python - Multi-Index Sorting in Pandas

Question

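I have a DataFrame with a MultiIndex created via a groupby operation. I'm trying to do a compound sort using several levels of the index, but I can't seem to find a sort function that does what I need.

The initial dataset looks something like this (daily sales counts of various products):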

         Date Manufacturer Product Name Product Launch Date  Sales
0  2013-01-01        Apple         iPod          2001-10-23     12
1  2013-01-01        Apple         iPad          2010-04-03     13
2  2013-01-01      Samsung       Galaxy          2009-04-27     14
3  2013-01-01      Samsung   Galaxy Tab          2010-09-02     15
4  2013-01-02        Apple         iPod          2001-10-23     22
5  2013-01-02        Apple         iPad          2010-04-03     17
6  2013-01-02      Samsung       Galaxy          2009-04-27     10
7  2013-01-02      Samsung   Galaxy Tab          2010-09-02      7

I use groupby to get a sum over the date range:

> grouped = df.groupby(['Manufacturer', 'Product Name', 'Product Launch Date']).sum()
                                               Sales
Manufacturer Product Name Product Launch Date       
Apple        iPad         2010-04-03              30
             iPod         2001-10-23              34
Samsung      Galaxy       2009-04-27              24
             Galaxy Tab   2010-09-02              22
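For anyone reproducing this on a recent pandas release, here is a minimal sketch of the same aggregation. It assumes (my assumption, not something the question states) that you want only the numeric Sales column summed; depending on the pandas version, calling .sum() on the whole group may silently drop the string Date column or try to concatenate it, so selecting Sales explicitly sidesteps the difference.

```python
import pandas as pd

# Same data as in the question, written compactly.
data = {
    'Date': ['2013-01-01'] * 4 + ['2013-01-02'] * 4,
    'Manufacturer': ['Apple', 'Apple', 'Samsung', 'Samsung'] * 2,
    'Product Name': ['iPod', 'iPad', 'Galaxy', 'Galaxy Tab'] * 2,
    'Product Launch Date': ['2001-10-23', '2010-04-03',
                            '2009-04-27', '2010-09-02'] * 2,
    'Sales': [12, 13, 14, 15, 22, 17, 10, 7],
}
df = pd.DataFrame(data)

keys = ['Manufacturer', 'Product Name', 'Product Launch Date']

# Select the Sales column before summing so that only it is aggregated.
grouped = df.groupby(keys)[['Sales']].sum()
print(grouped)
```

The double brackets around 'Sales' keep the result a DataFrame (as in the output above) rather than a Series.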

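So far so good!

Now the last thing I want to do is sort each manufacturer's products by launch date, but keep them hierarchically grouped under the manufacturer. Here is all I am trying to do: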

                                               Sales
Manufacturer Product Name Product Launch Date       
Apple        iPod         2001-10-23              34
             iPad         2010-04-03              30
Samsung      Galaxy       2009-04-27              24
             Galaxy Tab   2010-09-02              22

When I try sortlevel(), I lose the nice per-company hierarchy I had before:

> grouped.sortlevel('Product Launch Date')
                                               Sales
Manufacturer Product Name Product Launch Date       
Apple        iPod         2001-10-23              34
Samsung      Galaxy       2009-04-27              24
Apple        iPad         2010-04-03              30
Samsung      Galaxy Tab   2010-09-02              22

sort() and sort_index() just fail:

grouped.sort(['Manufacturer','Product Launch Date'])
KeyError: u'no item named Manufacturer'

grouped.sort_index(by=['Manufacturer','Product Launch Date'])
KeyError: u'no item named Manufacturer'
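A likely reason for the KeyError, sketched below: after the groupby, the group keys are index levels rather than columns, so a sort that looks up column names cannot find 'Manufacturer'. The snippet only inspects the frame; the compact data construction is mine, and it sums just the Sales column.

```python
import pandas as pd

data = {
    'Date': ['2013-01-01'] * 4 + ['2013-01-02'] * 4,
    'Manufacturer': ['Apple', 'Apple', 'Samsung', 'Samsung'] * 2,
    'Product Name': ['iPod', 'iPad', 'Galaxy', 'Galaxy Tab'] * 2,
    'Product Launch Date': ['2001-10-23', '2010-04-03',
                            '2009-04-27', '2010-09-02'] * 2,
    'Sales': [12, 13, 14, 15, 22, 17, 10, 7],
}
df = pd.DataFrame(data)
keys = ['Manufacturer', 'Product Name', 'Product Launch Date']
grouped = df.groupby(keys)[['Sales']].sum()

# The only real column is the aggregated one ...
column_names = list(grouped.columns)
# ... while the group keys have become the names of the index levels.
level_names = list(grouped.index.names)

print(column_names)
print(level_names)
```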

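This seems like a simple operation, but I can't quite figure it out.

I'm not tied to using a MultiIndex for this, but since that's what groupby() returns, that's what I've been working with.

By the way, the code to produce the initial DataFrame is: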

import pandas as pd

data = {
  'Date': ['2013-01-01', '2013-01-01', '2013-01-01', '2013-01-01', '2013-01-02', '2013-01-02', '2013-01-02', '2013-01-02'],
  'Manufacturer' : ['Apple', 'Apple', 'Samsung', 'Samsung', 'Apple', 'Apple', 'Samsung', 'Samsung'],
  'Product Name' : ['iPod', 'iPad', 'Galaxy', 'Galaxy Tab', 'iPod', 'iPad', 'Galaxy', 'Galaxy Tab'],
  'Product Launch Date' : ['2001-10-23', '2010-04-03', '2009-04-27', '2010-09-02','2001-10-23', '2010-04-03', '2009-04-27', '2010-09-02'],
  'Sales' : [12, 13, 14, 15, 22, 17, 10, 7]
}
df = pd.DataFrame(data, columns=['Date', 'Manufacturer', 'Product Name', 'Product Launch Date', 'Sales'])

score 10 · Accepted Answer

One hack would be to change the order of the levels:

In [11]: g
Out[11]:
                                               Sales
Manufacturer Product Name Product Launch Date
Apple        iPad         2010-04-03              30
             iPod         2001-10-23              34
Samsung      Galaxy       2009-04-27              24
             Galaxy Tab   2010-09-02              22

In [12]: g.index = g.index.swaplevel(1, 2)

sortlevel (as you've found) sorts the MultiIndex levels in order:

In [13]: g = g.sortlevel()

And swap back:

In [14]: g.index = g.index.swaplevel(1, 2)

In [15]: g
Out[15]:
                                               Sales
Manufacturer Product Name Product Launch Date
Apple        iPod         2001-10-23              34
             iPad         2010-04-03              30
Samsung      Galaxy       2009-04-27              24
             Galaxy Tab   2010-09-02              22
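Newer pandas releases no longer provide DataFrame.sortlevel, so here is a sketch of the same swap, sort, swap-back idea written with sort_index instead. Substituting sort_index for sortlevel is my adaptation, not part of the original answer, and the example sums only the Sales column.

```python
import pandas as pd

data = {
    'Date': ['2013-01-01'] * 4 + ['2013-01-02'] * 4,
    'Manufacturer': ['Apple', 'Apple', 'Samsung', 'Samsung'] * 2,
    'Product Name': ['iPod', 'iPad', 'Galaxy', 'Galaxy Tab'] * 2,
    'Product Launch Date': ['2001-10-23', '2010-04-03',
                            '2009-04-27', '2010-09-02'] * 2,
    'Sales': [12, 13, 14, 15, 22, 17, 10, 7],
}
df = pd.DataFrame(data)
keys = ['Manufacturer', 'Product Name', 'Product Launch Date']
grouped = df.groupby(keys)[['Sales']].sum()

g = (
    grouped
    .swaplevel(1, 2)   # index is now (Manufacturer, Launch Date, Product Name)
    .sort_index()      # lexicographic sort over the levels in that order
    .swaplevel(1, 2)   # restore the original level order; rows stay where they are
)
print(g)
```

Unlike assigning to g.index, the DataFrame-level swaplevel returns a new frame, so grouped itself is left untouched.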

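I'm of the opinion that sortlevel should not sort the remaining labels in order, so I will create a GitHub issue. :) It's still worth mentioning the docs note about "the need for sortedness".

Note: you can avoid the first swaplevel by reordering the keys of the initial groupby: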

g = df.groupby(['Manufacturer', 'Product Launch Date', 'Product Name']).sum()
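A runnable sketch of that shortcut follows. Because groupby sorts its keys by default, grouping in (Manufacturer, Launch Date, Product Name) order already yields the desired row order; the reorder_levels call at the end is my addition to put the levels back in the display order used in the question, and only the Sales column is summed.

```python
import pandas as pd

data = {
    'Date': ['2013-01-01'] * 4 + ['2013-01-02'] * 4,
    'Manufacturer': ['Apple', 'Apple', 'Samsung', 'Samsung'] * 2,
    'Product Name': ['iPod', 'iPad', 'Galaxy', 'Galaxy Tab'] * 2,
    'Product Launch Date': ['2001-10-23', '2010-04-03',
                            '2009-04-27', '2010-09-02'] * 2,
    'Sales': [12, 13, 14, 15, 22, 17, 10, 7],
}
df = pd.DataFrame(data)

# Group with the launch date ahead of the product name; the result comes
# back sorted by manufacturer, then launch date.
g = df.groupby(['Manufacturer', 'Product Launch Date', 'Product Name'])[['Sales']].sum()

# Rearrange the levels for display only; this does not re-sort the rows.
g = g.reorder_levels(['Manufacturer', 'Product Name', 'Product Launch Date'])
print(g)
```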

score 6 · Accepted Answer

This one-liner works for me:

In [1]: grouped.sortlevel(["Manufacturer","Product Launch Date"], sort_remaining=False)

                                               Sales
Manufacturer Product Name Product Launch Date       
Apple        iPod         2001-10-23              34
             iPad         2010-04-03              30
Samsung      Galaxy       2009-04-27              24
             Galaxy Tab   2010-09-02              22

Note that this works too:

grouped.sortlevel([0,2], sort_remaining=False)

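This wouldn't have worked when you originally posted over two years ago, because sortlevel by default sorted on all index levels, which mucked up your company hierarchy. The sort_remaining parameter, which disables that behavior, was added last year. Here's the commit link for reference: https://github.com/pydata/pandas/commit/3ad64b11e8e4bef47e3767f1d31cc26e39593277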
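To make the difference concrete, here is a small sketch contrasting a sort on the launch date alone with a sort on manufacturer plus launch date. It uses sort_index with level and sort_remaining, which I am assuming as the present-day counterpart of sortlevel; the variable names are illustrative only.

```python
import pandas as pd

data = {
    'Date': ['2013-01-01'] * 4 + ['2013-01-02'] * 4,
    'Manufacturer': ['Apple', 'Apple', 'Samsung', 'Samsung'] * 2,
    'Product Name': ['iPod', 'iPad', 'Galaxy', 'Galaxy Tab'] * 2,
    'Product Launch Date': ['2001-10-23', '2010-04-03',
                            '2009-04-27', '2010-09-02'] * 2,
    'Sales': [12, 13, 14, 15, 22, 17, 10, 7],
}
df = pd.DataFrame(data)
keys = ['Manufacturer', 'Product Name', 'Product Launch Date']
grouped = df.groupby(keys)[['Sales']].sum()

# Launch date as the primary key: manufacturers end up interleaved.
by_date = grouped.sort_index(level='Product Launch Date')

# Manufacturer first, then launch date, leaving the remaining level alone.
by_both = grouped.sort_index(
    level=['Manufacturer', 'Product Launch Date'],
    sort_remaining=False,
)

print(by_date)
print(by_both)
```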

score 4 · Accepted Answer

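To sort a MultiIndex by its "index columns" (a.k.a. levels), use the .sort_index() method and set its level argument. To sort by multiple levels, set level to a list of level names in the order you want them applied.

This should give you the DataFrame you need: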

df.groupby(['Manufacturer',
            'Product Name',
            'Product Launch Date']
          ).sum().sort_index(level=['Manufacturer','Product Launch Date'])
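Going one small step beyond the answer: sort_index also accepts a list for ascending, one entry per level named in level. The sketch below shows the ascending sort from above next to a hypothetical "newest launch first within each manufacturer" variant; that variant is my own illustration, not something the question asked for.

```python
import pandas as pd

data = {
    'Date': ['2013-01-01'] * 4 + ['2013-01-02'] * 4,
    'Manufacturer': ['Apple', 'Apple', 'Samsung', 'Samsung'] * 2,
    'Product Name': ['iPod', 'iPad', 'Galaxy', 'Galaxy Tab'] * 2,
    'Product Launch Date': ['2001-10-23', '2010-04-03',
                            '2009-04-27', '2010-09-02'] * 2,
    'Sales': [12, 13, 14, 15, 22, 17, 10, 7],
}
df = pd.DataFrame(data)
keys = ['Manufacturer', 'Product Name', 'Product Launch Date']
grouped = df.groupby(keys)[['Sales']].sum()

levels = ['Manufacturer', 'Product Launch Date']

# Oldest launch first within each manufacturer (the order the question wants).
oldest_first = grouped.sort_index(level=levels)

# Manufacturer ascending, launch date descending.
newest_first = grouped.sort_index(level=levels, ascending=[True, False])

print(oldest_first)
print(newest_first)
```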

score 0 · Accepted Answer

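If you want to avoid multiple swaps within a very deep MultiIndex, you could also try this:

Slice by level X (with a list comprehension + .loc + IndexSlice)
Sort the desired level (sortlevel(2))
Concatenate the slices for each level-X index value

Here is the code: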

import pandas as pd
idx = pd.IndexSlice
g = pd.concat([grouped.loc[idx[i,:,:],:].sortlevel(2) for i in grouped.index.levels[0]])
g
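The same slice, sort, concatenate steps can be sketched for current pandas by swapping sortlevel(2) for sort_index(level=2). That substitution is my assumption about the closest replacement, and the snippet relies on the groupby result being sorted on its first level, which .loc needs for this kind of slicing.

```python
import pandas as pd

data = {
    'Date': ['2013-01-01'] * 4 + ['2013-01-02'] * 4,
    'Manufacturer': ['Apple', 'Apple', 'Samsung', 'Samsung'] * 2,
    'Product Name': ['iPod', 'iPad', 'Galaxy', 'Galaxy Tab'] * 2,
    'Product Launch Date': ['2001-10-23', '2010-04-03',
                            '2009-04-27', '2010-09-02'] * 2,
    'Sales': [12, 13, 14, 15, 22, 17, 10, 7],
}
df = pd.DataFrame(data)
keys = ['Manufacturer', 'Product Name', 'Product Launch Date']
grouped = df.groupby(keys)[['Sales']].sum()

idx = pd.IndexSlice

# One slice per manufacturer, each sorted on level 2 (the launch date).
pieces = [
    grouped.loc[idx[m, :, :], :].sort_index(level=2)
    for m in grouped.index.levels[0]
]

# Stitch the per-manufacturer pieces back together in manufacturer order.
g = pd.concat(pieces)
print(g)
```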

score 0 · Accepted Answer

If you don't care about keeping the index (I often prefer an arbitrary integer index), you can just use the following one-liner:

grouped.reset_index().sort(["Manufacturer","Product Launch Date"])
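DataFrame.sort was removed in later pandas releases in favor of sort_values, so on a current install the equivalent would look roughly like this. The trailing reset_index(drop=True) is my addition, there only to renumber the rows after sorting.

```python
import pandas as pd

data = {
    'Date': ['2013-01-01'] * 4 + ['2013-01-02'] * 4,
    'Manufacturer': ['Apple', 'Apple', 'Samsung', 'Samsung'] * 2,
    'Product Name': ['iPod', 'iPad', 'Galaxy', 'Galaxy Tab'] * 2,
    'Product Launch Date': ['2001-10-23', '2010-04-03',
                            '2009-04-27', '2010-09-02'] * 2,
    'Sales': [12, 13, 14, 15, 22, 17, 10, 7],
}
df = pd.DataFrame(data)
keys = ['Manufacturer', 'Product Name', 'Product Launch Date']
grouped = df.groupby(keys)[['Sales']].sum()

flat = (
    grouped
    .reset_index()                                          # index levels become columns
    .sort_values(['Manufacturer', 'Product Launch Date'])   # ordinary column sort
    .reset_index(drop=True)                                 # renumber rows 0..n-1
)
print(flat)
```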
