python - How to group the "remaining" results beyond the Top N into "Others" with Pandas

Question

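When grouping a pandas DataFrame by a column, say "Version", which has 10 distinct versions, how can I plot the top 3 (which cover more than 90%) and put the remaining small fractions into a single "Others" bucket?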

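import numpy as np
import pandas as pd

data = np.array([
              ('Top1', 14),
              ('Top1', 3),
              ('Top1', 2),
              ('Top2', 6),
              ('Top2', 7),
              ('Other1', 1),
              ('Other2', 2),
         ],
      # 'U10' yields str labels on Python 3 ('S10' would yield bytes)
      dtype=[('Version', 'U10'), ('Value', '<i4')])
df = pd.DataFrame.from_records(data)
df.groupby('Version').sum()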

This returns:

Value
Version 
Other1   1
Other2   2
Top1     19
Top2     13

I am looking for:

Value
Version 
Others   
Top1     19
Top2     13

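The version names Other* and Top* are for illustration only.

Of course, this can be done by manually setting the category to "Others" after grouping and comparing against a threshold. I was hoping for a shortcut.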

score 9 · Accepted Answer

I assume you also want the Other groups summed, i.e. 3 in total?

If my goal were to win a Pandas one-liner contest, this would be my entry:

df.replace(df.groupby('Version').sum().sort_values('Value', ascending=False).index[2:], 'Other').groupby('Version').sum()

         Value
Version       
Other        3
Top1        19
Top2        13

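But that is completely unreadable, so let's break it down:

You have already shown how to sum each group. Sorting the result and selecting everything outside the top 2 can be done with: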

not_top2 = df.groupby('Version').sum().sort_values('Value', ascending=False).index[2:]

In this example, not_top2 contains Other1 and Other2.

We can replace these Versions with a generic name:

dfnew  = df.replace(not_top2, 'Other')
print(dfnew)

  Version  Value
0    Top1     14
1    Top1      3
2    Top1      2
3    Top2      6
4    Top2      7
5   Other      1
6   Other      2

The above replaces the values listed in not_top2 in any column. If those values may also appear in a column other than Version, a small extra step is needed.
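One way to take that extra step is to call replace on the Version column alone rather than on the whole frame. The sketch below assumes a hypothetical extra column, Note, that happens to contain the string 'Other1'; that column and its values are made up for illustration and are not part of the question's data.

```python
import pandas as pd

# Hypothetical frame: the question's data plus a made-up 'Note' column
# that happens to contain the label 'Other1'.
df = pd.DataFrame({
    'Version': ['Top1', 'Top1', 'Top1', 'Top2', 'Top2', 'Other1', 'Other2'],
    'Value': [14, 3, 2, 6, 7, 1, 2],
    'Note': ['a', 'b', 'Other1', 'c', 'd', 'e', 'f'],
})

# Labels outside the top 2 by summed Value.
totals = df.groupby('Version')['Value'].sum().sort_values(ascending=False)
not_top2 = totals.index[2:].tolist()

# Replace only within the Version column; 'Note' is left untouched.
dfnew = df.copy()
dfnew['Version'] = dfnew['Version'].replace(not_top2, 'Other')

result = dfnew.groupby('Version')['Value'].sum()
print(result)
```

Because only one column is touched, the 'Other1' entry in Note survives unchanged.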

All that remains is to do the original grouping again:

dfnew.groupby('Version').sum()

This gives:

         Value
Version       
Other        3
Top1        19
Top2        13
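If this is needed for more than one column or more than one value of N, the steps above could be wrapped in a small helper. The following is only a sketch: the function name sum_top_n and its parameters are hypothetical, and it uses nlargest and Series.where in place of sort plus replace, which should give the same result for this data.

```python
import pandas as pd

def sum_top_n(df, key, value, n, other_label='Other'):
    """Sum `value` per `key`, folding every group outside the top n into one bucket."""
    totals = df.groupby(key)[value].sum()
    top = totals.nlargest(n).index
    # Keep labels that are in the top n; everything else becomes other_label.
    labels = df[key].where(df[key].isin(top), other_label)
    return df[value].groupby(labels).sum()

df = pd.DataFrame({
    'Version': ['Top1', 'Top1', 'Top1', 'Top2', 'Top2', 'Other1', 'Other2'],
    'Value': [14, 3, 2, 6, 7, 1, 2],
})

result = sum_top_n(df, 'Version', 'Value', n=2)
print(result)
```

The original frame is not modified; the relabelled keys exist only for the duration of the grouping.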

score 2 · Accepted Answer

# number of top-n you want
n = 2

# group by & sort descending
df_sorted = (df
                .groupby('Version').sum()
                .sort_values('Value', ascending=False)
                .reset_index()
            )

# rename rows other than top-n to 'Others'
df_sorted.loc[df_sorted.index >= n, 'Version'] = 'Others'

# re-group by again
df_sorted.groupby('Version').sum()

score 0 · Accepted Answer

Use value_counts instead of GroupBy.

# get top 3 versions (also keep the nan values)
versions_to_keep = df['Version'].value_counts(dropna=False)[:3].index

# set all other versions outside of top 3 versions as 'other'
# (a single .loc call with row mask and column label; chained indexing would assign to a copy)
df.loc[~df['Version'].isin(versions_to_keep), 'Version'] = 'Other'
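One caveat worth keeping in mind: value_counts ranks versions by how many rows they have, whereas the groupby-based answers rank them by summed Value, so the two approaches can pick different "top" versions on other data. A small sketch with the question's data, keeping the top 2 rather than 3 because the sample only has four versions:

```python
import pandas as pd

df = pd.DataFrame({
    'Version': ['Top1', 'Top1', 'Top1', 'Top2', 'Top2', 'Other1', 'Other2'],
    'Value': [14, 3, 2, 6, 7, 1, 2],
})

# Ranking by number of rows (what value_counts does).
by_count = df['Version'].value_counts()

# Ranking by summed Value (what the groupby-based answers do).
by_sum = df.groupby('Version')['Value'].sum().sort_values(ascending=False)

# Keep the top 2 by row count and relabel the rest in a single .loc call.
versions_to_keep = by_count.head(2).index
df.loc[~df['Version'].isin(versions_to_keep), 'Version'] = 'Other'

result = df.groupby('Version')['Value'].sum()
print(result)
```

Here both rankings agree on Top1 and Top2, so the grouped sums match the other answers.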
