python - Get the max count of a column in a grouped DataFrame

Question

My DataFrame df is:

    Election Year   Votes   Party   Region
  0   2000           50      A       a
  1   2000           100     B       a
  2   2000           26      A       b
  3   2000           180     B       b
  4   2000           300     A       c
  5   2000           46      C       c
  6   2005           149     A       a
  7   2005           46      B       a
  8   2005           312     A       b
  9   2005           23      B       b
  10  2005           16      A       c
  11  2005           35      C       c
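
To reproduce this, the frame can be rebuilt roughly as follows. This is a minimal sketch that uses only the values shown in the table and assumes the default integer index:

```python
import pandas as pd

# Rebuild the sample frame from the table above (12 rows, default 0..11 index).
df = pd.DataFrame({
    "Election Year": [2000] * 6 + [2005] * 6,
    "Votes": [50, 100, 26, 180, 300, 46, 149, 46, 312, 23, 16, 35],
    "Party": ["A", "B", "A", "B", "A", "C"] * 2,
    "Region": ["a", "a", "b", "b", "c", "c"] * 2,
})
print(df)
```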

For each year, I want the party that won the most regions. So the desired output is:

 Election Year Party
   2000         B
   2005         A

I tried this code to get the output above, but it raises an error:

 winner = df.groupby(['Election Year'])['Votes'].max().reset_index()
 winner = winner.groupby('Election Year').first().reset_index()
 winner = winner[['Election Year', 'Party']].to_string(index=False)
 winner

How can I get the desired output?

score 2 · Accepted Answer

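Here is one approach using a nested groupby. We first compute each party's vote total within every year-region pair, then use mode to find the party that won the most regions. The mode need not be unique (if two or more parties won the same number of regions). Note that the code below refers to the year column as "Year", so it assumes "Election Year" has been renamed accordingly.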

df.groupby(["Year", "Region"])\
  .apply(lambda gp: gp.groupby("Party").Votes.sum().idxmax())\
  .unstack().mode(1).rename(columns={0: "Party"})

     Party
Year      
2000     B
2005     A
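
The point about a non-unique mode can be checked in isolation. In the sketch below, the `winners` table is written out by hand from the per-region winners in the question's data, and the tied year 2010 is purely hypothetical:

```python
import pandas as pd

# Per-region winners derived from the question's data
# (rows: year, columns: region).
winners = pd.DataFrame(
    {"a": ["B", "A"], "b": ["B", "A"], "c": ["A", "C"]},
    index=pd.Index([2000, 2005], name="Year"),
)
unique_mode = winners.mode(axis=1)
print(unique_mode)  # one column, labelled 0

# Hypothetical year in which two parties each win one region.
tied = pd.DataFrame(
    {"a": ["A"], "b": ["B"]},
    index=pd.Index([2010], name="Year"),
)
tied_mode = tied.mode(axis=1)
print(tied_mode)  # two columns, labelled 0 and 1, one per tied party
```

With a tie, renaming only column 0 to "Party" would leave the other tied parties in extra numbered columns, so those need separate handling.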

To address the comment, you can replace the idxmax above with nlargest and diff to find the regions where the winning margin is below a given number.

margin = df.groupby(["Year", "Region"])\
  .apply(lambda gp: gp.groupby("Party").Votes.sum().nlargest(2).diff()) > -125

print(margin[margin].reset_index()[["Year", "Region"]])

#    Year Region
# 0  2000      a
# 1  2005      a
# 2  2005      c
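
The same check can also be written with the margin as an explicit positive number, which some may find easier to read. This is a sketch rather than the answer's own code: it rebuilds the sample data, renames the year column to "Year" to match the snippet above, and the names `totals`, `margin` and `close` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Year": [2000] * 6 + [2005] * 6,
    "Votes": [50, 100, 26, 180, 300, 46, 149, 46, 312, 23, 16, 35],
    "Party": ["A", "B", "A", "B", "A", "C"] * 2,
    "Region": ["a", "a", "b", "b", "c", "c"] * 2,
})

# Total votes per party within each (year, region) pair.
totals = df.groupby(["Year", "Region", "Party"])["Votes"].sum()
by_region = totals.groupby(level=["Year", "Region"])

# Margin = winner's total minus runner-up's total.
# A region with a single party would get a margin of 0 here.
runner_up = by_region.apply(lambda s: s.nlargest(2).iloc[-1])
margin = by_region.max() - runner_up

# Regions won by fewer than 125 votes.
close = margin[margin < 125].reset_index()[["Year", "Region"]]
print(close)
```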

score 2 · Accepted Answer

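You can use GroupBy.idxmax() to get the index of the maximum Votes within each Election Year group, then use .loc to locate those rows, and finally select the desired columns, as follows: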

df.loc[df.groupby('Election Year')['Votes'].idxmax()][['Election Year', 'Party']]

Result:

   Election Year Party
4           2000     A
8           2005     A
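
To see what idxmax returns here, the sketch below rebuilds the sample data and prints the intermediate index labels. The name `top_idx` is only illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "Election Year": [2000] * 6 + [2005] * 6,
    "Votes": [50, 100, 26, 180, 300, 46, 149, 46, 312, 23, 16, 35],
    "Party": ["A", "B", "A", "B", "A", "C"] * 2,
    "Region": ["a", "a", "b", "b", "c", "c"] * 2,
})

# Row label of the single largest Votes value in each year.
top_idx = df.groupby("Election Year")["Votes"].idxmax()
print(top_idx.tolist())

# Those labels are then passed to .loc to pull the matching rows.
top_rows = df.loc[top_idx, ["Election Year", "Party"]]
print(top_rows)
```

This picks the row with the highest vote count in any one region per year, which is why 2000 shows A (300 votes in region c) rather than B.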

Edit

If we want the Party that wins the most Regions, we can use the following code (without using the slow .apply() with a lambda function):

(df.loc[
    df.groupby(['Election Year', 'Region'])['Votes'].idxmax()]
    [['Election Year', 'Party', 'Region']]
    .pivot(index='Election Year', columns='Region')
    .mode(axis=1)
).rename({0: 'Party'}, axis=1).reset_index()

Result:

   Election Year Party
0           2000     B
1           2005     A
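
If the number of regions won is also of interest, one alternative is to count wins with pd.crosstab instead of pivot and mode. This is a swapped-in technique shown only as a sketch. The names `winners` and `wins` are illustrative, and unlike mode, idxmax keeps only the first party when there is a tie:

```python
import pandas as pd

df = pd.DataFrame({
    "Election Year": [2000] * 6 + [2005] * 6,
    "Votes": [50, 100, 26, 180, 300, 46, 149, 46, 312, 23, 16, 35],
    "Party": ["A", "B", "A", "B", "A", "C"] * 2,
    "Region": ["a", "a", "b", "b", "c", "c"] * 2,
})

# Winning row for each (year, region) pair.
winners = df.loc[df.groupby(["Election Year", "Region"])["Votes"].idxmax()]

# Number of regions won by each party in each year.
wins = pd.crosstab(winners["Election Year"], winners["Party"])
print(wins)

# Party with the most regions per year (first one on ties).
result = wins.idxmax(axis=1).rename("Party").reset_index()
print(result)
```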

score 1 · Accepted Answer

Try this:

winner = df.groupby(['Election Year','Party'])['Votes'].max().reset_index()
winner.drop('Votes', axis = 1, inplace = True)
winner

score 1 · Accepted Answer

Another approach (actually close to @hilberts_drinking_problem's):

>>> df.groupby(["Election Year", "Region"]) \
      .apply(lambda x: x.loc[x["Votes"].idxmax(), "Party"]) \
      .unstack().mode(axis="columns") \
      .rename(columns={0: "Party"}).reset_index()

   Election Year Party
0           2000     B
1           2005     A

score 0 · Accepted Answer

I believe the one-liner df.groupby(["Election Year"]).max().reset_index()[['Election Year', 'Party']] can solve your problem.
