python - pandas groupby/apply: what exactly is passed to the apply function?

Question

Python newbie here. I'm trying to understand how the pandas groupby and apply methods work. I found this simple example, which I paste below:

import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}

df = pd.DataFrame(ipl_data)

The dataframe df looks like this:

      Team  Rank  Year  Points
0   Riders     1  2014     876
1   Riders     2  2015     789
2   Devils     2  2014     863
3   Devils     3  2015     673
4    Kings     3  2014     741
5    kings     4  2015     812
6    Kings     1  2016     756
7    Kings     1  2017     788
8   Riders     2  2016     694
9   Royals     4  2014     701
10  Royals     1  2015     804
11  Riders     2  2017     690

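So far so good. Next I want to transform the data so that, for each Team group, I keep only the first element of the Points column. Having first checked that df['Points'][0] does give me the first Points element of df, I tried this: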

df.groupby('Team').apply(lambda x : x['Points'][0])

thinking that the argument x of the lambda is another pandas dataframe. However, Python raises an error:

File "pandas/_libs/index.pyx", line 81, in pandas._libs.index.IndexEngine.get_value
File "pandas/_libs/index.pyx", line 89, in pandas._libs.index.IndexEngine.get_value
File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 987, in pandas._libs.hashtable.Int64HashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 993, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 0

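This seems to have something to do with a HashTable, but I can't figure out why. Then I thought that maybe what gets passed to the lambda is not a dataframe, so I ran this: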

df.groupby('Team').apply(lambda x : (type(x), x.shape))

Output:

Team
Devils    (<class 'pandas.core.frame.DataFrame'>, (2, 4))
Kings     (<class 'pandas.core.frame.DataFrame'>, (3, 4))
Riders    (<class 'pandas.core.frame.DataFrame'>, (4, 4))
Royals    (<class 'pandas.core.frame.DataFrame'>, (2, 4))
kings     (<class 'pandas.core.frame.DataFrame'>, (1, 4))
dtype: object

which, IIUC, shows that the argument of the lambda is indeed a pandas dataframe holding the subset of df for each team.

I know I can get the result I want by running:

df.groupby('Team').apply(lambda x : x['Points'].iloc[0])

I just want to understand why df['Points'][0] works while x['Points'][0] does not work inside the apply function. Thanks for reading!

score 5 · Accepted Answer

When you call df.groupby('Team').apply(lambda x: ...), you are essentially splitting the dataframe by Team and passing each chunk to the lambda function:

      Team  Rank  Year  Points
0   Riders     1  2014     876
1   Riders     2  2015     789
8   Riders     2  2016     694
11  Riders     2  2017     690
------------------------------
2   Devils     2  2014     863
3   Devils     3  2015     673
------------------------------
4    Kings     3  2014     741
6    Kings     1  2016     756
7    Kings     1  2017     788
------------------------------
5    kings     4  2015     812
------------------------------
9   Royals     4  2014     701
10  Royals     1  2015     804

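df['Points'][0] works because you are telling pandas "get the value at label 0 of the Points series", and that label exists.

.apply(lambda x: x['Points'][0]) does not work because only one chunk (Riders) contains the label 0. Every chunk keeps the row labels it had in df, so the lookup fails on the other chunks and you get a KeyError.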
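To make the label-versus-position point concrete, here is a minimal sketch on a shortened copy of the data (the trimmed frame and the variable names are assumptions made for illustration, not part of the original post). It lists the row labels each chunk carries and shows that a lookup by label 0 fails on a chunk that does not contain it, while a positional lookup with .iloc succeeds.

```python
import pandas as pd

# Shortened version of the question's data (illustrative subset).
df = pd.DataFrame({
    "Team": ["Riders", "Riders", "Devils", "Devils", "Kings", "kings"],
    "Points": [876, 789, 863, 673, 741, 812],
})

# Each chunk keeps the row labels it had in the original frame.
group_labels = {team: list(chunk.index) for team, chunk in df.groupby("Team")}
print(group_labels)

# Label-based lookup: the Devils chunk has labels 2 and 3, so label 0 is missing.
devils = df.groupby("Team").get_group("Devils")
try:
    devils["Points"][0]
    lookup_failed = False
except KeyError:
    lookup_failed = True
print("label lookup failed:", lookup_failed)

# Position-based lookup: .iloc[0] is always the first row of the chunk.
first_points = df.groupby("Team")["Points"].apply(lambda s: s.iloc[0])
print(first_points)
```

Selecting the Points column before calling apply means the function receives a Series per group rather than a dataframe, which keeps the example small; the same indexing rules apply either way.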

That said, apply is general-purpose, so it is quite slow compared with the built-in vectorized aggregation functions. You can use first:

df.groupby('Team')['Points'].first()
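As a quick check, the sketch below compares first() with the apply/.iloc[0] approach on a small illustrative frame (the data subset and variable names are assumptions for the example). One caveat worth knowing: first() returns the first non-null value in each group, so the two approaches can differ when a group starts with a missing value. This data has none.

```python
import pandas as pd

# Illustrative subset of the question's data; no missing values.
df = pd.DataFrame({
    "Team": ["Riders", "Riders", "Devils", "Devils", "Kings"],
    "Points": [876, 789, 863, 673, 741],
})

# Built-in aggregation: first non-null value per group.
via_first = df.groupby("Team")["Points"].first()

# Generic apply: first row of each group by position.
via_apply = df.groupby("Team")["Points"].apply(lambda s: s.iloc[0])

print(via_first)
print(via_first.tolist() == via_apply.tolist())
```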

score 1 · Accepted Answer

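apply processes the data one group at a time, so a label index such as [0] is looked up inside each group rather than in the whole frame, and it raises an error whenever the group does not contain that label. It works on df because label 0 exists in df's own index.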

You can try something like this to get the first Points value for each team.

df.drop_duplicates(subset=['Team'])

Output:

     Team  Rank  Year  Points
0  Riders     1  2014     876
2  Devils     2  2014     863
4   Kings     3  2014     741
5   kings     4  2015     812
9  Royals     4  2014     701

If you need to keep the row with the max/min Points instead, you can sort df before dropping duplicates. Hope that helps.
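The sort-then-deduplicate idea could look roughly like this. It is a sketch on a small illustrative frame, and the variable name best is an assumption for the example. drop_duplicates keeps the first occurrence by default, so sorting by Points in descending order first leaves the highest-scoring row per team.

```python
import pandas as pd

# Illustrative subset of the question's data.
df = pd.DataFrame({
    "Team": ["Riders", "Riders", "Devils", "Devils"],
    "Year": [2014, 2015, 2014, 2015],
    "Points": [789, 876, 863, 673],
})

# Sort so the best row of each team comes first, then keep one row per team.
best = df.sort_values("Points", ascending=False).drop_duplicates(subset=["Team"])
print(best)
```

Sorting with ascending=True instead would keep the lowest-scoring row per team.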

score 0 · Accepted Answer

As for the title question,

agroupby = df.groupby(...)
help( agroupby.apply )  # or in IPython xx.<tab> for help( xx )

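apply(func, *args, **kwargs) method of pandas.core.groupby.generic.DataFrameGroupBy instance

Apply function func group-wise and combine the results together.

The function passed to apply must take a dataframe as its first argument and return a dataframe, a series or a scalar. apply will then take care of combining the results back together into a single dataframe or series.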
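The two return shapes described in that help text can be seen in a short sketch. The helper names point_range and summary, and the trimmed data, are hypothetical and chosen only for illustration. Columns are selected explicitly before apply so that each function receives a dataframe containing just those columns.

```python
import pandas as pd

# Illustrative subset of the question's data.
df = pd.DataFrame({
    "Team": ["Riders", "Riders", "Devils", "Devils"],
    "Rank": [1, 2, 2, 3],
    "Points": [876, 789, 863, 673],
})

def point_range(chunk):
    # chunk is a DataFrame holding one team's rows; return a scalar.
    return chunk["Points"].max() - chunk["Points"].min()

def summary(chunk):
    # Return a Series; apply stacks one such Series per group into rows.
    return pd.Series({"best": chunk["Points"].max(), "n": len(chunk)})

grouped = df.groupby("Team")[["Rank", "Points"]]

ranges = grouped.apply(point_range)   # scalar per group -> Series
table = grouped.apply(summary)        # Series per group -> DataFrame

print(ranges)
print(table)
```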
