python - How to apply "first" and "last" functions to columns while using group by in pandas?

Question

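I have a data frame and I would like to group it by a particular column (or, in other words, by the values in a particular column). I can do it in the following way: grouped = df.groupby(['ColumnName']).

I imagine the result of this operation as a table in which some cells can contain a set of values instead of a single value. To get an ordinary table (i.e. a table in which every cell contains only one value), I need to indicate which function should be used to turn the set of values in a cell into a single value.

For example, I can replace a set of values with its sum, or with its minimum or maximum. I can do it in the following way: grouped.sum() or grouped.min() and so on.

Now I want to use different functions for different columns. I figured out that I can do it in the following way: grouped.agg({'ColumnName1':sum, 'ColumnName2':min}).

However, for some reason I cannot use first. In more detail, grouped.first() works, but grouped.agg({'ColumnName1':first, 'ColumnName2':first}) does not. As a result I get a NameError: name 'first' is not defined. So my question is: why does this happen, and how can I resolve it?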

Added

Here I found the following example:

grouped['D'].agg({'result1' : np.sum, 'result2' : np.mean})

Maybe I also need to use np? But in my case Python does not recognize "np". Should I import it?

score 58 · Accepted Answer

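I think the issue is that there are two different first methods which share a name but behave differently: one is for groupby objects and the other is for a Series/DataFrame (and has to do with time series).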

To replicate the behaviour of the groupby first method over a DataFrame using agg you could use iloc[0] (which gets the first row in each group (DataFrame/Series) by index):

grouped.agg(lambda x: x.iloc[0])

For example:

In [1]: df = pd.DataFrame([[1, 2], [3, 4]])

In [2]: g = df.groupby(0)

In [3]: g.first()
Out[3]: 
   1
0   
1  2
3  4

In [4]: g.agg(lambda x: x.iloc[0])
Out[4]: 
   1
0   
1  2
3  4

Analogously you can replicate last using iloc[-1].

Note: This also works column-wise:

g.agg({1: lambda x: x.iloc[0]})
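As a sketch of the iloc[-1] variant mentioned above, the snippet below compares it with the groupby last method, both for the whole frame and for a single column. The frame and the column names key and val are made up for illustration.

```python
import pandas as pd

# Hypothetical data: two groups with two rows each.
df = pd.DataFrame({"key": ["a", "a", "b", "b"], "val": [1, 2, 3, 4]})
g = df.groupby("key")

# Built-in groupby method.
last_builtin = g.last()

# Same result via agg: take the last row of each group by position.
last_iloc = g.agg(lambda x: x.iloc[-1])

# Column-wise form: only aggregate the "val" column.
last_col = g.agg({"val": lambda x: x.iloc[-1]})

print(last_iloc)
```

With no missing values in the data, all three results should agree.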

In older versions of pandas you could use the irow method (e.g. x.irow(0)); see previous edits.

A couple of updated notes:

This is better done using the nth groupby method, which is much faster in pandas >= 0.13:

g.nth(0)  # first
g.nth(-1)  # last

You have to take care a little, as the default behaviour for first and last ignores NaN rows... and IIRC for DataFrame groupbys it was broken pre-0.13... there's a dropna option for nth.
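To illustrate the NaN caveat, here is a minimal sketch with made-up data in which the first row of one group is missing. It only compares first with the positional iloc[0] approach; nth is left out because its return shape differs between pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the first "val" in group "a" is NaN.
df = pd.DataFrame({"key": ["a", "a", "b"], "val": [np.nan, 1.0, 2.0]})
g = df.groupby("key")

# first() returns the first non-null value in each group.
first_skipna = g["val"].first()

# iloc[0] returns whatever is in the first row, NaN included.
first_row = g["val"].agg(lambda x: x.iloc[0])

print(first_skipna)
print(first_row)
```

So if "first" should mean "first row, whatever it contains", the positional form is the safer choice.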

You can use the strings rather than built-ins (though IIRC pandas spots it's the sum builtin and applies np.sum):

grouped['D'].agg({'result1' : "sum", 'result2' : "mean"})
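Applied to the original question, the string names can be passed per column to agg on the grouped frame. This is a minimal sketch; the data is invented and only the column names are taken from the question.

```python
import pandas as pd

# Hypothetical data using the column names from the question.
df = pd.DataFrame({
    "ColumnName": ["a", "a", "b", "b"],
    "ColumnName1": [1, 2, 3, 4],
    "ColumnName2": ["w", "x", "y", "z"],
})
grouped = df.groupby("ColumnName")

# Strings are looked up as groupby methods, so no bare name `first` is needed.
result = grouped.agg({"ColumnName1": "first", "ColumnName2": "last"})

print(result)
```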

score 30 · Accepted Answer

Instead of using the bare names first or last, pass their string names to the agg method. For example, in the OP's case:

grouped = df.groupby(['ColumnName'])
grouped['D'].agg({'result1' : np.sum, 'result2' : np.mean})

# you can use the string names for first and last
grouped['D'].agg({'result1' : 'first', 'result2' : 'last'})
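One caveat, as far as I know: passing a dict of output names to agg on a single selected column was deprecated and is rejected by recent pandas versions (1.0 and later). A sketch of the same idea using keyword-style named aggregation, which assumes pandas 0.25 or newer and uses invented data:

```python
import pandas as pd

# Hypothetical data; "D" is the column being aggregated.
df = pd.DataFrame({"ColumnName": ["a", "a", "b"], "D": [10, 20, 30]})
grouped = df.groupby(["ColumnName"])

# Each keyword becomes an output column; the value names the aggregation.
result = grouped["D"].agg(result1="first", result2="last")

print(result)
```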

score 0 · Accepted Answer

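I'm not sure if this is really the issue, but sum and min are Python built-ins that take an iterable as input, whereas first is a method of the pandas Series object, so it is probably not in your namespace. Moreover, it takes something else as input (the docs say an offset value).

I guess one way to get around it is to create your own first function and define it so that it takes a Series object as input, e.g.: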

def first(Series, offset):
    return Series.first(offset)

or something like that.
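The namespace point can be checked directly. In this sketch (invented data), sum resolves because it is a built-in, the bare name first does not, and the string "first" works because pandas looks it up on the groupby object.

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1, 2, 3]})
grouped = df.groupby("key")

# `sum` is a Python built-in, so this name always exists.
builtin_is_callable = callable(sum)

# `first` is not a built-in; evaluating the dict raises NameError
# before pandas is even called.
try:
    grouped.agg({"val": first})  # noqa: F821
except NameError as exc:
    message = str(exc)

# The string is resolved by pandas itself.
result = grouped.agg({"val": "first"})

print(message)
```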

score 0 · Accepted Answer

I would use a custom aggregator as shown below.

d = pd.DataFrame([[1,"man"], [1, "woman"], [1, "girl"], [2,"man"], [2, "woman"]],columns = 'number family'.split())
d

Here is the output:

    number family
 0       1    man
 1       1  woman
 2       1   girl
 3       2    man
 4       2  woman

Now the aggregation, taking the first and last elements of each group:

d.groupby(by = "number").agg(firstFamily= ('family', lambda x: list(x)[0]), lastFamily =('family', lambda x: list(x)[-1]))

The output of this aggregation is shown below.

       firstFamily lastFamily
number                       
1              man       girl
2              man      woman
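For comparison, the same table can probably be produced without lambdas by passing the string names in the (column, function) tuples. This is a sketch assuming pandas 0.25 or newer; note that the string versions skip missing values, while list(x)[0] does not.

```python
import pandas as pd

d = pd.DataFrame(
    [[1, "man"], [1, "woman"], [1, "girl"], [2, "man"], [2, "woman"]],
    columns=["number", "family"],
)

# Named aggregation: output_name=(input_column, aggregation_name).
result = d.groupby(by="number").agg(
    firstFamily=("family", "first"),
    lastFamily=("family", "last"),
)

print(result)
```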

I hope this helps.

score -3 · Accepted Answer

# b_df is assumed to be an existing DataFrame with columns 'time', 'x' and 'y'
c_df = b_df.groupby('time').agg(first_x=('x', lambda x: list(x)[0]),
                                last_x=('x', lambda x: list(x)[-1]),
                                last_y=('y', lambda x: list(x)[-1]))
