python-polars - Filter a DataFrame using a within-group expression

Question

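Suppose I already have a predicate expression. How do I filter with that predicate, but apply it only within groups? For example, the predicate might be to keep all rows equal to the maximum value within a group. (If there are ties, multiple rows may be kept in a group.)

Based on my dplyr experience, I thought I could just call .groupby and then .filter, but this does not work.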

import polars as pl
df = pl.DataFrame(dict(x=[0, 0, 1, 1], y=[1, 2, 3, 3]))
expression = pl.col("y") == pl.col("y").max()

df.groupby("x").filter(expression)
# AttributeError: 'GroupBy' object has no attribute 'filter'

Then I thought I could apply .over to the expression, but this does not work either.

import polars as pl
df = pl.DataFrame(dict(x=[0, 0, 1, 1], y=[1, 2, 3, 3]))
expression = pl.col("y") == pl.col("y").max()

df.filter(expression.over("x"))
# RuntimeError: Any(ComputeError("this binary expression is not an aggregation:
# [(col(\"y\")) == (col(\"y\").max())]
# pherhaps you should add an aggregation like, '.sum()', '.min()', '.mean()', etc.
# if you really want to collect this binary expression, use `.list()`"))

For this specific problem, I can call .over on the max, but I don't know how to apply this to an arbitrary predicate that I don't control.

import polars as pl
df = pl.DataFrame(dict(x=[0, 0, 1, 1], y=[1, 2, 3, 3]))
expression = pl.col("y") == pl.col("y").max().over("x")
df.filter(expression)
# shape: (3, 2)
# ┌─────┬─────┐
# │ x   ┆ y   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 0   ┆ 2   │
# ├╌╌╌╌╌┼╌╌╌╌╌┤
# │ 1   ┆ 3   │
# ├╌╌╌╌╌┼╌╌╌╌╌┤
# │ 1   ┆ 3   │
# └─────┴─────┘

score 1 · Accepted Answer

If you update to polars>=0.13.0, your second attempt will work. :)

import polars as pl

df = pl.DataFrame(dict(
    x=[0, 0, 1, 1],
    y=[1, 2, 3, 3])
)

# Keep rows whose y equals the maximum y within each group of x.
df.filter(pl.col("y") == pl.max("y").over("x"))

shape: (3, 2)
┌─────┬─────┐
│ x   ┆ y   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 0   ┆ 2   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 1   ┆ 3   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 1   ┆ 3   │
└─────┴─────┘
