python-polars - Lazy filter that depends on the previous row (Polars Python)

Question

I am using Python Polars and I have a table like this:

Column1	Column2
id1	1
id1	1
id1	2
id1	1
id1	1
id1	2
id1	3

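Using the Polars lazy API, I would like to keep only the rows where the Column2 value differs from the Column2 value of the previous row. The result after the operation would be:

Column1	Column2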
id1	1
id1	2
id1	1
id1	2
id1	3

Thanks!

score 1 · Accepted Answer

Use the shift expression.

import polars as pl

df = pl.DataFrame(
    {"Column1": ["id1"] * 7, "Column2": [1, 1, 2, 1, 1, 2, 3]}).lazy()

df.filter(pl.col("Column2") != pl.col("Column2").shift(periods=1)).collect()

shape: (5, 2)
┌─────────┬─────────┐
│ Column1 ┆ Column2 │
│ ---     ┆ ---     │
│ str     ┆ i64     │
╞═════════╪═════════╡
│ id1     ┆ 1       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ id1     ┆ 2       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ id1     ┆ 1       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ id1     ┆ 2       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ id1     ┆ 3       │
└─────────┴─────────┘
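
To make the logic of that filter concrete, here is a plain-Python sketch of what shift combined with != computes on the question's data. It is not Polars code and not how Polars implements it. The helper name keep_changes is made up for illustration, and the sketch relies on Python's None comparing unequal to every integer, which matches the output shown above.

```python
def keep_changes(values):
    """Keep each value that differs from the one directly before it."""
    values = list(values)
    # Equivalent of shift(periods=1): move everything down one slot, pad with None.
    previous = [None] + values[:-1]
    return [cur for cur, prev in zip(values, previous) if cur != prev]


column2 = [1, 1, 2, 1, 1, 2, 3]
print(keep_changes(column2))  # [1, 2, 1, 2, 3]
```

The first element survives only because its "previous" slot holds the padding value, which is exactly the point the later answer examines in detail.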

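You can find the documentation for the options here: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.Expr.shift.html#polars.Expr.shift

Note that the periods parameter controls both the direction of the shift (via its sign) and the number of rows to shift by.

There is also a variant, shift_and_fill, which fills the None values created by the shift.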
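
As a rough illustration of these two notes, the following plain-Python sketch mimics the behavior described: a positive periods moves values toward the end, a negative one moves them toward the start, and the vacated slots receive a fill value. The function below is a hypothetical stand-in written for this example, not the Polars implementation.

```python
def shift(values, periods=1, fill_value=None):
    """Shift a list by `periods` slots, padding vacated slots with `fill_value`."""
    values = list(values)
    n = len(values)
    if periods >= 0:
        k = min(periods, n)          # never pad more slots than exist
        return [fill_value] * k + values[: n - k]
    k = min(-periods, n)
    return values[k:] + [fill_value] * k


data = [1, 2, 3, 4]
print(shift(data, 1))                  # [None, 1, 2, 3]
print(shift(data, -1))                 # [2, 3, 4, None]
print(shift(data, 1, fill_value=-1))   # [-1, 1, 2, 3]
```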

score 1 · Accepted Answer

Let me elaborate on how shift and shift_and_fill work. Choosing between them comes down to strategy (and knowing your data).

Using shift

Let's start with this dataset:

import polars as pl
df = pl.DataFrame({"row_num": range(1, 8),
                   "Column2": [1, 2, 3, 3, 4, 5, 4]}).lazy()
df.collect()

shape: (7, 2)
┌─────────┬─────────┐
│ row_num ┆ Column2 │
│ ---     ┆ ---     │
│ i64     ┆ i64     │
╞═════════╪═════════╡
│ 1       ┆ 1       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 2       ┆ 2       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 3       ┆ 3       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 4       ┆ 3       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 5       ┆ 4       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 6       ┆ 5       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 7       ┆ 4       │
└─────────┴─────────┘

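Now, let's create intermediate columns to see how these functions work.

# with_column is the API of the Polars version used here; newer releases use with_columns.
(df
    # Previous row's value, with null in the first row
    .with_column(pl.col("Column2").shift().alias("Column2_shifted"))
    # True where the value differs from the previous row
    .with_column((pl.col("Column2") != pl.col("Column2_shifted")).alias("not_eq_result"))
).collect()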

shape: (7, 4)
┌─────────┬─────────┬─────────────────┬───────────────┐
│ row_num ┆ Column2 ┆ Column2_shifted ┆ not_eq_result │
│ ---     ┆ ---     ┆ ---             ┆ ---           │
│ i64     ┆ i64     ┆ i64             ┆ bool          │
╞═════════╪═════════╪═════════════════╪═══════════════╡
│ 1       ┆ 1       ┆ null            ┆ true          │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2       ┆ 2       ┆ 1               ┆ true          │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3       ┆ 3       ┆ 2               ┆ true          │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4       ┆ 3       ┆ 3               ┆ false         │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 5       ┆ 4       ┆ 3               ┆ true          │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 6       ┆ 5       ┆ 4               ┆ true          │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 7       ┆ 4       ┆ 5               ┆ true          │
└─────────┴─────────┴─────────────────┴───────────────┘

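Note that in the first row, Column2_shifted holds a null (really, None) value.

More importantly, the expression (pl.col("Column2") != pl.col("Column2_shifted")) evaluates to True for the first row. (This is the behavior of the Polars version used here. Releases that propagate null through comparisons yield null for that row instead, and there the null-aware ne_missing comparison gives the result shown.)

So, as long as null values are not allowed in Column2, the first row will be included. You do not need to concatenate the first row of the dataset to the result separately.

Note: in practice, you do not need these intermediate columns. You can simply use .filter(pl.col("Column2") != pl.col("Column2").shift()). The intermediate columns are for illustration only.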

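Using shift_and_fill

If None/null values are allowed in Column2, you can instead use shift_and_fill and choose a fill_value that is not allowed in Column2.

For example, if you know that negative numbers are not allowed in Column2, you can use this logic.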

(df
    .with_column(pl.col("Column2").shift_and_fill(periods=1, fill_value=-1).alias("Column2_shifted"))
    .with_column((pl.col("Column2") != pl.col("Column2_shifted")).alias("not_eq_result"))
).collect()

shape: (7, 4)
┌─────────┬─────────┬─────────────────┬───────────────┐
│ row_num ┆ Column2 ┆ Column2_shifted ┆ not_eq_result │
│ ---     ┆ ---     ┆ ---             ┆ ---           │
│ i64     ┆ i64     ┆ i64             ┆ bool          │
╞═════════╪═════════╪═════════════════╪═══════════════╡
│ 1       ┆ 1       ┆ -1              ┆ true          │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2       ┆ 2       ┆ 1               ┆ true          │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3       ┆ 3       ┆ 2               ┆ true          │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4       ┆ 3       ┆ 3               ┆ false         │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 5       ┆ 4       ┆ 3               ┆ true          │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 6       ┆ 5       ┆ 4               ┆ true          │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 7       ┆ 4       ┆ 5               ┆ true          │
└─────────┴─────────┴─────────────────┴───────────────┘

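With this strategy, the first row is always included, without having to concatenate it to your result separately. That is because you deliberately chose a fill_value that can never match a value in Column2.

Adding is_first to the expression

If you are unsure which values are allowed in Column2 (even None), then I suggest adding is_first to your expression (rather than concatenating the first row to the result dataset):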
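
The effect of the fill value can be sketched in plain Python. The example below assumes a column that is allowed to start with None and contrasts padding with None against padding with a sentinel (-1) that the column never contains. The function name changed_mask and the sample data are invented for illustration, and Python's == semantics stand in for the comparison shown in the tables above.

```python
def changed_mask(values, fill_value):
    """True where a value differs from its predecessor; the first slot is compared to fill_value."""
    values = list(values)
    previous = [fill_value] + values[:-1]
    return [cur != prev for cur, prev in zip(values, previous)]


column2 = [None, None, 3, 3]

# Padding with None: the first row looks "unchanged" and would be filtered out.
print(changed_mask(column2, None))  # [False, False, True, False]

# Padding with a value that cannot occur in the column: the first row is kept.
print(changed_mask(column2, -1))    # [True, False, True, False]
```

The sentinel only works if it truly cannot appear in the data. If -1 were a legal value and happened to be first, the first row would be lost again.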

(df
    .with_column(pl.col("Column2").shift().alias("Column2_shifted"))
    .with_column((pl.col("Column2").is_first() | (pl.col("Column2") != pl.col("Column2_shifted"))).alias("not_eq_result"))
).collect()

shape: (7, 4)
┌─────────┬─────────┬─────────────────┬───────────────┐
│ row_num ┆ Column2 ┆ Column2_shifted ┆ not_eq_result │
│ ---     ┆ ---     ┆ ---             ┆ ---           │
│ i64     ┆ i64     ┆ i64             ┆ bool          │
╞═════════╪═════════╪═════════════════╪═══════════════╡
│ 1       ┆ 1       ┆ null            ┆ true          │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2       ┆ 2       ┆ 1               ┆ true          │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3       ┆ 3       ┆ 2               ┆ true          │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4       ┆ 3       ┆ 3               ┆ false         │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 5       ┆ 4       ┆ 3               ┆ true          │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 6       ┆ 5       ┆ 4               ┆ true          │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 7       ┆ 4       ┆ 5               ┆ true          │
└─────────┴─────────┴─────────────────┴───────────────┘

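This forces your first row to evaluate to True: is_first flags the first occurrence of each distinct value, and the first row is always a first occurrence. (Note the nested parentheses in the expression; without them you may not get the result you expect.)

Does this help clarify things?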
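
The parentheses matter because of Python's operator precedence rather than anything specific to Polars: | binds more tightly than !=, so a | b != c is parsed as (a | b) != c. The sketch below shows this with plain Python values and with the standard ast module. The variable names are illustrative only.

```python
import ast

is_first_row = True
current, previous = 3, 3

# Intended logic: "first row OR value changed".
with_parens = is_first_row | (current != previous)

# Without the inner parentheses Python evaluates (is_first_row | current) != previous.
without_parens = is_first_row | current != previous

print(with_parens, without_parens)  # True False

# The parse tree confirms it: the top-level node is the comparison,
# and its left operand is the | operation.
tree = ast.parse("a | b != c", mode="eval")
print(type(tree.body).__name__, type(tree.body.left).__name__)  # Compare BinOp
```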
