I have a dataset similar to this one:

exp pid mat pskey  order
1   CR  P   1-CR-P 1
1   M   C   1-M-C  2
1   CR  C   1-CR-C 3
1   PP  C   1-PP-C 4
2   CR  P   2-CR-P 1
2   CR  P   2-CR-P 1
2   M   C   2-M-C  2
2   CR  C   2-CR-C 3
2   CR  C   2-CR-C 3
2   CR  C   2-CR-C 3
2   CR  C   2-CR-C 3
2   CR  C   2-CR-C 3
2   PP  C   2-PP-C 4
2   PP  C   2-PP-C 4
2   PP  C   2-PP-C 4
2   PP  C   2-PP-C 4
2   PP  C   2-PP-C 4
3   M   C   3-M-C  2
4   CR  P   4-CR-P 1
4   M   C   4-M-C  2
4   CR  C   4-CR-C 3
4   PP  C   4-PP-C 4

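What I need is to get the pskey of the predecessors within the same exp, given the following relationships:

order 1 -> no predecessor

order 2 -> no predecessor

order 3 -> [1,2]

order 4 -> [3]

and add these values to a new column called predecessor.

The expected result would be the following: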

+---+---+---+------+-----+----------------------------------------+
|exp|pid|mat|pskey |order|predecessor                             |
+---+---+---+------+-----+----------------------------------------+
|1  |CR |P  |1-CR-P|1    |null                                    |
|1  |M  |C  |1-M-C |2    |null                                    |
|1  |CR |C  |1-CR-C|3    |[1-CR-P, 1-M-C ]                        |
|1  |PP |C  |1-PP-C|4    |[1-CR-C]                                |
|3  |M  |C  |3-M-C |2    |null                                    |
|2  |CR |P  |2-CR-P|1    |null                                    |
|2  |CR |P  |2-CR-P|1    |null                                    |
|2  |M  |C  |2-M-C |2    |null                                    |
|2  |CR |C  |2-CR-C|3    |[2-CR-P, 2-M-C]                         |
|2  |CR |C  |2-CR-C|3    |[2-CR-P, 2-M-C]                         |
|2  |CR |C  |2-CR-C|3    |[2-CR-P, 2-M-C]                         |
|2  |CR |C  |2-CR-C|3    |[2-CR-P, 2-M-C]                         |
|2  |CR |C  |2-CR-C|3    |[2-CR-P, 2-M-C]                         |
|2  |PP |C  |2-PP-C|4    |[2-CR-C]                                |
|2  |PP |C  |2-PP-C|4    |[2-CR-C]                                |
|2  |PP |C  |2-PP-C|4    |[2-CR-C]                                |
|2  |PP |C  |2-PP-C|4    |[2-CR-C]                                |
|2  |PP |C  |2-PP-C|4    |[2-CR-C]                                |
|4  |CR |P  |4-CR-P|1    |null                                    |
|4  |M  |C  |4-M-C |2    |null                                    |
|4  |CR |C  |4-CR-C|3    |[4-CR-P, 4-M-C]                         |
|4  |PP |C  |4-PP-C|4    |[4-CR-C]                                |
+---+---+---+------+-----+----------------------------------------+

I'm very new to pyspark, so I don't know how to approach this.

1 Answer

Handle the different cases of order using when. You can aggregate the values with collect_set over a window to get the distinct identifiers:

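from pyspark.sql import functions as F, Window

# The range frames below are offsets on the "order" value within the same exp.
df2 = df.withColumn(
    "predecessor",
    F.when(
        F.col("order") == 3,
        # order 3: collect distinct pskeys from rows with order 1 and 2
        F.collect_set(F.col("pskey")).over(
            Window.partitionBy("exp").orderBy("order").rangeBetween(-2, -1)
        ),
    ).when(
        F.col("order") == 4,
        # order 4: collect distinct pskeys from rows with order 3
        F.collect_set(F.col("pskey")).over(
            Window.partitionBy("exp").orderBy("order").rangeBetween(-1, -1)
        ),
    ),  # no otherwise(): orders 1 and 2 get null
)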

Result:

df2.show(truncate=False)
+---+---+---+------+-----+----------------+
|exp|pid|mat|pskey |order|predecessor     |
+---+---+---+------+-----+----------------+
|1  |CR |P  |1-CR-P|1    |null            |
|1  |M  |C  |1-M-C |2    |null            |
|1  |CR |C  |1-CR-C|3    |[1-CR-P, 1-M-C ]|
|1  |PP |C  |1-PP-C|4    |[1-CR-C]        |
|3  |M  |C  |3-M-C |2    |null            |
|2  |CR |P  |2-CR-P|1    |null            |
|2  |CR |P  |2-CR-P|1    |null            |
|2  |M  |C  |2-M-C |2    |null            |
|2  |CR |C  |2-CR-C|3    |[2-CR-P, 2-M-C ]|
|2  |CR |C  |2-CR-C|3    |[2-CR-P, 2-M-C ]|
|2  |CR |C  |2-CR-C|3    |[2-CR-P, 2-M-C ]|
|2  |CR |C  |2-CR-C|3    |[2-CR-P, 2-M-C ]|
|2  |CR |C  |2-CR-C|3    |[2-CR-P, 2-M-C ]|
|2  |PP |C  |2-PP-C|4    |[2-CR-C]        |
|2  |PP |C  |2-PP-C|4    |[2-CR-C]        |
|2  |PP |C  |2-PP-C|4    |[2-CR-C]        |
|2  |PP |C  |2-PP-C|4    |[2-CR-C]        |
|2  |PP |C  |2-PP-C|4    |[2-CR-C]        |
|4  |CR |P  |4-CR-P|1    |null            |
|4  |M  |C  |4-M-C |2    |null            |
+---+---+---+------+-----+----------------+
only showing top 20 rows
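
If you want to sanity-check the window logic on a small sample without starting Spark, the same rule can be sketched in plain Python. This is only an illustration, not part of the PySpark solution: the PREDECESSOR_ORDERS mapping and the add_predecessors helper are hypothetical names, and the mapping simply restates the relationships from the question (order 3 -> [1,2], order 4 -> [3]).

```python
from collections import defaultdict

# Hypothetical mapping taken from the question: order -> orders of its predecessors.
PREDECESSOR_ORDERS = {3: (1, 2), 4: (3,)}


def add_predecessors(rows):
    """Return copies of rows with a 'predecessor' key (sorted list or None)."""
    # Distinct pskeys seen for each (exp, order) pair, similar to collect_set.
    keys_by_group = defaultdict(set)
    for row in rows:
        keys_by_group[(row["exp"], row["order"])].add(row["pskey"])

    result = []
    for row in rows:
        wanted = PREDECESSOR_ORDERS.get(row["order"])
        if wanted is None:
            predecessor = None  # orders 1 and 2 have no predecessor
        else:
            found = set()
            for order in wanted:
                # Only look at rows that share the same exp.
                found |= keys_by_group.get((row["exp"], order), set())
            predecessor = sorted(found)
        result.append({**row, "predecessor": predecessor})
    return result


rows = [
    {"exp": 1, "pskey": "1-CR-P", "order": 1},
    {"exp": 1, "pskey": "1-M-C", "order": 2},
    {"exp": 1, "pskey": "1-CR-C", "order": 3},
    {"exp": 1, "pskey": "1-PP-C", "order": 4},
    {"exp": 3, "pskey": "3-M-C", "order": 2},
]
for row in add_predecessors(rows):
    print(row["pskey"], row["predecessor"])
```

Unlike collect_set, which does not guarantee element order, this sketch sorts the collected keys so the output is stable when comparing by eye.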
answered 2021-11-15T16:38:51.943