
I have a PySpark dataframe with two id columns, id and id2. Each id is repeated exactly n times, and all ids share the same set of id2 values. I'm trying to "flatten" the rows of each id according to id2.

Here is an example to illustrate what I'm trying to achieve. My dataframe looks like this:

+----+-----+--------+--------+
| id | id2 | value1 | value2 |
+----+-----+--------+--------+
| 1  | 1   | 54     | 2      |
+----+-----+--------+--------+
| 1  | 2   | 0      | 6      |
+----+-----+--------+--------+
| 1  | 3   | 578    | 14     |
+----+-----+--------+--------+
| 2  | 1   | 10     | 1      |
+----+-----+--------+--------+
| 2  | 2   | 6      | 32     |
+----+-----+--------+--------+
| 2  | 3   | 0      | 0      |
+----+-----+--------+--------+
| 3  | 1   | 12     | 2      |
+----+-----+--------+--------+
| 3  | 2   | 20     | 5      |
+----+-----+--------+--------+
| 3  | 3   | 63     | 22     |
+----+-----+--------+--------+

The desired output is the following table:

+----+----------+----------+----------+----------+----------+----------+
| id | value1_1 | value1_2 | value1_3 | value2_1 | value2_2 | value2_3 |
+----+----------+----------+----------+----------+----------+----------+
| 1  | 54       | 0        | 578      | 2        | 6        | 14       |
+----+----------+----------+----------+----------+----------+----------+
| 2  | 10       | 6        | 0        | 1        | 32       | 0        |
+----+----------+----------+----------+----------+----------+----------+
| 3  | 12       | 20       | 63       | 2        | 5        | 22       |
+----+----------+----------+----------+----------+----------+----------+

So, basically, for each unique id and each column col, I will have n new columns col_1, ..., col_n, one for each of the n id2 values.

Any help would be appreciated!


1 Answer


In Spark 2.4 you can do it like this (Scala):

val df3 = Seq(
  (1, 1, 54, 2), (1, 2, 0, 6), (1, 3, 578, 14),
  (2, 1, 10, 1), (2, 2, 6, 32), (2, 3, 0, 0),
  (3, 1, 12, 2), (3, 2, 20, 5), (3, 3, 63, 22)
).toDF("id", "id2", "value1", "value2")


scala> df3.show()
+---+---+------+------+
| id|id2|value1|value2|
+---+---+------+------+
|  1|  1|    54|     2|
|  1|  2|     0|     6|
|  1|  3|   578|    14|
|  2|  1|    10|     1|
|  2|  2|     6|    32|
|  2|  3|     0|     0|
|  3|  1|    12|     2|
|  3|  2|    20|     5|
|  3|  3|    63|    22|
+---+---+------+------+

Use coalesce with first to retrieve the first value for each id.

scala> import org.apache.spark.sql.functions.{coalesce, first, col}

scala> val df4 = df3.groupBy("id").pivot("id2").agg(coalesce(first("value1")), coalesce(first("value2"))).orderBy(col("id"))

scala> val newNames = Seq("id","value1_1","value2_1","value1_2","value2_2","value1_3","value2_3")

Rename the columns:

scala>  df4.toDF(newNames: _*).show()
+---+--------+--------+--------+--------+--------+--------+
| id|value1_1|value2_1|value1_2|value2_2|value1_3|value2_3|
+---+--------+--------+--------+--------+--------+--------+
|  1|      54|       2|       0|       6|     578|      14|
|  2|      10|       1|       6|      32|       0|       0|
|  3|      12|       2|      20|       5|      63|      22|
+---+--------+--------+--------+--------+--------+--------+

Rearrange the columns if needed. Let me know if you have any questions about this. Happy Hadooping!

Answered 2019-09-16T13:29:27.483