我有一个如下所示的数据框:
+----+----+----+
|colA|colB|colC|
+----+----+----+
|1 |1 |23 |
|1 |2 |63 |
|1 |3 |null|
|1 |4 |32 |
|2 |2 |56 |
+----+----+----+
我应用以下说明,以便在 C 列中创建一系列值:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
df.withColumn("colD",
collect_list("colC").over(Window.partitionBy("colA").orderBy("colB")))
结果是这样的,创建列 D 并包含列 C 的值作为序列,而它已删除null
值:
+----+----+----+------------+
|colA|colB|colC|colD |
+----+----+----+------------+
|1 |1 |23 |[23] |
|1 |2 |63 |[23, 63] |
|1 |3 |null|[23, 63] |
|1 |4 |32 |[23,63,32] |
|2 |2 |56 |[56] |
+----+----+----+------------+
但是,我想在新列中保留空值并得到以下结果:
+----+----+----+-----------------+
|colA|colB|colC|colD |
+----+----+----+-----------------+
|1 |1 |23 |[23] |
|1 |2 |63 |[23, 63] |
|1 |3 |null|[23, 63, null] |
|1 |4 |32 |[23,63,null, 32] |
|2 |2 |56 |[56] |
+----+----+----+-----------------+
如您所见,结果中仍然有null
值。你知道我该怎么做吗?