apache-spark - Convert null values to empty array in Spark DataFrame

Question

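I have a Spark DataFrame in which one column is an array of integers. The column is nullable because it comes from a left outer join. I want to convert all null values to an empty array so that I don't have to deal with nulls later.

I thought I could do it like this: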

val myCol = df("myCol")
df.withColumn( "myCol", when(myCol.isNull, Array[Int]()).otherwise(myCol) )

However, this results in the following exception:

java.lang.RuntimeException: Unsupported literal type class [I [I@5ed25612
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:49)
at org.apache.spark.sql.functions$.lit(functions.scala:89)
at org.apache.spark.sql.functions$.when(functions.scala:778)

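Apparently the when function does not support array types. Is there another simple way to convert the null values?

In case it is relevant, here is the schema of this column: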

|-- myCol: array (nullable = true)
|    |-- element: integer (containsNull = false)

score 30 · Accepted Answer

You can use a UDF:

import org.apache.spark.sql.functions.{coalesce, udf, when}

val array_ = udf(() => Array.empty[Int])

combined with WHEN or COALESCE:

df.withColumn("myCol", when(myCol.isNull, array_()).otherwise(myCol))
df.withColumn("myCol", coalesce(myCol, array_())).show

In recent versions you can use the array function:

import org.apache.spark.sql.functions.{array, coalesce, when}

df.withColumn("myCol", when(myCol.isNull, array().cast("array<integer>")).otherwise(myCol))
df.withColumn("myCol", coalesce(myCol, array().cast("array<integer>"))).show

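Note that this works only if a cast from string to the desired type is allowed.

Of course, the same thing can be done in PySpark. For the legacy solution you can define a udf: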

from pyspark.sql.functions import coalesce, udf
from pyspark.sql.types import ArrayType, IntegerType

def empty_array(t):
    # t is a DataType instance such as IntegerType(); it must not be called again
    return udf(lambda: [], ArrayType(t))()

coalesce(myCol, empty_array(IntegerType()))

In recent versions, simply use array:

from pyspark.sql.functions import array, coalesce

coalesce(myCol, array().cast("array<integer>"))

score 13 · Accepted Answer

With a slight modification to zero323's approach, I was able to do this without a udf in Spark 2.3.1.

val df = Seq("a" -> Array(1,2,3), "b" -> null, "c" -> Array(7,8,9)).toDF("id","numbers")
df.show
+---+---------+
| id|  numbers|
+---+---------+
|  a|[1, 2, 3]|
|  b|     null|
|  c|[7, 8, 9]|
+---+---------+

val df2 = df.withColumn("numbers", coalesce($"numbers", array()))
df2.show
+---+---------+
| id|  numbers|
+---+---------+
|  a|[1, 2, 3]|
|  b|       []|
|  c|[7, 8, 9]|
+---+---------+

score 4 · Accepted Answer

When the data type you want for the array elements cannot be cast from StringType, you can use the following UDF-free alternative:

import pyspark.sql.types as T
import pyspark.sql.functions as F

df.withColumn(
    "myCol",
    F.coalesce(
        F.col("myCol"),
        F.from_json(F.lit("[]"), T.ArrayType(T.IntegerType()))
    )
)

You can replace IntegerType() with any data type, including complex ones.
