scala - IncompatibleSchemaException: Unexpected type VectorUDT when serializing in Avro format

Question

I am using Spark MLlib to generate predictions for my data, and then storing them in HDFS in Avro format:

val dataPredictions = myModel.transform(myData)
val output = dataPredictions.select("id", "probability", "prediction")
output.write.format("com.databricks.spark.avro").save(path)

I get the following exception:

com.databricks.spark.avro.SchemaConverters$IncompatibleSchemaException:
    Unexpected type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7.

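My understanding is that the format of the "prediction" column cannot be serialized to Avro.

How can I convert a VectorUDT to an array so that it can be serialized to Avro?
Is there a better alternative (I cannot get rid of the Avro format)?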

score 2 · Accepted Answer

To convert any Vector to an Array[Double], you can use the following UDF:

import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.functions.col
import org.apache.spark.ml.linalg.Vector

val vectorToArrayUdf = udf((vector: Vector) => vector.toArray)

// The following will work
val output = dataPredictions
    .withColumn("probabilities", vectorToArrayUdf(col("probability")))
    .select("id", "probabilities", "prediction")

output.write.format("com.databricks.spark.avro").save(path)
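If you are on Spark 3.0 or later, the built-in vector_to_array function in org.apache.spark.ml.functions should do the same conversion without a hand-written UDF. The following is only a minimal sketch under that assumption: the object name VectorToArraySketch, the helper withProbabilityArray, and the sample rows are made up for illustration, and it prints the schema instead of writing Avro so it does not depend on an Avro package being on the classpath.

```scala
import org.apache.spark.ml.functions.vector_to_array
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object VectorToArraySketch {

  // Hypothetical helper: replaces the VectorUDT "probability" column with
  // an array<double> column named "probabilities".
  def withProbabilityArray(df: DataFrame): DataFrame =
    df.withColumn("probabilities", vector_to_array(col("probability")))
      .select("id", "probabilities", "prediction")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("vector-to-array-sketch")
      .getOrCreate()

    // Made-up rows standing in for the output of myModel.transform(myData).
    val dataPredictions = spark.createDataFrame(Seq(
      (1L, Vectors.dense(0.2, 0.8), 1.0),
      (2L, Vectors.sparse(2, Array(0), Array(1.0)), 0.0)
    )).toDF("id", "probability", "prediction")

    val output = withProbabilityArray(dataPredictions)

    // "probabilities" should now be an array of doubles rather than a vector.
    output.printSchema()

    spark.stop()
  }
}
```

The resulting DataFrame can then be written the same way as above. Both dense and sparse vectors are expanded to full arrays, so very wide sparse vectors will take more space once converted.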
