
I am testing the new from_avro and to_avro functions in Spark 2.4.0.

I create a DataFrame with a single column and three rows, serialize it with Avro, and then deserialize it back from Avro.

If the input dataset is created as

val input1 = Seq("foo", "bar", "baz").toDF("key")

+---+
|key|
+---+
|foo|
|bar|
|baz|
+---+

deserialization returns only N copies of the last row:

+---+
|key|
+---+
|baz|
|baz|
|baz|
+---+

If I instead create the input dataset as

val input2 = input1.sqlContext.createDataFrame(input1.rdd, input1.schema)

the result is correct:

+---+
|key|
+---+
|foo|
|bar|
|baz|
+---+

Sample code:

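import org.apache.spark.sql.avro.{SchemaConverters, from_avro, to_avro}
import org.apache.spark.sql.DataFrame
// Needed for .toDF and the $"..." syntax; spark-shell imports this automatically.
// Assumes a SparkSession named `spark` is in scope.
import spark.implicits._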

val input1 = Seq("foo", "bar", "baz").toDF("key")
val input2 = input1.sqlContext.createDataFrame(input1.rdd, input1.schema)

def test_avro(df: DataFrame): Unit = {
  println("input df:")
  df.printSchema()
  df.show()

  val keySchema = SchemaConverters.toAvroType(df.schema).toString
  println(s"avro schema: $keySchema")

  val avroDf = df
    .select(to_avro($"key") as "key")

  println("avro serialized:")
  avroDf.printSchema()
  avroDf.show()

  val output = avroDf
    .select(from_avro($"key", keySchema) as "key")
    .select("key.*")

  println("avro deserialized:")
  output.printSchema()
  output.show()
}

println("############### testing .toDF()")
test_avro(input1)
println("############### testing .createDataFrame()")
test_avro(input2)

Result:

############### testing .toDF()
input df:
root
 |-- key: string (nullable = true)

+---+
|key|
+---+
|foo|
|bar|
|baz|
+---+

avro schema: {"type":"record","name":"topLevelRecord","fields":[{"name":"key","type":["string","null"]}]}
avro serialized:
root
 |-- key: binary (nullable = true)

+----------------+
|             key|
+----------------+
|[00 06 66 6F 6F]|
|[00 06 62 61 72]|
|[00 06 62 61 7A]|
+----------------+

avro deserialized:
root
 |-- key: string (nullable = true)

+---+
|key|
+---+
|baz|
|baz|
|baz|
+---+

############### testing .createDataFrame()
input df:
root
 |-- key: string (nullable = true)

+---+
|key|
+---+
|foo|
|bar|
|baz|
+---+

avro schema: {"type":"record","name":"topLevelRecord","fields":[{"name":"key","type":["string","null"]}]}
avro serialized:
root
 |-- key: binary (nullable = true)

+----------------+
|             key|
+----------------+
|[00 06 66 6F 6F]|
|[00 06 62 61 72]|
|[00 06 62 61 7A]|
+----------------+

avro deserialized:
root
 |-- key: string (nullable = true)

+---+
|key|
+---+
|foo|
|bar|
|baz|
+---+

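From this test, the problem appears to be in the deserialization phase, because printing the Avro-serialized DataFrame shows three distinct rows.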
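As a sanity check on that reading, the serialized bytes printed above can be decoded by hand. The following is only a minimal sketch in plain Scala (no Spark or Avro library involved): the object name AvroBytesDemo and its helper methods are made up for illustration, and it assumes the standard Avro binary encoding, where a union value starts with a zigzag varint branch index and a string is a zigzag varint length followed by UTF-8 bytes.

```scala
import java.nio.charset.StandardCharsets

// Hypothetical helper, not part of Spark or Avro: decodes one value of the
// union type ["string","null"] from its Avro binary encoding.
object AvroBytesDemo {
  // Read a zigzag-encoded varint (Avro's int/long encoding) starting at `pos`.
  // Returns the decoded value and the position of the next unread byte.
  def readLong(bytes: Array[Byte], pos: Int): (Long, Int) = {
    var result = 0L
    var shift = 0
    var i = pos
    var more = true
    while (more) {
      val b = bytes(i) & 0xFF
      result |= (b & 0x7FL) << shift
      shift += 7
      i += 1
      more = (b & 0x80) != 0 // high bit set means another byte follows
    }
    // Undo zigzag: 0 -> 0, 1 -> -1, 2 -> 1, 3 -> -2, ...
    ((result >>> 1) ^ -(result & 1), i)
  }

  // Branch 0 is "string", branch 1 is "null" (which carries no payload).
  def decodeNullableString(bytes: Array[Byte]): Option[String] = {
    val (branch, afterBranch) = readLong(bytes, 0)
    if (branch == 0L) {
      val (len, start) = readLong(bytes, afterBranch)
      Some(new String(bytes, start, len.toInt, StandardCharsets.UTF_8))
    } else None
  }

  // Parse a space-separated hex dump such as "00 06 66 6F 6F".
  def hex(s: String): Array[Byte] =
    s.split(" ").map(h => Integer.parseInt(h, 16).toByte)

  def main(args: Array[String]): Unit = {
    // The three binary values shown in the "avro serialized" output.
    Seq("00 06 66 6F 6F", "00 06 62 61 72", "00 06 62 61 7A")
      .map(hex)
      .map(decodeNullableString)
      .foreach(println)
  }
}
```

Run as a standalone program, this decodes the three values to foo, bar and baz respectively, which is consistent with serialization being fine and the rows only collapsing after from_avro.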

Am I doing something wrong, or is this a bug?


1 Answer


It appears to be a bug. I filed a bug report, and it has now been fixed in the 2.3 and 2.4 branches.

answered 2019-06-21T22:39:06.553