apache-spark - SparkSQL with ScalaPB: error when skipping a proto field while converting from a DataFrame to a proto Dataset

Question

I have the following proto message, which I need to write through Spark using ScalaPB:

message EnforcementData
{
  required int32 id = 1;
  required int32 source = 2;
  required int32 flagsEnforceOption = 4;
  required int32 categoryEnforceOption = 5;

  optional TypeA a= 100;
  optional TypeB b= 101;
}

TypeA and TypeB are subclasses of EnforcementData on the receiver side, which uses protobuf-net to deserialize.

Now, my Spark DataFrame can contain either column a or column b. Suppose df is my DataFrame; I call:

df.withColumn(b, null).as[EnforcementData].map(_.toByteArray) for TypeA messages
df.withColumn(a, null).as[EnforcementData].map(_.toByteArray) for TypeB messages

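However, the receiver, which deserializes the message using protobuf-net, throws a StackOverflow exception. I also tried passing a dummy case class instead of null, but that does not seem to work either.

How should I handle this?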

score 0 · Accepted Answer

I was able to solve this by rebuilding the case class and explicitly skipping the optional subclass field, i.e.:

 // for TypeA messages: keep a, leave b unset

 df.withColumn(b, null)
   .as[EnforcementData]
   .map { case EnforcementData(id, source, flag, cat, a, _) => EnforcementData(id, source, flag, cat, a = a) }

 // for TypeB messages: keep b, leave a unset

 df.withColumn(a, null)
   .as[EnforcementData]
   .map { case EnforcementData(id, source, flag, cat, _, b) => EnforcementData(id, source, flag, cat, b = b) }
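
The rebuild step can be tried without Spark. Below is a minimal sketch that uses hand-written stand-ins for the ScalaPB-generated classes. It assumes the generated EnforcementData has the four required fields followed by a: Option[TypeA] and b: Option[TypeB], both defaulting to None, which is the shape the pattern match in the answer implies. The value field on TypeA and TypeB and the helper names keepA and keepB are hypothetical, introduced only for illustration.

```scala
// Stand-ins for the ScalaPB-generated classes (assumption: the real ones are
// generated from the .proto file and may carry extra members).
final case class TypeA(value: Int = 0)
final case class TypeB(value: Int = 0)

final case class EnforcementData(
    id: Int,
    source: Int,
    flagsEnforceOption: Int,
    categoryEnforceOption: Int,
    a: Option[TypeA] = None,
    b: Option[TypeB] = None
)

object RebuildSketch {
  // Copy the required fields and `a`; `b` is not passed, so it takes its default, None.
  def keepA(m: EnforcementData): EnforcementData = m match {
    case EnforcementData(id, source, flag, cat, a, _) =>
      EnforcementData(id, source, flag, cat, a = a)
  }

  // Copy the required fields and `b`; `a` is not passed, so it takes its default, None.
  def keepB(m: EnforcementData): EnforcementData = m match {
    case EnforcementData(id, source, flag, cat, _, b) =>
      EnforcementData(id, source, flag, cat, b = b)
  }

  def main(args: Array[String]): Unit = {
    // Simulate a decoded row in which both optional fields ended up populated.
    val decoded = EnforcementData(1, 2, 3, 4, Some(TypeA(7)), Some(TypeB()))

    val onlyA = keepA(decoded)
    assert(onlyA.a.contains(TypeA(7)))
    assert(onlyA.b.isEmpty)

    val onlyB = keepB(decoded)
    assert(onlyB.a.isEmpty)
    assert(onlyB.b.contains(TypeB()))

    println(onlyA)
    println(onlyB)
  }
}
```

Whatever the Dataset decoder puts in the skipped field, the rebuilt message holds None there, and ScalaPB does not write an optional message field that is None when serializing. The answer does not say why this stops the protobuf-net failure, so treat the sketch as an illustration of the workaround rather than a diagnosis.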
