
I have been tasked with outputting a Pyspark Dataframe into cap'n proto (.capnp) format. Does anyone have a suggestion for the best way to do this?

I have a capnp schema, and I have seen the python wrapper for capnp (http://capnproto.github.io/pycapnp/), but I'm still not sure what's the best way to go from dataframe to capnp.


1 Answer


The easiest way is to use the underlying RDD: use mapPartitions to serialize each partition into a single byte array, then join the results with collect() (or, if the dataframe is large, iterate with toLocalIterator or save the partitions to disk). See this example pseudocode:

create = your_serialization_method  # builds a capnp message from one row
# Serialize each partition into one byte string (a one-element iterator,
# so mapPartitions yields exactly one blob per partition):
serialize_partition = lambda partition: [b''.join(create(row).to_bytes() for row in partition)]
output = b''.join(df.rdd.mapPartitions(serialize_partition).collect())
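To see the pattern without needing a Spark cluster or a compiled capnp schema, here is a minimal runnable sketch of the same partition-serialization idea. `struct.pack` stands in for building a capnp message and calling its `to_bytes()` (in real use, `create_bytes` would do something like `schema.Row.new_message(**fields)` with pycapnp), and a plain list of lists stands in for the RDD's partitions:

```python
import struct

# Stand-in for pycapnp: in real use this would build a capnp message
# from the record and return message.to_bytes(). Here we just pack
# two little-endian ints so the example runs without dependencies.
def create_bytes(record):
    return struct.pack("<ii", record[0], record[1])

def serialize_partition(partition):
    # Yield the whole partition as a single blob, mirroring the
    # one-element list returned by the mapPartitions lambda above.
    yield b"".join(create_bytes(rec) for rec in partition)

# Simulate df.rdd.mapPartitions(serialize_partition).collect()
# over two partitions of hypothetical (int, int) rows.
partitions = [[(1, 2), (3, 4)], [(5, 6)]]
output = b"".join(blob for part in partitions
                  for blob in serialize_partition(part))
```

Because each partition is serialized independently on its executor, only the already-compact byte blobs travel back to the driver, which is what makes this approach workable for larger dataframes.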
answered 2017-09-14T21:26:47.220