
I have been tasked with outputting a Pyspark Dataframe into cap'n proto (.capnp) format. Does anyone have a suggestion for the best way to do this?

I have a capnp schema, and I have seen the python wrapper for capnp (http://capnproto.github.io/pycapnp/), but I'm still not sure what's the best way to go from dataframe to capnp.


1 Answer


The easiest way is to use the underlying RDD: use mapPartitions to serialize each partition into a single byte array, then join the results with collect() (or, if the dataframe is large, iterate with toLocalIterator or save the partitions to disk). See this example pseudocode:

create = your_serialization_method  # builds a capnp message from one row
# Serialize each partition into one byte string (a one-element iterator,
# so mapPartitions yields exactly one blob per partition):
serialize_partition = lambda partition: [b''.join(create(row).to_bytes() for row in partition)]
output = b''.join(df.rdd.mapPartitions(serialize_partition).collect())
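To see the pattern without needing a Spark cluster or a compiled capnp schema, here is a minimal runnable sketch of the same partition-serialization idea. `struct.pack` stands in for building a capnp message and calling its `to_bytes()` (in real use, `create_bytes` would do something like `schema.Row.new_message(**fields)` with pycapnp), and a plain list of lists stands in for the RDD's partitions:

```python
import struct

# Stand-in for pycapnp: in real use this would build a capnp message
# from the record and return message.to_bytes(). Here we just pack
# two little-endian ints so the example runs without dependencies.
def create_bytes(record):
    return struct.pack("<ii", record[0], record[1])

def serialize_partition(partition):
    # Yield the whole partition as a single blob, mirroring the
    # one-element list returned by the mapPartitions lambda above.
    yield b"".join(create_bytes(rec) for rec in partition)

# Simulate df.rdd.mapPartitions(serialize_partition).collect()
# over two partitions of hypothetical (int, int) rows.
partitions = [[(1, 2), (3, 4)], [(5, 6)]]
output = b"".join(blob for part in partitions
                  for blob in serialize_partition(part))
```

Because each partition is serialized independently on its executor, only the already-compact byte blobs travel back to the driver, which is what makes this approach workable for larger dataframes.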
answered 2017-09-14T21:26:47.220