
How should I serialize a Dataset? Is there a way to create a binary file using an Encoder, or should I convert it to a DataFrame and then save it as Parquet?


2 Answers


How should I serialize a Dataset?

dataset.toDF().write.parquet("")

I believe this will preserve the schema that the Dataset uses automatically.
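A minimal round-trip sketch of this approach, assuming a modern SparkSession-based setup (the case class `Record` and the `/tmp/records.parquet` path are illustrative, not from the question):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type; any case class with an implicit Encoder works
case class Record(id: Long, label: String)

object ParquetRoundTrip {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("parquet-roundtrip")
      .getOrCreate()
    import spark.implicits._ // provides implicit Encoders for case classes

    val ds = Seq(Record(1L, "a"), Record(2L, "b")).toDS()

    // Write as Parquet; the file carries the schema derived from Record
    ds.toDF().write.mode("overwrite").parquet("/tmp/records.parquet")

    // Read back and reattach the Encoder to get a typed Dataset again
    val restored = spark.read.parquet("/tmp/records.parquet").as[Record]
    restored.show()

    spark.stop()
  }
}
```

Note that in Spark 2.x and later, `Dataset` exposes `write` directly, so the `toDF()` call is optional there; in 1.6 (the version discussed here), converting to a DataFrame first is the way to reach the writer.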

Is there a way to create a binary file using an Encoder?

Based on the source code of Encoder (as of 1.6.0), it is designed to convert an input data source into a Dataset (an InternalRow round-trip, to be precise, but that is a very low-level detail). The default implementations map each column of a DataFrame into a case class (for Scala), a tuple, or primitives in order to produce the Dataset.
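To make that mapping concrete, here is a small sketch of inspecting the schema an Encoder derives from a case class, using the `Encoders` factory (the `Person` class is a made-up example):

```scala
import org.apache.spark.sql.{Encoders, SparkSession}

// Hypothetical case class; any Product type can be encoded via Encoders.product
case class Person(name: String, age: Int)

object EncoderSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("encoder-sketch")
      .getOrCreate()
    import spark.implicits._

    // An Encoder describes how Person maps to Spark's internal row format
    val enc = Encoders.product[Person]
    println(enc.schema) // StructType with fields matching the case class

    // The same implicit machinery is what toDS() uses under the hood
    val ds = Seq(Person("a", 1), Person("b", 2)).toDS()
    ds.printSchema()

    spark.stop()
  }
}
```

This is why saving as Parquet "just works": the Encoder-derived schema is exactly what gets written into the Parquet file's metadata.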

answered 2016-01-13T05:46:18.387

I think you are using Java or Scala, right? PySpark doesn't support Dataset yet. In my experience, the best you can do is save your data as a Parquet file in HDFS, because I have noticed that the time required to read the file is reduced compared with other formats such as CSV.

Sorry for the digression, but I thought it was important. As you can see in the documentation of the Dataset class, there is no method for saving the data directly, so my suggestion is to use the toDF method of Dataset and then the write method of DataFrame. Alternatively, use the DataFrameWriter final class via its parquet method.

answered 2016-01-12T13:32:47.193