java - How to save a Spark dataset in an encrypted format?

Question

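I am saving my Spark dataset as a Parquet file on my local machine. I would like to know whether there is any way to encrypt the data with some encryption algorithm. The code I use to save the data as a Parquet file is shown below.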

dataset.write().mode("overwrite").parquet(parquetFile);

I have seen a similar question, but my case is different because I am writing to local disk.

score 3 · Accepted Answer

Since Spark 3.2, columnar encryption is supported for Parquet tables.

For example:

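// Assumes `spark` is the active SparkSession
org.apache.hadoop.conf.Configuration hadoopConfiguration =
   spark.sparkContext().hadoopConfiguration();

// Use the in-memory mock KMS client (for testing only)
hadoopConfiguration.set("parquet.encryption.kms.client.class" ,
   "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS");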

// Explicit master keys (base64 encoded) - required only for mock InMemoryKMS
hadoopConfiguration.set("parquet.encryption.key.list" ,
   "keyA:AAECAwQFBgcICQoLDA0ODw== ,  keyB:AAECAAECAAECAAECAAECAA==");

// Activate Parquet encryption, driven by Hadoop properties
hadoopConfiguration.set("parquet.crypto.factory.class" ,
   "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory");

// Write encrypted dataframe files. 
// Column "square" will be protected with master key "keyA".
// Parquet file footers will be protected with master key "keyB"
squaresDF.write().
   option("parquet.encryption.column.keys" , "keyA:square").
   option("parquet.encryption.footer.key" , "keyB").
   parquet("/path/to/table.parquet.encrypted");

// Read encrypted dataframe files
Dataset<Row> df2 = spark.read().parquet("/path/to/table.parquet.encrypted");

This is based on the following usage example: https://spark.apache.org/docs/3.2.0/sql-data-sources-parquet.html#columnar-encryption
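
The master keys passed to the mock InMemoryKMS in `parquet.encryption.key.list` are base64-encoded; the two sample keys each decode to 16 bytes. Below is a minimal sketch, using only the JDK, of how such keys could be generated and joined into a key list string. The class name `MasterKeyList` and its helper methods are hypothetical and not part of Spark or Parquet, and the mock KMS itself is meant for testing, not for production key management.

```java
import java.security.SecureRandom;
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Spark or Parquet.
final class MasterKeyList {
    // 16 bytes = 128 bits, the same length as the sample keys above.
    static final int KEY_LENGTH_BYTES = 16;

    // Generates one random key and returns it base64-encoded.
    public static String newBase64Key(SecureRandom rng) {
        byte[] key = new byte[KEY_LENGTH_BYTES];
        rng.nextBytes(key);
        return Base64.getEncoder().encodeToString(key);
    }

    // Joins key IDs and base64 keys as "id1:key1, id2:key2".
    public static String toKeyList(Map<String, String> keysById) {
        return keysById.entrySet().stream()
                .map(e -> e.getKey() + ":" + e.getValue())
                .collect(Collectors.joining(", "));
    }

    public static void main(String[] args) {
        SecureRandom rng = new SecureRandom();
        Map<String, String> keys = new LinkedHashMap<>();
        keys.put("keyA", newBase64Key(rng));
        keys.put("keyB", newBase64Key(rng));
        // Printed for demonstration only; real keys should not be logged.
        System.out.println(toKeyList(keys));
    }
}
```

The resulting string could then be set as the value of `parquet.encryption.key.list` in place of the hard-coded sample keys.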

score 1 · Accepted Answer

I don't think you can do it directly in Spark, but there are other projects you can put around Parquet, in particular Apache Arrow. I think this video explains how to do it:

https://databricks.com/session_na21/data-security-at-scale-through-spark-and-parquet-encryption

Update: since Spark 3.2.0 it seems to be possible.
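
With Spark 3.2.0 or later, one quick way to check that a file written to local disk really has an encrypted footer is to look at its first four bytes: the Parquet format uses the magic string `PARE` for files with an encrypted footer and `PAR1` otherwise. Below is a minimal sketch using only the JDK. The class and method names are hypothetical, and the path passed in should be an individual part file inside the output directory, not the directory itself.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Spark or Parquet.
final class ParquetMagicCheck {

    // Reads up to the first four bytes of the file as an ASCII string.
    public static String readMagic(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] magic = new byte[4];
            int n = in.readNBytes(magic, 0, magic.length); // Java 9+
            return new String(magic, 0, n, StandardCharsets.US_ASCII);
        }
    }

    // True if the file starts with the encrypted-footer magic "PARE".
    // Note: "PAR1" does not prove the absence of encryption, because
    // columns can be encrypted while the footer is left in plaintext.
    public static boolean hasEncryptedFooter(Path file) throws IOException {
        return "PARE".equals(readMagic(file));
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java ParquetMagicCheck <part-file>");
            return;
        }
        Path file = Paths.get(args[0]);
        System.out.println(file + " magic=" + readMagic(file)
                + " encryptedFooter=" + hasEncryptedFooter(file));
    }
}
```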
