google-cloud-dataproc - Cleaning up BigQueryInputFormat temporary files

Question

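I'm using BigQueryInputFormat in a Spark job to load data directly from BigQuery into an RDD. The documentation says you should clean up the temporary files with:

BigQueryInputFormat.cleanupJob(job)

But how do I do this from a Spark job, when "job" is a Hadoop job?

Thanks, Luke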

score 2 · Accepted Answer

Figured it out: you can set a custom temporary path that is unique to your Spark job, and delete that path when the job finishes:

hadoopConf.set(BigQueryConfiguration.TEMP_GCS_PATH_KEY, "gs://mybucket/hadoop/tmp/1234")

...

FileSystem.get(new Configuration()).delete(new Path(hadoopConf.get(BigQueryConfiguration.TEMP_GCS_PATH_KEY)), true)
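
The two lines above can be wrapped into one helper so the cleanup also runs when the job fails. This is only a minimal sketch, not part of the connector: `TempPathCleanup`, `uniqueTempPath` and `withTempPath` are hypothetical names, and the random UUID suffix is an assumed replacement for the hard-coded `1234`. The sketch also resolves the file system from the path itself with `Path.getFileSystem`, because `FileSystem.get(new Configuration())` returns the file system for `fs.defaultFS`, which may not be the one that owns a `gs://` path.

```scala
import java.util.UUID
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

object TempPathCleanup {
  // Hypothetical helper: builds a per-run directory name under baseDir.
  def uniqueTempPath(baseDir: String): String =
    s"${baseDir.stripSuffix("/")}/${UUID.randomUUID()}"

  // Hypothetical helper: stores a unique temp path under tempPathKey,
  // runs the body, then deletes the path even if the body throws.
  def withTempPath[T](hadoopConf: Configuration, tempPathKey: String, baseDir: String)(body: => T): T = {
    val tempPath = new Path(uniqueTempPath(baseDir))
    hadoopConf.set(tempPathKey, tempPath.toString)
    try body
    finally {
      // Resolve the file system from the path's scheme rather than fs.defaultFS.
      val fs = tempPath.getFileSystem(hadoopConf)
      fs.delete(tempPath, true) // true = delete recursively
    }
  }
}
```

It would be called as `TempPathCleanup.withTempPath(hadoopConf, BigQueryConfiguration.TEMP_GCS_PATH_KEY, "gs://mybucket/hadoop/tmp") { ... }`. Since RDDs are evaluated lazily, the body should include every Spark action that reads the BigQuery data; otherwise the temporary files could be deleted before they are read.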
