
I have a use case where I push data from MongoDB to HDFS as an ORC file. The job runs once a day and appends to the ORC data already existing on HDFS.

My concern is what happens if the job fails or is stopped while writing to the ORC file. How should I handle that scenario, given that some data has already been written? I want to avoid duplicates in the ORC output.

Snippet for writing in ORC format:

    import com.mongodb.spark.config.ReadConfig
    import com.mongodb.spark.sql._                 // MongoDB connector implicits (.mongo on DataFrameReader)
    import org.apache.spark.sql.SaveMode
    import sparkSession.implicits._                // for the $"column" syntax

    val df = sparkSession
      .read
      .mongo(ReadConfig(Map("database" -> "dbname", "collection" -> "tableName")))
      .filter($"insertdatetime" >= fromDateTime && $"insertdatetime" <= toDateTime)

    df.write
      .mode(SaveMode.Append)
      .format("orc")
      .save("/path_to_orc_file_on_hdfs")

I don't want to checkpoint the complete RDD, as that would be a very expensive operation. I also don't want to create multiple ORC files; the requirement is to maintain a single file only.

Is there any other solution or approach I should try?


1 Answer


Hi, one of the best approaches is to write the data into one folder per day under HDFS.

That way, if your ORC writing job fails, you will be able to clean up that day's folder.
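
A minimal sketch of what the per-day layout could look like, assuming the df from your question and a placeholder root path /path_to_orc_root_on_hdfs (the dt= folder naming is only an illustration):

    import java.time.LocalDate
    import org.apache.spark.sql.SaveMode

    // One sub-folder per run date, e.g. /path_to_orc_root_on_hdfs/dt=2020-02-04
    // (root path and folder naming are placeholders)
    val runDate   = LocalDate.now.toString
    val dailyPath = s"/path_to_orc_root_on_hdfs/dt=$runDate"

    df.write
      .mode(SaveMode.Append)
      .format("orc")
      .save(dailyPath)

Since each run writes to its own folder, a failed run can only ever leave partial data in that single folder.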

The cleanup should happen on the bash side of your job: if the return code is != 0, delete that day's ORC folder, then retry.
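
If you would rather keep the cleanup inside the Spark job instead of a bash wrapper, a rough alternative (just a sketch, reusing the dailyPath from above) is to catch the failure and drop the partial folder with Hadoop's FileSystem API:

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.sql.SaveMode
    import scala.util.control.NonFatal

    val fs = FileSystem.get(sparkSession.sparkContext.hadoopConfiguration)

    try {
      df.write
        .mode(SaveMode.Append)
        .format("orc")
        .save(dailyPath)
    } catch {
      case NonFatal(e) =>
        // Drop the partially written day folder so the next attempt starts clean,
        // then rethrow so the scheduler (or bash wrapper) can trigger the retry.
        fs.delete(new Path(dailyPath), true)
        throw e
    }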

Edit: partitioning by the write date will also make it more robust to read the ORC data later with Spark.
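
For example (a sketch only; insert_date is a hypothetical column derived from insertdatetime, and the root path is a placeholder):

    import org.apache.spark.sql.SaveMode
    import org.apache.spark.sql.functions.{col, to_date}

    // Spark writes one insert_date=YYYY-MM-DD sub-folder per day
    // and can prune those partitions when reading.
    df.withColumn("insert_date", to_date(col("insertdatetime")))
      .write
      .mode(SaveMode.Append)
      .partitionBy("insert_date")
      .format("orc")
      .save("/path_to_orc_root_on_hdfs")

Cleaning up a failed run then just means deleting that day's insert_date=... folder before retrying.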

answered 2020-02-04T15:04:24.007