I have a use case where I push data from MongoDB to HDFS as an ORC file. The job runs on a one-day interval and appends the data to the ORC file that already exists on HDFS.
My concern is: what if the job fails or is stopped partway through writing to the ORC file? How should I handle that scenario, given that some data may already have been written? I want to avoid duplicates in the ORC file.
Snippet for writing to the ORC file:
import org.apache.spark.sql.SaveMode
import com.mongodb.spark.config.ReadConfig
import com.mongodb.spark.sql._   // implicits that add .mongo(...) to DataFrameReader
import sparkSession.implicits._  // for the $"column" syntax

// Pull the day's batch from MongoDB, filtered on the insert timestamp.
val df = sparkSession
  .read
  .mongo(ReadConfig(Map("database" -> "dbname", "collection" -> "tableName")))
  .filter($"insertdatetime" >= fromDateTime && $"insertdatetime" <= toDateTime)

// Append the batch to the ORC data already on HDFS.
df.write
  .mode(SaveMode.Append)
  .format("orc")
  .save("/path_to_orc_file_on_hdfs")
I don't want to checkpoint the complete RDD, since that would be a very expensive operation. I also don't want to create multiple ORC files; the requirement is to maintain a single file only.
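The only workaround I have thought of so far is to rebuild the target on every run: read whatever is already under the ORC path, union it with the new batch, drop duplicates on a key, and write the result through a temporary path before swapping it in. A rough sketch of that idea (the "_id" key column, the temporary path, and the swap via FileSystem.rename are just my assumptions, not tested code):

import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical paths and key column, only to illustrate the idea.
val targetPath = "/path_to_orc_file_on_hdfs"
val tmpPath    = targetPath + "_tmp"

// Union the existing data with the new batch and de-duplicate on "_id",
// assuming "_id" uniquely identifies a MongoDB document.
val existing = sparkSession.read.format("orc").load(targetPath)
val combined = existing.unionByName(df).dropDuplicates("_id")

// Materialise the result under a temporary path first ...
combined.write.mode(SaveMode.Overwrite).format("orc").save(tmpPath)

// ... and only then swap it into place, so a failed run never touches the target.
val fs = FileSystem.get(sparkSession.sparkContext.hadoopConfiguration)
fs.delete(new Path(targetPath), true)
fs.rename(new Path(tmpPath), new Path(targetPath))

But this rewrites the whole dataset every day, which is roughly as expensive as the checkpointing I am trying to avoid, so I am not happy with it.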
Is there any other solution or approach I should try?