I am trying to simulate updates and deletes on a Hudi dataset and want to see the resulting state reflected in an Athena table. We use the AWS services EMR, S3, and Athena.

  1. Attempting Record Update with a withdrawal object
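from pyspark.sql.functions import col, lit

withdrawalID_mutate = 10382495
# Select the record to change and overwrite one of its fields
updateDF = final_df.filter(col("withdrawalID") == withdrawalID_mutate) \
    .withColumn("accountHolderName", lit("Hudi_Updated"))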
    
updateDF.write.format("hudi") \
    .options(**hudi_options) \
    .mode("append") \
    .save(tablePath) 
    
# Read the table back and show only the mutated record
hudiDF = spark.read \
    .format("hudi") \
    .load(tablePath)
hudiDF.filter(col("withdrawalID") == withdrawalID_mutate).show()

This shows the updated record, but in the Athena table the record is appended as a new row instead of replacing the old one. Could this have something to do with the Glue Catalog?
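The `hudi_options` dictionary is not shown in this post. For reference, here is a minimal sketch of what such a dictionary might contain, assuming a table keyed on `withdrawalID` with Hive sync enabled so that the Glue/Athena table is registered by Hudi itself. If the Athena table was registered some other way (for example as a plain Parquet table), it may read every file version and show updates as extra rows, which is why the sketch turns sync on. The helper name `build_hudi_options`, the table and database names, and the `updatedAt` and `withdrawalDate` columns are hypothetical; only `withdrawalID` comes from the code above.

```python
def build_hudi_options(table_name, database, record_key, precombine_field, partition_field):
    """Return a Hudi write-option dict with Hive/Glue sync enabled (illustrative values)."""
    return {
        "hoodie.table.name": table_name,
        # The record key decides which existing row an upsert replaces
        "hoodie.datasource.write.recordkey.field": record_key,
        # The precombine field picks the winner when two rows share a key
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.operation": "upsert",
        # Sync the table definition to the metastore (the Glue Catalog on EMR)
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.database": database,
        "hoodie.datasource.hive_sync.table": table_name,
        "hoodie.datasource.hive_sync.partition_fields": partition_field,
        "hoodie.datasource.hive_sync.partition_extractor_class":
            "org.apache.hudi.hive.MultiPartKeysValueExtractor",
        "hoodie.datasource.hive_sync.use_jdbc": "false",
    }


# Hypothetical names, for illustration only
hudi_options = build_hudi_options(
    table_name="withdrawals",
    database="default",
    record_key="withdrawalID",
    precombine_field="updatedAt",
    partition_field="withdrawalDate",
)
```

The updated row must land in the same partition path as the original for a non-global index to match it, so the partition column should not change between the two writes.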

  2. Attempting Record Delete
deleteDF = updateDF  # delete the record updated above

# Apply the shared options first so the delete-specific settings below are not overridden
deleteDF.write.format("hudi") \
    .options(**hudi_options) \
    .option('hoodie.datasource.write.operation', 'upsert') \
    .option('hoodie.datasource.write.payload.class', 'org.apache.hudi.common.model.EmptyHoodieRecordPayload') \
    .mode("append") \
    .save(tablePath)

The Athena table still shows the deleted record.
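Besides the empty-payload approach, Hudi also accepts `delete` as a value for `hoodie.datasource.write.operation`. Below is a minimal sketch of deriving delete options from an existing option dictionary. The helper name `as_delete_options` and the sample `base` dictionary are hypothetical, and this alone may not change what Athena shows if the table is not synced as a Hudi table.

```python
def as_delete_options(base_options):
    """Copy base_options and switch the write operation to a hard delete."""
    opts = dict(base_options)  # copy so the caller's dict is left untouched
    opts["hoodie.datasource.write.operation"] = "delete"
    return opts


# Hypothetical stand-in for the real hudi_options
base = {
    "hoodie.table.name": "withdrawals",
    "hoodie.datasource.write.recordkey.field": "withdrawalID",
    "hoodie.datasource.write.operation": "upsert",
}
delete_options = as_delete_options(base)

# With a live Spark session this would be used as:
# deleteDF.write.format("hudi").options(**delete_options).mode("append").save(tablePath)
```

Building the dictionary up front avoids the question of which `.option(...)` call wins when the same key is set twice on the writer.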

I also tried mode("overwrite"), but as expected it deletes the older partitions and keeps only the latest one.

Has anyone faced the same issue and can point me in the right direction?
