pyspark - Hudi partitioning and upsert not working

Question

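What is wrong with this configuration?

The partition key has no effect in Hudi, and every record in the Hudi dataset is updated when I run an upsert, so I cannot extract the delta from the table.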

commonConfig = {'className' : 'org.apache.hudi',
'hoodie.datasource.hive_sync.use_jdbc':'false',
'hoodie.datasource.write.precombine.field': 'hash_value',
'hoodie.datasource.write.recordkey.field': 'hash_value',
'hoodie.datasource.hive_sync.partition_fields':'year,month,day',
'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
'hoodie.datasource.write.keygenerator.class':'org.apache.hudi.ComplexKeyGenerator',
'hoodie.table.name': 'hudi_account',
'hoodie.consistency.check.enabled': 'true',
'hoodie.datasource.hive_sync.database': 'hudi_db',
'hoodie.datasource.hive_sync.table': 'hudi_account',
'hoodie.datasource.hive_sync.enable': 'true',
'path': 's3://' + args['curated_bucket'] + '/stage_e/hudi_db/hudi_account'}

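My use case is to handle the upsert logic and the partitioning with Hudi. The upsert only partly works because it updates the whole record set: for example, with 10k records in the raw bucket, upserting 1k records updates the Hudi commit time of all 10k records.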

score 0 · Accepted Answer

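Are your partition keys changing? By default Hudi does not use a global index; the index is scoped to each partition. I had a problem similar to yours, and enabling the global index resolved it. Try adding these settings: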

 "hoodie.index.type": "GLOBAL_BLOOM",                 # This is required if we want to ensure we upsert a record, even if the partition changes
 "hoodie.bloom.index.update.partition.path": "true",  # This is required to write the data into the new partition (defaults to false in 0.8.0, true in 0.9.0)

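As a minimal sketch, the two settings could be merged into the writer options from the question like this. The helper names (`with_global_index`, `upsert`) are hypothetical and not part of Hudi. The sketch assumes that only keys starting with `hoodie.` should reach the Spark writer and that `path` is the save location.

```python
# Settings suggested in the answer; the keys are Hudi writer options.
GLOBAL_INDEX_OPTIONS = {
    "hoodie.index.type": "GLOBAL_BLOOM",
    "hoodie.bloom.index.update.partition.path": "true",
}


def with_global_index(config):
    """Return a copy of `config` with the global-index settings added."""
    merged = dict(config)  # copy so the caller's dict is not mutated
    merged.update(GLOBAL_INDEX_OPTIONS)
    return merged


def upsert(df, config):
    """Upsert a Spark DataFrame into the Hudi table at config['path'].

    Assumption: non-Hudi keys such as 'className' and 'path' are not
    writer options, so only 'hoodie.*' keys are passed through.
    """
    hoodie_options = {k: v for k, v in config.items() if k.startswith("hoodie.")}
    (
        df.write.format("hudi")
        .options(**hoodie_options)
        .option("hoodie.datasource.write.operation", "upsert")
        .mode("append")  # append mode, so the existing table is not overwritten
        .save(config["path"])
    )
```

A call would then look like `upsert(df, with_global_index(commonConfig))`, where `df` holds only the changed records. If partitioning itself has no effect, it may also be worth checking that `hoodie.datasource.write.partitionpath.field` is set (for example to `year,month,day`), since the config in the question only sets the Hive sync partition fields.
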
I found the answer in this blog post: https://dacort.dev/posts/updating-partition-values-with-apache-hudi/

You can read more about Hudi indexes here: https://hudi.apache.org/blog/2020/11/11/hudi-indexing-mechanisms/
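
To check whether an upsert touched only the intended records, an incremental query can pull the records written after a given commit. This is a minimal sketch, assuming a Hudi release that supports the `hoodie.datasource.query.type` option (older releases may use a different option name) and an existing SparkSession. The function names are hypothetical.

```python
def incremental_read_options(begin_instant):
    """Build reader options for a Hudi incremental query.

    `begin_instant` is a Hudi commit timestamp string such as
    "20210101000000" (a placeholder value, not taken from the question).
    """
    return {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin_instant,
    }


def read_changes_since(spark, path, begin_instant):
    """Return a DataFrame of the records committed after `begin_instant`."""
    return (
        spark.read.format("hudi")
        .options(**incremental_read_options(begin_instant))
        .load(path)
    )
```

Grouping a snapshot read of the table by the `_hoodie_commit_time` metadata column shows how many records each commit rewrote, which makes the "all 10k records updated" symptom easy to confirm.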
