apache-spark - EMR Hudi 无法创建 hive 连接 jdbc:hive2://localhost:10000/

Question

尝试在启用配置单元同步的 Jupyter 笔记本中保存 hudi 表。我正在使用 EMR: 5.28.0 并启用 AWS Glue 作为目录：

# Create a DataFrame
inputDF = spark.createDataFrame(
    [
        ("100", "2015-01-01", "2015-01-01T13:51:39.340396Z"),
        ("101", "2015-01-01", "2015-01-01T12:14:58.597216Z"),
        ("102", "2015-01-01", "2015-01-01T13:51:40.417052Z"),
        ("103", "2015-01-01", "2015-01-01T13:51:40.519832Z"),
        ("104", "2015-01-02", "2015-01-01T12:15:00.512679Z"),
        ("105", "2015-01-02", "2015-01-01T13:51:42.248818Z"),
    ],
    ["id", "creation_date", "last_update_time"]
)

# Specify common DataSourceWriteOptions in the single hudiOptions variable
hudiOptions = {
'hoodie.table.name': 'my_hudi_table',
'hoodie.datasource.write.recordkey.field': 'id',
'hoodie.datasource.write.partitionpath.field': 'creation_date',
'hoodie.datasource.write.precombine.field': 'last_update_time',
'hoodie.datasource.hive_sync.enable': 'true',
'hoodie.datasource.hive_sync.table': 'my_hudi_table',
'hoodie.datasource.hive_sync.partition_fields': 'creation_date',
'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor'
}

# Write a DataFrame as a Hudi dataset
(inputDF.write
.format('org.apache.hudi')
.option('hoodie.datasource.write.operation', 'insert')
.options(**hudiOptions)
.mode('overwrite')
.save('s3://dytyniak-test-data/myhudidataset/'))

收到以下错误：

An error occurred while calling o309.save.
: org.apache.hudi.hive.HoodieHiveSyncException: Cannot create hive connection jdbc:hive2://localhost:10000/

score 1 · Accepted Answer

我假设您正在遵循 AWS 文档中的教程。我通过设置hive_sync.mode为hmsin使用 Hudi 0.9.0 让它工作hudiOptions（参见hudi docs）：

hudiOptions = {
    'hoodie.table.name': 'my_hudi_table',
    'hoodie.datasource.write.recordkey.field': 'id',
    'hoodie.datasource.write.partitionpath.field': 'creation_date',
    'hoodie.datasource.write.precombine.field': 'last_update_time',
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.table': 'my_hudi_table',
    'hoodie.datasource.hive_sync.partition_fields': 'creation_date',
    'hoodie.datasource.hive_sync.partition_extractor_class': 
    'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    'hoodie.datasource.hive_sync.mode': 'hms'
}

apache-spark - EMR Hudi 无法创建 hive 连接 jdbc:hive2://localhost:10000/

1 回答 1

Related

Reference