Ideally, the following code snippet would work:

import kudu
from kudu.client import Partitioning

df = ...  # some Spark DataFrame

# Connect to the Kudu master server
client = kudu.connect(host='…', port=7051)

# Infer the schema from the Spark DataFrame
schema = df.schema

# Define the partitioning schema
partitioning = Partitioning().add_hash_partitions(column_names=['key'], num_buckets=3)

# Create the new table
client.create_table('dev.some_example', schema, partitioning)

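However, client.create_table expects a kudu.schema.Schema, not the StructType that df.schema returns. In Scala, on the other hand, you can pass the DataFrame schema directly (from https://kudu.apache.org/docs/developing.html):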

kuduContext.createTable(
"dev.some_example", df.schema, Seq("key"),
new CreateTableOptions()
    .setNumReplicas(1)
    .addHashPartitions(List("key").asJava, 3))

Now I am wondering whether the same can be done in PySpark without manually defining every column with the Kudu schema builder?

1 Answer

So I wrote myself a helper function that converts a PySpark DataFrame schema into a kudu.schema.Schema. I hope this helps someone. Feedback is appreciated!

As a side note, you may want to add to or edit the data type mapping.
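
As one possible way to extend the mapping, the sketch below adds a few more Spark types and fails with a readable error for anything unmapped. To stay runnable without a Kudu installation, it maps to the names of the `kudu` module attributes (for example `"int16"` for `kudu.int16`) rather than to the type objects themselves. `EXTRA_TYPE_NAMES` and `kudu_type_name` are hypothetical names, and which additional types you need is an assumption about your data.

```python
# Hypothetical extension of the mapping: Spark type class name -> name of the
# corresponding attribute on the `kudu` module (e.g. "int16" means kudu.int16).
BASE_TYPE_NAMES = {
    "StringType": "string",
    "LongType": "int64",
    "IntegerType": "int32",
    "FloatType": "float",
    "DoubleType": "double",
    "BooleanType": "bool",
    "TimestampType": "unixtime_micros",
}
# Assumed additions; adjust to the types that actually occur in your data.
EXTRA_TYPE_NAMES = {
    "ByteType": "int8",
    "ShortType": "int16",
    "BinaryType": "binary",
}
TYPE_NAMES = {**BASE_TYPE_NAMES, **EXTRA_TYPE_NAMES}


def kudu_type_name(spark_type_name, type_names=TYPE_NAMES):
    """Return the kudu attribute name for a Spark type class name."""
    try:
        return type_names[spark_type_name]
    except KeyError:
        supported = ", ".join(sorted(type_names))
        raise ValueError(
            f"No Kudu mapping for Spark type {spark_type_name!r}; "
            f"supported types: {supported}"
        ) from None
```

With the real library, `getattr(kudu, kudu_type_name("ShortType"))` would then resolve to `kudu.int16`. Types such as `DecimalType` carry a precision and scale, so they would need more than a plain name lookup.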

import kudu
from kudu.client import Partitioning


def convert_to_kudu_schema(df_schema, primary_keys):
    """Convert a PySpark StructType into a kudu.schema.Schema."""
    builder = kudu.schema.SchemaBuilder()
    # Spark type class name -> Kudu type; extend as needed.
    data_type_map = {
        "StringType": kudu.string,
        "LongType": kudu.int64,
        "IntegerType": kudu.int32,
        "FloatType": kudu.float,
        "DoubleType": kudu.double,
        "BooleanType": kudu.bool,
        "TimestampType": kudu.unixtime_micros,
    }

    for sf in df_schema:
        # Primary key columns are always added as non-nullable.
        is_pk = sf.name in primary_keys
        # Look up by class name: str(sf.dataType) can include trailing
        # parentheses ("StringType()") in newer PySpark releases.
        builder.add_column(
            name=sf.name,
            nullable=False if is_pk else sf.nullable,
            type_=data_type_map[type(sf.dataType).__name__],
        )
    builder.set_primary_keys(primary_keys)
    return builder.build()

You can call it like this:

kudu_schema = convert_to_kudu_schema(df.schema, primary_keys=["key1", "key2"])
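
One thing to watch for: Kudu's schema design documentation states that primary key columns must be listed first in the schema, and a DataFrame's column order will not necessarily satisfy that. Assuming that restriction applies to your setup, a small sketch of reordering the column names before conversion could look like this. `keys_first` is a hypothetical helper, and it works on plain column names so that it runs without Spark or Kudu.

```python
def keys_first(column_names, primary_keys):
    """Return column names with primary keys first, in the order given.

    Raises ValueError if a primary key is not among the columns.
    """
    missing = [k for k in primary_keys if k not in column_names]
    if missing:
        raise ValueError(f"Primary key column(s) not in schema: {missing}")
    key_set = set(primary_keys)
    # Non-key columns keep their original relative order.
    rest = [c for c in column_names if c not in key_set]
    return list(primary_keys) + rest


# Example with made-up column names.
ordered = keys_first(["value", "key2", "ts", "key1"], ["key1", "key2"])
print(ordered)  # ['key1', 'key2', 'value', 'ts']
```

In PySpark, the result could be applied with `df.select(*keys_first(df.columns, ["key1", "key2"]))` before passing `df.schema` to convert_to_kudu_schema.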

I am still open to a more elegant solution. ;)

Answered 2018-10-31T10:43:38.140