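I am looking for a way, using Hive, to take the contents of the AVSC file below and externalize the nested schema "RENTALRECORDTYPE" so that it can be reused by other schemas.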

{
    "type": "record",
    "name": "EMPLOYEE",
    "namespace": "",
    "doc": "EMPLOYEE is a person that works here",
    "fields": [
        {
            "name": "RENTALRECORD",
            "type": {
                "type": "record",
                "name": "RENTALRECORDTYPE",
                "namespace": "",
                "doc": "Rental record is a record that is kept on every item rented",
                "fields": [
                    {
                        "name": "due_date",
                        "doc": "The date when item is due",
                        "type": "int"
                    } 
                ]
            }
        },
        {
            "name": "hire_date",
            "doc": "Employee date of hire",
            "type": "int"
        }
    ]
}

Defining the schema this way works fine. I can issue the following HiveQL statement and the table is created successfully.

CREATE EXTERNAL TABLE employee
STORED AS AVRO
LOCATION '/user/dtom/store/data/employee'
TBLPROPERTIES ('avro.schema.url'='/user/dtom/store/schema/employee.avsc');

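However, I would like to reference an existing schema rather than duplicate the record definition in several schemas. For example, instead of a single schema file there would be two AVSC files: rentalrecord.avsc and employee.avsc.

rentalrecord.avsc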

{
    "type": "record",
    "name": "RENTALRECORD",
    "namespace": "",
    "doc": "A record that is kept for every rental",
    "fields": [
        {
            "name": "due_date",
            "doc": "The date on which the rental is due back to the store",
            "type": "int"
        }
    ]
}

employee.avsc

{
    "type": "record",
    "name": "EMPLOYEE",
    "namespace": "",
    "doc": "EMPLOYEE is a person that works for the VIDEO STORE",
    "fields": [
        {
            "name": "rentalrecord",
            "doc": "A rental record is a record on every rental",
            "type": "RENTALRECORD"
        },
        {
            "name": "hire_date",
            "doc": "Employee date of hire",
            "type": "int"
        }
    ]
}

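In this scenario, we want to externalize the RENTALRECORD schema definition so that it can be reused in employee.avsc and elsewhere.

When I try to import the schemas with the following two HiveQL statements, the second one fails: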

CREATE EXTERNAL TABLE rentalrecord
STORED AS AVRO
LOCATION '/user/dtom/store/data/rentalrecord'
TBLPROPERTIES ('avro.schema.url'='/user/dtom/store/schema/rentalrecord.avsc');

CREATE EXTERNAL TABLE employee
STORED AS AVRO
LOCATION '/user/dtom/store/data/employee'
TBLPROPERTIES ('avro.schema.url'='/user/dtom/store/schema/employee.avsc');

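rentalrecord.avsc is imported successfully, but employee.avsc fails on the first field definition, the field of type "RENTALRECORD". Hive outputs the following error:

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: MetaException(message:org.apache.hadoop.hive.serde2.SerDeException Encountered exception determining schema. Returning signal schema to indicate problem: "RENTALRECORD" is not a defined name. The type of the "rentalrecord" field must be a defined name or a {"type": ...} expression.)

My research tells me that Avro does support this form of schema reuse. So either I am missing something, or this is something Hive does not support.

Any help would be greatly appreciated.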

1 Answer

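I defined an AVDL file containing all of the referenced types, then used the avro-tools jar with the idl2schemata option to generate the AVSC files. The generated AVSC works like a charm with Hive!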
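
For illustration, here is a minimal sketch of what such an AVDL file might look like for the schemas in the question. The protocol name Store, the file name store.avdl, and the output directory are assumptions made up for this example, and <version> stands for whichever avro-tools release is in use. As far as I understand, idl2schemata writes one .avsc file per named type, and the file generated for EMPLOYEE should contain the RENTALRECORD definition inline, which is why Hive can resolve it from a single avro.schema.url.

```
// store.avdl (hypothetical file name)
// Generate the AVSC files with something like:
//   java -jar avro-tools-<version>.jar idl2schemata store.avdl <output-dir>

protocol Store {

  /** A record that is kept for every rental */
  record RENTALRECORD {
    /** The date on which the rental is due back to the store */
    int due_date;
  }

  /** EMPLOYEE is a person that works for the VIDEO STORE */
  record EMPLOYEE {
    /** A rental record is a record on every rental */
    RENTALRECORD rentalrecord;

    /** Employee date of hire */
    int hire_date;
  }
}
```

The record is defined once in the AVDL source and referenced by name, so the reuse happens at generation time rather than when Hive parses the schema.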

answered 2018-04-17T10:38:25.400