avro - Avro schema evolution

Question

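I have two questions:

1. Is it possible to use the same reader to parse records that were written with two compatible schemas, e.g. Schema V2 only has one additional optional field compared to Schema V1, and I want the reader to understand both? I think the answer here is no, but if it is yes, how do I do that?
2. I have tried writing a record with Schema V1 and reading it with Schema V2, but I get the following error:

org.apache.avro.AvroTypeException: Found foo, expecting foo

I used avro-1.7.3 and: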

   writer = new GenericDatumWriter<GenericData.Record>(SchemaV1);
   reader = new GenericDatumReader<GenericData.Record>(SchemaV2, SchemaV1);

Here are examples of the two schemas (I have also tried adding a namespace, but no luck).

Schema V1:

{
"name": "foo",
"type": "record",
"fields": [{
    "name": "products",
    "type": {
        "type": "array",
        "items": {
            "name": "product",
            "type": "record",
            "fields": [{
                "name": "a1",
                "type": "string"
            }, {
                "name": "a2",
                "type": {"type": "fixed", "name": "a3", "size": 1}
            }, {
                "name": "a4",
                "type": "int"
            }, {
                "name": "a5",
                "type": "int"
            }]
        }
    }
}]
}

Schema V2:

{
"name": "foo",
"type": "record",
"fields": [{
    "name": "products",
    "type": {
        "type": "array",
        "items": {
            "name": "product",
            "type": "record",
            "fields": [{
                "name": "a1",
                "type": "string"
            }, {
                "name": "a2",
                "type": {"type": "fixed", "name": "a3", "size": 1}
            }, {
                "name": "a4",
                "type": "int"
            }, {
                "name": "a5",
                "type": "int"
            }]
        }
    }
},
{
    "name": "purchases",
    "type": ["null", {
        "type": "array",
        "items": {
            "name": "purchase",
            "type": "record",
            "fields": [{
                "name": "a1",
                "type": "int"
            }, {
                "name": "a2",
                "type": "int"
            }]
        }
    }]
}]
}

Thanks in advance.

score 12 · Accepted Answer

I ran into the same issue. It might be a bug in Avro, but you can probably work around it by adding "default": null to the "purchases" field.

See my blog for details: http://ben-tech.blogspot.com/2013/05/avro-schema-evolution.html
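As a sketch of that workaround, the "purchases" field from Schema V2 in the question could be declared as below. The only change is the added "default" entry. Avro requires a union field's default to match the first branch of the union, which is "null" here, so `null` is a valid default. Whether this alone resolves the error on avro-1.7.3 is the answer's suggestion, not something verified here.

```json
{
    "name": "purchases",
    "type": ["null", {
        "type": "array",
        "items": {
            "name": "purchase",
            "type": "record",
            "fields": [
                {"name": "a1", "type": "int"},
                {"name": "a2", "type": "int"}
            ]
        }
    }],
    "default": null
}
```

With a default in place, a reader using Schema V2 can fill in `purchases` as null for records written with Schema V1. Note also that the two-schema constructor of `GenericDatumReader` takes the writer's schema first and the reader's schema second.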

score 0 · Accepted Answer

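You can do the opposite: parse the data with schema 1 and write the data with schema 2. At write time the data is written to the file, and if we leave out a field at read time, that is fine. But if we write fewer fields than we read, the extra field cannot be found at read time, so it gives an error.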

score -1 · Accepted Answer

The best way is to have a schema mapping that maintains the schemas, such as the Confluent Avro schema registry.

Key takeaways:

1.  Unlike Thrift, Avro-serialized objects do not carry any schema.
2.  Because no schema is stored in the serialized byte array, the reader has to be given the schema with which the data was written.
3.  Confluent Schema Registry provides a service to maintain schema versions.
4.  Confluent provides a cached schema client, which checks its cache first before sending a request over the network.
5.  The JSON schema in the "avsc" file is different from the schema present in the Avro object.
6.  All Avro objects extend GenericRecord.
7.  During serialization: based on the schema of the Avro object, a schema ID is requested from the Confluent Schema Registry.
8.  The schema ID, which is an integer, is converted to bytes and prepended to the serialized Avro object.
9.  During deserialization: the first 4 bytes are removed from the byte array and converted back to an integer (the schema ID).
10. The schema is requested from the Confluent Schema Registry, and the byte array is deserialized using this schema.

http://bytepadding.com/big-data/spark/avro/avro-serialization-de-serialization-using-confluent-schema-registry/
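The framing described in points 8 and 9 can be sketched in plain Java, with no Avro or Confluent dependency. The class and method names (`SchemaIdFraming`, `frame`, `readSchemaId`, `readPayload`) are hypothetical and chosen for illustration. The sketch follows the list above literally, with a bare 4-byte ID in front of the payload; Confluent's own serializer additionally writes a leading magic byte, which is omitted here.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical helper: frames a serialized Avro payload with a 4-byte schema ID.
class SchemaIdFraming {

    // Serialization side: prepend the schema ID (big-endian int) to the payload.
    static byte[] frame(int schemaId, byte[] avroPayload) {
        return ByteBuffer.allocate(Integer.BYTES + avroPayload.length)
                .putInt(schemaId)
                .put(avroPayload)
                .array();
    }

    // Deserialization side: the first 4 bytes are converted back to the schema ID.
    static int readSchemaId(byte[] framed) {
        return ByteBuffer.wrap(framed).getInt();
    }

    // Deserialization side: the remaining bytes are the serialized Avro object.
    static byte[] readPayload(byte[] framed) {
        return Arrays.copyOfRange(framed, Integer.BYTES, framed.length);
    }

    public static void main(String[] args) {
        byte[] payload = {1, 2, 3};          // stand-in for Avro-encoded bytes
        byte[] framed = frame(42, payload);  // 42 is an arbitrary example ID
        System.out.println(readSchemaId(framed));
        System.out.println(Arrays.toString(readPayload(framed)));
    }
}
```

In a real pipeline, the ID returned by `readSchemaId` would be used to look up the writer's schema in the registry, and that schema would then be passed to the Avro reader together with the payload.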
