google-bigquery - How to infer the Avro schema from a Kafka topic in Apache Beam KafkaIO

Question

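I am using Apache Beam's KafkaIO to read from a topic whose Avro schema is stored in Confluent Schema Registry. I can deserialize the messages and write them to files, but ultimately I want to write to BigQuery. My pipeline is not able to infer the schema. How do I extract/infer the schema and attach it to the data in the pipeline so that my downstream step (the write to BigQuery) can infer the schema?

Here is the code where I set up the deserializer with the schema registry URL and read from Kafka: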

    consumerConfig.put(
            AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG,
            options.getSchemaRegistryUrl());

    String schemaUrl = options.getSchemaRegistryUrl().get();
    String subj = options.getSubject().get();

    ConfluentSchemaRegistryDeserializerProvider<GenericRecord> valDeserializerProvider =
            ConfluentSchemaRegistryDeserializerProvider.of(schemaUrl, subj);

    pipeline
            .apply("Read from Kafka",
                    KafkaIO
                            .<byte[], GenericRecord>read()
                            .withBootstrapServers(options.getKafkaBrokers().get())
                            .withTopics(Utils.getListFromString(options.getKafkaTopics()))
                            .withConsumerConfigUpdates(consumerConfig)
                            .withValueDeserializer(valDeserializerProvider)
                            .withKeyDeserializer(ByteArrayDeserializer.class)
                            .commitOffsetsInFinalize()
                            .withoutMetadata()
            );

I initially thought this would be enough for Beam to infer the schema, but it is not: hasSchema() returns false.

Any help would be appreciated.

score 1 · Accepted Answer

Work is in progress to let KafkaIO infer Avro schemas stored in Confluent Schema Registry. In the meantime, it is already possible to do this in user pipeline code.

score 0 · Accepted Answer

This code will probably work, but I have not tested it.

    // Fetch the latest Avro schema for the subject from Confluent Schema Registry (CSR)
    String schemaRegistryUrl = "schema_registry_url";
    SchemaRegistryClient registryClient = new CachedSchemaRegistryClient(schemaRegistryUrl, 10);
    SchemaMetadata latestSchemaMetadata = registryClient.getLatestSchemaMetadata("schema_name");
    Schema avroSchema = new Schema.Parser().parse(latestSchemaMetadata.getSchema());

    PipelineOptions options = PipelineOptionsFactory.create();
    Pipeline p = Pipeline.create(options);

    // Create KafkaIO.Read with the Confluent Avro deserializer and an AvroCoder for that schema
    KafkaIO.Read<String, GenericRecord> read = KafkaIO.<String, GenericRecord>read()
        .withBootstrapServers("host:port")
        .withTopic("topic_name")
        .withConsumerConfigUpdates(ImmutableMap.of("schema.registry.url", schemaRegistryUrl))
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializerAndCoder((Class) KafkaAvroDeserializer.class, AvroCoder.of(avroSchema));

    // Derive the Beam schema from the Avro schema
    org.apache.beam.sdk.schemas.Schema beamSchema = AvroUtils.toBeamSchema(avroSchema);

    // Apply the read without Kafka metadata (Values needs KV input), keep only the
    // values, and set the Beam schema based on the Avro schema.
    // Recent Beam releases take a TypeDescriptor as the second setSchema argument;
    // older releases omit it.
    p.apply(read.withoutMetadata())
     .apply(Values.<GenericRecord>create())
     .setSchema(beamSchema,
        TypeDescriptor.of(GenericRecord.class),
        AvroUtils.getToRowFunction(GenericRecord.class, avroSchema),
        AvroUtils.getFromRowFunction(GenericRecord.class));

After that, I think you can use BigQueryIO.Write with useBeamSchema().
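
Before wiring the pipeline to BigQuery, it can help to check which schema the registry actually returns for the subject. The following is only a sketch and not part of the pipeline above: it assumes the registry's REST endpoint `GET /subjects/{subject}/versions/latest` and a Java 11+ runtime for `java.net.http`. The class and method names (`LatestSchemaLookup`, `latestSchemaUri`, `fetchLatestSchemaJson`) are made up for illustration.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: looks up the latest schema version of a subject over REST.
final class LatestSchemaLookup {

    // Builds the REST URI for the latest registered version of a subject.
    static URI latestSchemaUri(String registryUrl, String subject) {
        // Drop a trailing slash so the path separator is not doubled.
        String base = registryUrl.endsWith("/")
                ? registryUrl.substring(0, registryUrl.length() - 1)
                : registryUrl;
        // Form-style encoding; adequate for typical subject names
        // made of letters, digits, '.', '-' and '_'.
        String encoded = URLEncoder.encode(subject, StandardCharsets.UTF_8);
        return URI.create(base + "/subjects/" + encoded + "/versions/latest");
    }

    // Returns the raw JSON response body; its "schema" field holds the
    // Avro schema as a string.
    static String fetchLatestSchemaJson(String registryUrl, String subject) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(latestSchemaUri(registryUrl, subject))
                .header("Accept", "application/vnd.schemaregistry.v1+json")
                .GET()
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IllegalStateException("Schema lookup failed: HTTP " + response.statusCode());
        }
        return response.body();
    }
}
```

Calling `LatestSchemaLookup.fetchLatestSchemaJson("http://localhost:8081", "my-topic-value")` (both values are placeholders) should return a JSON document whose `schema` field contains the Avro schema text, which is what `getLatestSchemaMetadata(...).getSchema()` passes to `Schema.Parser` in the answer's code.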
