hadoop - Custom Hive SerDe cannot select columns, but works with SELECT *

Question

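I am writing a custom SerDe and will only use it for deserialization. The underlying data is a Thrift binary in which each row is an event log. Each event has a schema that I have access to, but before storing it we wrap the event in another schema; let's call it Message. The reason I am writing a SerDe instead of using ThriftDeserializer is that, as mentioned, the underlying event is wrapped in a Message. So we first need to deserialize using the Message schema and then deserialize the data for that event.

The SerDe works (only) when I do a SELECT *, and I can deserialize the data as expected. But whenever I select individual columns from the table instead of using SELECT *, the rows are all NULL. The object inspector returned is a ThriftStructObjectInspector, and the object returned by deserialize is a TBase.

What could cause Hive to return NULL when we select columns, but return the column data when I do a SELECT *?

Here is the SerDe class (some class names changed):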

public class MyThriftSerde extends AbstractSerDe {

  private static final Log LOG = LogFactory.getLog(MyThriftSerde.class);

  /* Abstracting away the deserialization of the underlying event which is wrapped in a message */
  private static final MessageDeserializer myMessageDeserializer =
      MessageDeserializer.getInstance();

  /* Underlying event class which is wrapped in a Message */
  private String schemaClassName;
  private Class<?> schemaClass;

  /* Used to read the input row */
  public static List<String> inputFieldNames;
  public static List<ObjectInspector> inputFieldOIs;
  public static List<Integer> notSkipIDs;
  public static ObjectInspector inputRowObjectInspector;

  /* Output Object Inspector */
  public static ObjectInspector thriftStructObjectInspector;

  @Override
  public void initialize(Configuration conf, Properties tbl) throws SerDeException {
    try {

      logHeading("INITIALIZE MyThriftSerde");

      schemaClassName = tbl.getProperty(SERIALIZATION_CLASS);
      schemaClass = conf.getClassByName(schemaClassName);

      LOG.info(String.format("Building DDL for event: %s", schemaClass.getName()));

      inputFieldNames = new ArrayList<>();
      inputFieldOIs = new ArrayList<>();
      notSkipIDs = new ArrayList<>();

      /* Initialize the Input fields */

      // The underlying data is stored in RCFile format, and only has 1 column, event_binary
      // So we create a ColumnarStructBase for each row we deserialize.
      // This ColumnarStruct only has 1 column: event_binary
      inputFieldNames.add("event_binary");
      notSkipIDs.add(0);
      inputFieldOIs.add(LazyPrimitiveObjectInspectorFactory.LAZY_BINARY_OBJECT_INSPECTOR);
      inputRowObjectInspector =
          ObjectInspectorFactory.getColumnarStructObjectInspector(inputFieldNames, inputFieldOIs);

      /* Output Object Inspector */

      // This is what the SerDe will return, it is a ThriftStructObjectInspector
      thriftStructObjectInspector =
          ObjectInspectorFactory.getReflectionObjectInspector(
              schemaClass, ObjectInspectorFactory.ObjectInspectorOptions.THRIFT);

      // Only for debugging
      logHeading("THRIFT OBJECT INSPECTOR");
      LOG.info("Output OI Class Name: " + thriftStructObjectInspector.getClass().getName());
      LOG.info(
          "OI Details: "
              + ObjectInspectorUtils.getObjectInspectorName(thriftStructObjectInspector));

    } catch (Exception e) {
      LOG.info("Exception while initializing SerDe", e);
    }
  }

  @Override
  public Object deserialize(Writable rowWritable) throws SerDeException {

    logHeading("START DESERIALIZATION");

    ColumnarStructBase inputLazyStruct =
        new ColumnarStruct(inputRowObjectInspector, notSkipIDs, null);
    LazyBinary eventBinary;
    Message rowAsMessage;
    TBase deserializedRow = null;

    try {
      inputLazyStruct.init((BytesRefArrayWritable) rowWritable);
      eventBinary = (LazyBinary) inputLazyStruct.getField(0);
      rowAsMessage =
          myMessageDeserializer.fromBytes(eventBinary.getWritableObject().copyBytes(), null);
      deserializedRow = rowAsMessage.getEvent();

      LOG.info("deserializedRow.getClass(): " + deserializedRow.getClass());
      LOG.info("deserializedRow.toString(): " + deserializedRow.toString());

    } catch (Exception e) {
      e.printStackTrace();
    }

    logHeading("END DESERIALIZATION");

    return deserializedRow;
  }

  private void logHeading(String s) {
    LOG.info(String.format("-------------------  %s  -------------------", s));
  }

  @Override
  public ObjectInspector getObjectInspector() {
    return thriftStructObjectInspector;
  }
}

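Code context:

In the underlying data, each row contains only 1 column (called event_binary), stored as binary. The binary is a Message that contains 2 fields, "schema" + "event_data". That is, each row is a Message containing the underlying event's schema and data. We use the schema from the Message to deserialize the data.
The SerDe first deserializes the row as a Message, extracts the event data, and then deserializes the event.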
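
To make the two-step flow concrete, here is a minimal sketch in plain Java. It is an illustration only and makes several assumptions: the envelope layout (a UTF schema name, an int length, then the payload bytes) is invented for the example and is not the Thrift wire format, and the names EnvelopeSketch, Envelope, wrap, unwrap and decodeEvent are hypothetical stand-ins for Message and MessageDeserializer.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Illustration only: a stand-in for the Message envelope, NOT the Thrift wire format. */
final class EnvelopeSketch {

  /** Hypothetical stand-in for Message: a schema name plus the still-encoded event bytes. */
  static final class Envelope {
    final String schema;
    final byte[] eventData;

    Envelope(String schema, byte[] eventData) {
      this.schema = schema;
      this.eventData = eventData;
    }
  }

  /** Writes the assumed layout: UTF schema name, int payload length, payload bytes. */
  static byte[] wrap(String schema, byte[] eventData) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      out.writeUTF(schema);
      out.writeInt(eventData.length);
      out.write(eventData);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  /** Step 1: decode only the outer envelope; the event payload stays as raw bytes. */
  static Envelope unwrap(byte[] row) {
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(row))) {
      String schema = in.readUTF();
      byte[] payload = new byte[in.readInt()];
      in.readFully(payload);
      return new Envelope(schema, payload);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Step 2: decode the payload with the decoder selected by the envelope's schema field. */
  static String decodeEvent(Envelope envelope) {
    // Assumption: a single known schema whose payload is UTF-8 text.
    if ("TestEvent".equals(envelope.schema)) {
      return new String(envelope.eventData, StandardCharsets.UTF_8);
    }
    throw new IllegalArgumentException("Unknown schema: " + envelope.schema);
  }
}
```

Calling EnvelopeSketch.unwrap on a row and passing the result to EnvelopeSketch.decodeEvent mirrors what deserialize does above with myMessageDeserializer.fromBytes followed by getEvent.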

I created an external table that points to the Thrift data, using

ADD JAR hdfs://my-jar.jar;

CREATE EXTERNAL TABLE dev_db.thrift_event_data_deserialized
ROW FORMAT SERDE 'com.test.only.MyThriftSerde'
WITH SERDEPROPERTIES (
  "serialization.class"="com.test.only.TestEvent"
) STORED AS RCFILE
LOCATION 'location/of/thrift/data';

MSCK REPAIR TABLE thrift_event_data_deserialized;

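After that, SELECT * FROM dev_db.thrift_event_data_deserialized LIMIT 10; works as expected, but SELECT column1_name, column2_name FROM dev_db.thrift_event_data_deserialized LIMIT 10; does not work (the rows come back as NULL).

Any idea what I am missing here? Any help would be appreciated!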
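
One possible line of investigation, offered as an assumption rather than a confirmed diagnosis: with columnar formats such as RCFile, Hive can prune columns and pass the reader only the IDs of the columns a query needs (see ColumnProjectionUtils and the hive.io.file.readcolumn.ids setting). The table's logical columns here come from the Thrift struct, while the file physically holds a single column, so the projected IDs may not include physical column 0. The toy model below is not Hive's reader; ProjectionSketch, readRow and eventBinaryAvailable are hypothetical names used only to show how such a mismatch could leave event_binary empty.

```java
import java.util.List;

/** Toy model of column projection over a columnar row; not Hive's actual reader. */
final class ProjectionSketch {

  /**
   * Returns the row as a pruning reader might hand it over: columns whose index
   * is not in readColumnIds are left empty. A null readColumnIds models
   * "read all columns", which is the SELECT * case.
   */
  static byte[][] readRow(byte[][] physicalRow, List<Integer> readColumnIds) {
    byte[][] out = new byte[physicalRow.length][];
    for (int i = 0; i < physicalRow.length; i++) {
      boolean wanted = readColumnIds == null || readColumnIds.contains(i);
      out[i] = wanted ? physicalRow[i].clone() : new byte[0];
    }
    return out;
  }

  /** True if physical column 0 (event_binary in this model) survived pruning. */
  static boolean eventBinaryAvailable(byte[][] row) {
    return row.length > 0 && row[0].length > 0;
  }
}
```

In this model, a projection of logical columns 3 and 4 leaves column 0 empty, while a null projection (SELECT *) or a projection containing 0 keeps it. If the real reader behaves similarly, logging the projected column IDs from the job configuration inside initialize would confirm or rule this out.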
