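I am writing a custom SerDe that will only ever be used for deserialization. The underlying data is a Thrift binary file in which each row is an event log. Each event has a schema that I have access to, but before storing an event we wrap it in another schema, which I'll call Message. I am writing my own SerDe rather than using ThriftDeserializer because, as mentioned, the underlying events are wrapped in a Message: we first have to deserialize with the Message schema, and then deserialize that event's data.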
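The SerDe works (only) when I run SELECT *, and I can deserialize the data as expected. But whenever I select individual columns from the table instead of SELECT *, the rows are all NULL. The object inspector returned is a ThriftStructObjectInspector, and the object returned by deserialization is a TBase.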
What could cause Hive to return NULL when we select columns, but return the column data when I run SELECT *?
Here is the SerDe class (some class names changed):
public class MyThriftSerde extends AbstractSerDe {
  private static final Log LOG = LogFactory.getLog(MyThriftSerde.class);

  /* Abstracting away the deserialization of the underlying event which is wrapped in a message */
  private static final MessageDeserializer myMessageDeserializer =
      MessageDeserializer.getInstance();

  /* Underlying event class which is wrapped in a Message */
  private String schemaClassName;
  private Class<?> schemaClass;

  /* Used to read the input row */
  public static List<String> inputFieldNames;
  public static List<ObjectInspector> inputFieldOIs;
  public static List<Integer> notSkipIDs;
  public static ObjectInspector inputRowObjectInspector;

  /* Output Object Inspector */
  public static ObjectInspector thriftStructObjectInspector;

  @Override
  public void initialize(Configuration conf, Properties tbl) throws SerDeException {
    try {
      logHeading("INITIALIZE MyThriftSerde");
      schemaClassName = tbl.getProperty(SERIALIZATION_CLASS);
      schemaClass = conf.getClassByName(schemaClassName);
      LOG.info(String.format("Building DDL for event: %s", schemaClass.getName()));
      inputFieldNames = new ArrayList<>();
      inputFieldOIs = new ArrayList<>();
      notSkipIDs = new ArrayList<>();

      /* Initialize the Input fields */
      // The underlying data is stored in RCFile format, and only has 1 column, event_binary
      // So we create a ColumnarStructBase for each row we deserialize.
      // This ColumnarStruct only has 1 column: event_binary
      inputFieldNames.add("event_binary");
      notSkipIDs.add(0);
      inputFieldOIs.add(LazyPrimitiveObjectInspectorFactory.LAZY_BINARY_OBJECT_INSPECTOR);
      inputRowObjectInspector =
          ObjectInspectorFactory.getColumnarStructObjectInspector(inputFieldNames, inputFieldOIs);

      /* Output Object Inspector */
      // This is what the SerDe will return, it is a ThriftStructObjectInspector
      thriftStructObjectInspector =
          ObjectInspectorFactory.getReflectionObjectInspector(
              schemaClass, ObjectInspectorFactory.ObjectInspectorOptions.THRIFT);

      // Only for debugging
      logHeading("THRIFT OBJECT INSPECTOR");
      LOG.info("Output OI Class Name: " + thriftStructObjectInspector.getClass().getName());
      LOG.info(
          "OI Details: "
              + ObjectInspectorUtils.getObjectInspectorName(thriftStructObjectInspector));
    } catch (Exception e) {
      LOG.info("Exception while initializing SerDe", e);
    }
  }

  @Override
  public Object deserialize(Writable rowWritable) throws SerDeException {
    logHeading("START DESERIALIZATION");
    ColumnarStructBase inputLazyStruct =
        new ColumnarStruct(inputRowObjectInspector, notSkipIDs, null);
    LazyBinary eventBinary;
    Message rowAsMessage;
    TBase deserializedRow = null;
    try {
      inputLazyStruct.init((BytesRefArrayWritable) rowWritable);
      eventBinary = (LazyBinary) inputLazyStruct.getField(0);
      rowAsMessage =
          myMessageDeserializer.fromBytes(eventBinary.getWritableObject().copyBytes(), null);
      deserializedRow = rowAsMessage.getEvent();
      LOG.info("deserializedRow.getClass(): " + deserializedRow.getClass());
      LOG.info("deserializedRow.toString(): " + deserializedRow.toString());
    } catch (Exception e) {
      e.printStackTrace();
    }
    logHeading("END DESERIALIZATION");
    return deserializedRow;
  }

  private void logHeading(String s) {
    LOG.info(String.format("------------------- %s -------------------", s));
  }

  @Override
  public ObjectInspector getObjectInspector() {
    return thriftStructObjectInspector;
  }
}
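Context for the code:
- In the underlying data, each row contains only 1 column (called event_binary), stored as binary. The binary is a Message with 2 fields, "schema" + "event_data". That is, each row is a Message that carries the schema and the data of the underlying event. We use the schema from the Message to deserialize the data.
- The SerDe first deserializes the row into a Message, extracts the event data, and then deserializes the event.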
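To make the two-step flow above concrete, here is a minimal, self-contained sketch in plain Java. It does not use Hive or Thrift: the Envelope class, the length-prefixed wire format, and the DECODERS registry are all hypothetical stand-ins for our Message wrapper and MessageDeserializer, chosen only so the example runs on its own.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class EnvelopeSketch {

    /** Hypothetical stand-in for Message: a schema name plus the raw event bytes. */
    static final class Envelope {
        final String schema;
        final byte[] eventData;

        Envelope(String schema, byte[] eventData) {
            this.schema = schema;
            this.eventData = eventData;
        }
    }

    /** Hypothetical wire format: 4-byte name length, UTF-8 schema name, then the payload. */
    static byte[] encode(String schema, byte[] eventData) {
        byte[] name = schema.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(4 + name.length + eventData.length)
                .putInt(name.length)
                .put(name)
                .put(eventData)
                .array();
    }

    /** Step 1: decode only the wrapper, leaving the event payload as opaque bytes. */
    static Envelope decodeEnvelope(byte[] row) {
        ByteBuffer buf = ByteBuffer.wrap(row);
        byte[] name = new byte[buf.getInt()];
        buf.get(name);
        byte[] payload = new byte[buf.remaining()];
        buf.get(payload);
        return new Envelope(new String(name, StandardCharsets.UTF_8), payload);
    }

    /** Hypothetical registry mapping a schema name to the decoder for that event type. */
    static final Map<String, Function<byte[], Object>> DECODERS = new HashMap<>();

    /** Step 2: use the schema carried by the wrapper to decode the event payload. */
    static Object decodeEvent(byte[] row) {
        Envelope env = decodeEnvelope(row);
        Function<byte[], Object> decoder = DECODERS.get(env.schema);
        if (decoder == null) {
            throw new IllegalArgumentException("No decoder registered for schema: " + env.schema);
        }
        return decoder.apply(env.eventData);
    }

    public static void main(String[] args) {
        // For illustration, the "event" is just a UTF-8 string.
        DECODERS.put("TestEvent", bytes -> new String(bytes, StandardCharsets.UTF_8));
        byte[] row = encode("TestEvent", "clicked".getBytes(StandardCharsets.UTF_8));
        System.out.println(decodeEvent(row));
    }
}
```

In the real SerDe, step 1 corresponds to myMessageDeserializer.fromBytes(...) and step 2 to rowAsMessage.getEvent(). Note that this sketch throws on an unknown schema rather than returning null, which makes a failed decode easier to tell apart from a genuinely empty row.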
I created an external table pointing at the Thrift data, using:
ADD JAR hdfs://my-jar.jar;
CREATE EXTERNAL TABLE dev_db.thrift_event_data_deserialized
ROW FORMAT SERDE 'com.test.only.MyThriftSerde'
WITH SERDEPROPERTIES (
"serialization.class"="com.test.only.TestEvent"
) STORED AS RCFILE
LOCATION 'location/of/thrift/data';
MSCK REPAIR TABLE thrift_event_data_deserialized;
After that, SELECT * FROM dev_db.thrift_event_data_deserialized LIMIT 10; works as expected, but SELECT column1_name, column2_name FROM dev_db.thrift_event_data_deserialized LIMIT 10; does not work.
Any idea what I'm missing here? Any help would be appreciated!