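I'm working with a SerDe that processes Thrift object entries in various ways on top of a base data set. It is essentially a glorified Hive struct that processes the base data set at run time instead of storing the results in a table. I recently upgraded our cluster from Hive 0.7.1 to Hive 0.10.0 (moving from CDH3 to CDH4.3.0), and the SerDe no longer processes the data lazily. Instead, it appears to process every field that is defined.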

I've dug through Hive's code and traced how our data is deserialized in order to understand how Hive decides which fields to process. Unfortunately, it seems to process all of the columns simply because our ObjectInspector returns all the fields of our custom object, and I can't figure out how to control which fields are processed.

What parts of Hive can I manipulate to change which fields are processed? Is there a way to detect which fields a query actually uses, so that I can disable functions in my object's internal state?

Edit: I realized it would be useful to include a stack trace showing where one of the data-processing functions is called because its field is being inspected.

I've replaced our custom class names with names that describe their roles.

2013-10-08 17:02:45,198 INFO CustomStructFunction: Stack trace: java.lang.Throwable
    at CustomStructFunction.init(CustomStructFunction.java:490)
    at CustomStructFunctionBase.process(CustomStructFunctionBase.java:27)
    at CustomStructObject.callImplementor(CustomStructObject.java:332)
    at CustomStructField.callImplementor(CustomStructField.java:161)
    at CustomStructField.getValue(CustomStructField.java:131)
    at CustomStructObjectInspector.getStructFieldData(CustomStructObjectInspector.java:46)
    at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters$StructConverter.convert(ObjectInspectorConverters.java:298)
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:630)
    at org.apache.hadoop.hive.ql.exec.ExecMapper.map(ExecMapper.java:141)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:417)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
    at org.apache.hadoop.mapred.Child.main(Child.java:262)

1 Answer

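It turned out that we were returning a new ObjectInspector every time one was requested for our custom object. This led Hive to treat the input format of the custom struct as distinct from the output format, which triggered Hive to convert the data into a basic struct object, and that in effect means processing every field.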
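To make the effect concrete, here is a minimal sketch of the behavior described above. It is a simplified model, not Hive's actual code: SketchInspector, LazyRow, and ConverterSketch are hypothetical stand-ins, and the assumption is that conversion is skipped only when the input and output inspectors compare equal. The StructConverter.convert frame in the stack trace is consistent with the copy branch being taken.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an ObjectInspector; not Hive's real interface.
// It does not override equals(), so two instances match only if they are
// the very same object.
final class SketchInspector {
    final List<String> fieldNames;

    SketchInspector(List<String> fieldNames) {
        this.fieldNames = fieldNames;
    }
}

// Hypothetical row whose fields are computed on demand; it records which
// fields were actually evaluated.
final class LazyRow {
    final List<String> touched = new ArrayList<>();

    String get(String field) {
        touched.add(field);
        return field + "-value";
    }
}

final class ConverterSketch {
    // Assumed model: pass the row through untouched when the inspectors
    // match, otherwise copy every declared field into a plain list.
    static Object convert(LazyRow row, SketchInspector in, SketchInspector out) {
        if (in.equals(out)) {
            return row; // identity path: no field is evaluated
        }
        List<String> copy = new ArrayList<>();
        for (String name : out.fieldNames) {
            copy.add(row.get(name)); // forces evaluation of each field
        }
        return copy;
    }
}
```

With a single shared inspector the row stays lazy. With a fresh inspector per request, every field is evaluated even if the query only needs one of them.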

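Instead of returning a new ObjectInspector every time from our base custom struct definition, I left it to the extending classes to define a static ObjectInspector that starts out as null. The parent class then calls the method "getInnerObjectInspector"; if the static inspector is null, it sets it using a setup routine similar to the one used for a new instance.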
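A rough sketch of that pattern follows. Only the method name getInnerObjectInspector comes from the description above; CachedInspector, CustomStructBase, EventStruct, and the getCached/setCached/structName hooks are hypothetical names chosen for illustration, and CachedInspector stands in for whatever inspector class the real code builds.

```java
// Hypothetical stand-in for the inspector type; not Hive's real class.
final class CachedInspector {
    final String structName;

    CachedInspector(String structName) {
        this.structName = structName;
    }
}

// Base class: it never builds an inspector per call. It asks the subclass
// for its static slot and fills that slot only on first use.
abstract class CustomStructBase {
    protected abstract CachedInspector getCached();
    protected abstract void setCached(CachedInspector inspector);
    protected abstract String structName();

    public final CachedInspector getInnerObjectInspector() {
        CachedInspector inspector = getCached();
        if (inspector == null) {
            // Same setup that used to run on every request (assumed here
            // to be a plain constructor call).
            inspector = new CachedInspector(structName());
            setCached(inspector);
        }
        return inspector;
    }
}

// Each extending class owns one static inspector that starts out as null.
final class EventStruct extends CustomStructBase {
    private static CachedInspector inspector;

    @Override
    protected CachedInspector getCached() {
        return inspector;
    }

    @Override
    protected void setCached(CachedInspector value) {
        inspector = value;
    }

    @Override
    protected String structName() {
        return "event";
    }
}
```

Every instance of a given subclass now hands back the same inspector object, so an equality check between the input and output inspectors can succeed. The lazy initialization in this sketch is not synchronized; that is an assumption that the inspector is first requested from a single thread.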

answered 2013-10-08T17:58:55.603