
I am trying to use a Hive table that uses the HCatalog JSON SerDe (from hcatalog-core-0.5.0-cdh4.7.0.jar). I am running on CDH4 (Hadoop 2.0.0-cdh4.7.0 and Hive 0.10.0-cdh4.7.0).

Table definition:

CREATE EXTERNAL TABLE some_table(
  user_id int COMMENT 'from deserializer',
  event_time int COMMENT 'from deserializer',
  some_string string COMMENT 'from deserializer',
  some_id string COMMENT 'from deserializer',
  another_id int COMMENT 'from deserializer')
PARTITIONED BY (
  year int,
  month int,
  day int)
ROW FORMAT SERDE
  'org.apache.hcatalog.data.JsonSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://localhost:8020/somedir/some_table'
TBLPROPERTIES (
  'last_modified_by'='volker',
  'last_modified_time'='1424980336',
  'transient_lastDdlTime'='1424980952')

The partition was created like this:

alter table some_table add if not exists partition (year=2015,month=02,day=26) location '/somedir/some_table/year=2015/month=02/day=26'
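As a quick sanity check (not part of the original question), the metastore registration can be confirmed from the Hive CLI before querying:

```sql
-- Confirm the partition above is actually registered in the metastore;
-- the output should list an entry for year=2015/month=2/day=26
SHOW PARTITIONS some_table;
```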

At first everything looks fine, and I can read the data when selecting all columns:

hive> select * from some_table limit 10;
OK
671764813   1424980760  fbx NtiwgY  6   2015    02  26
1632511524  1424980760  fbx AdMybO  10  2015    02  26
1201817175  1424980760  fbx GgQJEd  6   2015    02  26
1621940110  1424980760  fbx qmsXNQ  12  2015    02  26
326380277   1424980760  fbx zgVFgP  2   2015    02  26
1256744282  1424980760  fbx GeIFxq  6   2015    02  26
1741961976  1424980760  fbx CiuxZU  8   2015    02  26
2009923690  1424980760  fbx ZmGOvK  2   2015    02  26
1728798342  1424980760  fbx YikDcV  8   2015    02  26
688185292   1424980760  fbx NssSWN  7   2015    02  26

However, whenever I try to read or reference a specific field, the query fails:

hive> select another_id from some_table limit 10;
java.lang.IllegalArgumentException: Can not create a Path from an empty string
    at org.apache.hadoop.fs.Path.checkPathArg(Path.java:91)
    at org.apache.hadoop.fs.Path.<init>(Path.java:99)
    at org.apache.hadoop.fs.Path.<init>(Path.java:58)
    at org.apache.hadoop.mapred.JobClient.copyRemoteFiles(JobClient.java:745)
    at org.apache.hadoop.mapred.JobClient.copyAndConfigureFiles(JobClient.java:849)
    at org.apache.hadoop.mapred.JobClient.copyAndConfigureFiles(JobClient.java:774)
    at org.apache.hadoop.mapred.JobClient.access$400(JobClient.java:178)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:991)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:976)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:976)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:950)
    at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:448)
    at org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:138)
    at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:138)
    at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:66)
    at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1383)
    at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1169)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:982)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:902)
    at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:259)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:216)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:412)
    at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:613)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:208)

The same thing happens when I use a field in a where condition.

I can use the partition fields in a where clause, so select * from some_table where year=2015 works fine, while select year from some_table limit 10 fails with the error above.
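One detail worth noting: in Hive 0.10, select * ... limit N can be served by a simple fetch task without launching MapReduce, while selecting a specific column submits a MapReduce job, and the stack trace shows the failure inside JobClient while copying auxiliary files during job submission. A common cause of "Can not create a Path from an empty string" at that point is an empty or malformed jar path handed to the job (for example in hive.aux.jars.path), so one hedged first step is to register the SerDe jar explicitly in the session. The jar location below is an assumption, not something stated in the question:

```sql
-- Register the HCatalog SerDe jar for this session so the MapReduce job
-- can ship it to the task nodes (path is an assumption; adjust to wherever
-- hcatalog-core-0.5.0-cdh4.7.0.jar actually lives on your system)
ADD JAR /usr/lib/hcatalog/share/hcatalog/hcatalog-core-0.5.0-cdh4.7.0.jar;

-- Then re-run the failing query in the same session
SELECT another_id FROM some_table LIMIT 10;
```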

The files in HDFS look like this:

{"another_id":6,"user_id":671764813,"some_id":"NtiwgY","event_time":1424980760,"some_string":"fbx"}
{"another_id":10,"user_id":1632511524,"some_id":"AdMybO","event_time":1424980760,"some_string":"fbx"}
{"another_id":6,"user_id":1201817175,"some_id":"GgQJEd","event_time":1424980760,"some_string":"fbx"}

I hope this is just a problem with my table definition. Any help is welcome.


1 Answer


I never got it to work with the HCatalog SerDe. However, since what I actually wanted was to store JSON in HDFS and read it as a Hive table, I eventually managed to do that by using a different SerDe, which you can find here:

https://github.com/rcongiu/Hive-JSON-Serde

It works fine for me on CDH4.
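For reference, a minimal sketch of the same table using that SerDe. The class name org.openx.data.jsonserde.JsonSerDe comes from the linked repository; the jar path is a placeholder for the jar you build from it. Like most JSON SerDes, it matches columns to JSON keys by name, so the field order in the files does not matter:

```sql
-- Placeholder path: use the jar built from the linked repository
ADD JAR /path/to/json-serde-jar-with-dependencies.jar;

CREATE EXTERNAL TABLE some_table (
  user_id int,
  event_time int,
  some_string string,
  some_id string,
  another_id int)
PARTITIONED BY (year int, month int, day int)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE
LOCATION 'hdfs://localhost:8020/somedir/some_table';
```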

Answered 2015-03-05T19:01:54.250