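I get a "not a SequenceFile" error when running an outer join. The same query used to work with the same settings and similar tables, but something has changed that I cannot identify, and I now get this error when joining fairly large tables over a large key space.
I am running Hive 0.13.1 on Cloudera 5.3.0 with YARN. Both tables are stored as ORC with tblproperties ("orc.compress" = "SNAPPY").
Storage information: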
SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed: No
Diagnostic messages for this task:
java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: hdfs://my_cluster:9000/user/hive/warehouse/my_table/000000_0 not a SequenceFile
    at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:283)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:506)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:447)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
    at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: hdfs://my_cluster:9000/user/hive/warehouse/my_table/000000_0 not a SequenceFile
    at org.apache.hadoop.hive.ql.exec.persistence.RowContainer.first(RowContainer.java:237)
    at org.apache.hadoop.hive.
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-1: Map: 35 Reduce: 1 Cumulative CPU: 2742.67 sec HDFS Read: 8762733372 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 45 minutes 42 seconds 670 msec
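One quick sanity check on the file named in the error is to look at its first bytes: ORC files begin with the ASCII bytes "ORC", while Hadoop SequenceFiles begin with "SEQ". The sketch below is only an illustration, not a diagnosis of the failure itself. The helper names `sniff_format` and `read_hdfs_header` are hypothetical, and it assumes the `hdfs` command-line client is on the PATH.

```python
import subprocess

# Leading magic bytes: ORC files start with "ORC", SequenceFiles with "SEQ".
MAGIC = {b"ORC": "orc", b"SEQ": "sequencefile"}


def sniff_format(header: bytes) -> str:
    """Classify a file by its first three bytes (hypothetical helper)."""
    return MAGIC.get(header[:3], "unknown")


def read_hdfs_header(path: str, n: int = 3) -> bytes:
    """Read the first n bytes of an HDFS file via `hdfs dfs -cat`.

    Assumes the `hdfs` CLI is installed and the path is readable.
    """
    proc = subprocess.Popen(
        ["hdfs", "dfs", "-cat", path],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
    )
    try:
        return proc.stdout.read(n)
    finally:
        # Stop the stream early; only the header is needed.
        proc.kill()
        proc.stdout.close()
        proc.wait()


# Example call (requires a running cluster):
# sniff_format(read_hdfs_header(
#     "hdfs://my_cluster:9000/user/hive/warehouse/my_table/000000_0"))
```

If the file turns out to be a valid ORC file, the data on disk is probably fine and something in the reduce-side join is reading it with the wrong reader.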
In my .hiverc:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions=10000;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.created.files=150000;
set hive.error.on.empty.partition=true;
set hive.cli.print.header=true;
set hive.optimize.s3.query=true;
set hive.auto.convert.join=true;
set mapred.child.java.opts=-Xmx2048m;
set hive.error.on.empty.partition=false;
set hive.hadoop.supports.splittable.combineinputformat=true;
set hive.enforce.bucketing=true;
set hive.optimize.bucketmapjoin=true;
set hive.mapjoin.smalltable.filesize=50000000;
set hive.resultset.use.unique.column.names=false;
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
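I tried declaring both tables as SequenceFile instead, but then I get a different error, IndexOutOfBound, on the full-size tables, though not on a small sample.
The metastore is MySQL.
The full list of Hive / Hadoop settings is long, but I can look anything up; I just don't know what to look for.
If this is related to IO or a corrupted HDFS, how can I check the health of HDFS?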
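For the HDFS health part, one common approach is `hdfs fsck` on the table directory, which reports missing or corrupt blocks. The following is a minimal sketch, assuming the `hdfs` CLI is on the PATH and that the fsck summary line contains "is HEALTHY" or "is CORRUPT"; the helper names `run_fsck` and `fsck_status` are hypothetical.

```python
import subprocess


def fsck_status(report: str) -> str:
    """Return 'HEALTHY', 'CORRUPT', or 'UNKNOWN' from fsck report text.

    Assumes the summary line contains "is HEALTHY" or "is CORRUPT".
    """
    for line in report.splitlines():
        if "is HEALTHY" in line:
            return "HEALTHY"
        if "is CORRUPT" in line:
            return "CORRUPT"
    return "UNKNOWN"


def run_fsck(path: str) -> str:
    """Run `hdfs fsck` on a path and return its stdout (needs a cluster)."""
    result = subprocess.run(
        ["hdfs", "fsck", path, "-files", "-blocks", "-locations"],
        capture_output=True,
        text=True,
    )
    return result.stdout


# Example call (requires a running cluster):
# print(fsck_status(run_fsck("/user/hive/warehouse/my_table")))
```

A HEALTHY result would point away from block-level corruption and back toward the query plan or the join settings.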