I'm using Kylin, a data warehouse tool built on Hadoop, Hive and HBase. It ships with sample data so that the system can be tested, and I was building this sample cube. The build is a multi-step process, and many of the steps are MapReduce jobs. The second step, "Extract Fact Table Distinct Columns", is an MR job. It fails without writing anything to the Hadoop logs. After digging deeper, I found one exception in logs/userlogs/application_1450941430146_0002/container_1450941430146_0002_01_000004/syslog:
2015-12-24 07:31:03,034 WARN [main] org.apache.hadoop.mapred.YarnChild:
Exception running child : java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hive.hcatalog.mapreduce.HCatInputFormat not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
at org.apache.hadoop.mapreduce.task.JobContextImpl.getInputFormatClass(JobContextImpl.java:174)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:749)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hive.hcatalog.mapreduce.HCatInputFormat not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
... 8 more
2015-12-24 07:31:03,037 INFO [main] org.apache.hadoop.mapred.Task: Runnning cleanup for the task
My question is: should I copy all the dependency jars of the mapper class to every Hadoop node? The job succeeds if I restart the Kylin server and resume the cube build job. The same failure shows up again when I restart it after cleaning up everything.
I am using a 5-node cluster; each node has 8 cores and 30GB. The NameNode runs on one node, and a DataNode runs on all 5 nodes. For HBase, HMaster and HQuorumPeer run on the same node as the NameNode, and an HRegionServer runs on every node. Hive and Kylin are deployed on the master node.
Version information:
Ubuntu 12.04 (64 bit)
Hadoop 2.7.1
HBase 0.98.16
Hive 0.14.0
Kylin 1.1.1
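
To narrow down where the class goes missing, a small check like the one below could be run on each node to see whether `org.apache.hive.hcatalog.mapreduce.HCatInputFormat` is visible on a given classpath. This is only a sketch: `ClasspathCheck` and `isLoadable` are hypothetical names made up for illustration and are not part of Kylin, Hive or Hadoop.

```java
public class ClasspathCheck {

    // Returns true if the named class can be located by this class's class loader.
    public static boolean isLoadable(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Not on the classpath, or present but one of its dependencies is missing
            return false;
        }
    }

    public static void main(String[] args) {
        // Default to the class named in the stack trace above
        String name = args.length > 0
                ? args[0]
                : "org.apache.hive.hcatalog.mapreduce.HCatInputFormat";
        System.out.println(name + " loadable: " + isLoadable(name));
    }
}
```

After compiling it, it could be run as `java -cp "$(hadoop classpath):." ClasspathCheck` on each node. Note that the classpath of the YARN container that runs the map task may differ from what `hadoop classpath` prints, so a positive result here would not by itself prove the task can see the class.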