I want to read ORC files in a MapReduce job written in Python. I tried running it with:
hadoop jar /usr/lib/hadoop/lib/hadoop-streaming-2.6.0.2.2.6.0-2800.jar \
-file /hdfs/price/mymapper.py \
-mapper '/usr/local/anaconda/bin/python mymapper.py' \
-file /hdfs/price/myreducer.py \
-reducer '/usr/local/anaconda/bin/python myreducer.py' \
-input /user/hive/orcfiles/* \
-libjars /usr/hdp/2.2.6.0-2800/hive/lib/hive-exec.jar \
-inputformat org.apache.hadoop.hive.ql.io.orc.OrcInputFormat \
-numReduceTasks 1 \
-output /user/hive/output
But I get this error:
-inputformat : class not found : org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
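I found a similar question, "OrcNewInputformat as a inputformat for hadoop streaming", but the answer there is not clear to me.
Could someone give an example of how to read ORC files correctly in Hadoop Streaming?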