hadoop - Reading an Excel file in Hadoop MapReduce

Question

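I am trying to read an Excel file that contains some data I want to aggregate in Hadoop. The MapReduce program seems to run fine, but the output it produces is unreadable. Do I need to use a special InputFormat reader for Excel files in Hadoop MapReduce? My configuration is as follows: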

    Configuration conf = getConf();
    // Job.getInstance replaces the deprecated Job(conf, name) constructor
    Job job = Job.getInstance(conf, "LatestWordCount");
    job.setJarByClass(FlightDetailsCount.class);
    Path input = new Path(args[0]);
    Path output = new Path(args[1]);
    FileInputFormat.setInputPaths(job, input);
    FileOutputFormat.setOutputPath(job, output);
    job.setMapperClass(MapClass.class);
    job.setReducerClass(ReduceClass.class);
    //job.setCombinerClass(ReduceClass.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    //job.setOutputKeyClass(Text.class);
    //job.setOutputValueClass(Text.class);
    // Return the job status from run() instead of calling System.exit here
    return job.waitForCompletion(true) ? 0 : 1;

The output looks like this: ��KW ��O�A��]n��Ε��r3�\n"��p�饚6W�jJ��9W�f=��9ml��dR y/Ք��7�^�i ��M*Ք�^nz��l��^�)��妗j�(��dRͱ/7�TS*��M//7�TS�� &�jZ��o��TSR�7�@�)�o��TӺ��5{%��+��ۆ�w6-��=�e�_}m�)~��ʅ� ��: #�j�]��u��>

score 5 · Accepted Answer

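I don't know whether anyone has actually developed a custom InputFormat for MS Excel files (I doubt it; a quick search turned up nothing), but you definitely cannot read Excel files with TextInputFormat. XLS files are binary.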
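
To see why a line-oriented text reader produces garbage, it can help to check the first bytes of the input file. The sketch below is only an illustration and is not part of the original answer: legacy .xls workbooks start with the OLE2 compound-file signature, and .xlsx workbooks are ZIP containers that start with "PK". The class name ExcelMagic and the method sniff are hypothetical names chosen for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper: guesses the container format from a file's leading bytes.
class ExcelMagic {
    // OLE2 compound-file signature used by legacy .xls workbooks.
    static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // ZIP local-file-header signature ("PK" 03 04) used by .xlsx workbooks.
    static final byte[] ZIP = { 0x50, 0x4B, 0x03, 0x04 };

    // True if data begins with every byte of prefix.
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    // Returns "xls", "xlsx-or-zip", or "text-or-unknown" for the given header bytes.
    static String sniff(byte[] head) {
        if (startsWith(head, OLE2)) {
            return "xls";
        }
        if (startsWith(head, ZIP)) {
            return "xlsx-or-zip";
        }
        return "text-or-unknown";
    }

    public static void main(String[] args) {
        byte[] csvHead = "carrier,origin\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(csvHead)); // plain text has neither signature
        System.out.println(sniff(ZIP));     // a ZIP header, as found in .xlsx
    }
}
```

A file that matches either signature is a binary container, so splitting it on newlines, as TextInputFormat does, yields byte fragments rather than rows.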

Solution: export your Excel files to CSV or TSV, and then you can load them with TextInputFormat.
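
Once the data is exported to CSV, each record arrives as one line of text, which is what TextInputFormat hands to the mapper. The following is a minimal plain-Java sketch of the per-line logic a mapper and reducer might apply, written without Hadoop classes so it runs on its own. The column layout (a carrier code in column 0), the sample rows, and the names CsvAggregateSketch and countByKey are assumptions made for illustration; the original post does not describe the data.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: counts rows per key column, mimicking a map + reduce pass.
class CsvAggregateSketch {
    // Naive split on commas; it does not handle quoted fields that contain commas.
    static String[] parseLine(String line) {
        return line.split(",", -1);
    }

    // "Map" step: extract the key from each line; "reduce" step: sum the counts per key.
    static Map<String, Integer> countByKey(String[] lines, int keyColumn) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] fields = parseLine(line);
            if (keyColumn >= fields.length) {
                continue; // skip rows that are too short
            }
            counts.merge(fields[keyColumn].trim(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Assumed sample rows: carrier,origin
        String[] lines = { "AA,JFK", "BA,LHR", "AA,SFO" };
        System.out.println(countByKey(lines, 0));
    }
}
```

In a real job the key extraction would live in the Mapper and the summing in the Reducer. If the export contains quoted fields, a proper CSV parser is a safer choice than String.split.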

score 0 · Answer

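I know this is a bit late, but someone has since created an Excel InputFormat as a standard solution for this kind of problem. Read this: https://sreejithrpillai.wordpress.com/2014/11/06/excel-inputformat-for-hadoop-mapreduce/

There is a GitHub project with the code base.

See here: https://github.com/sreejithpillai/ExcelRecordReaderMapReduce/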

score 0 · Answer

You can also use the HadoopOffice library, which lets you read and write Excel files with Hadoop and Spark. It is available on Maven Central and Spark Packages.

https://github.com/ZuInnoTe/hadoopoffice/wiki
