mapreduce - How to filter Parquet columns in Hadoop MapReduce

Question

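I have data stored in Parquet format in HDFS. I wrote a MapReduce job that runs over this data successfully, and now I want to filter which columns are passed as input to the map phase.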

How can I filter Parquet columns in Hadoop MapReduce?

score 0 · Accepted Answer

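You should set the parquet.read.schema property in the MR job configuration to a schema string that contains only the columns you need (it is a projection of the file's Parquet schema). And of course, use ExampleInputFormat.class as the input format.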

This problem puzzled me for a long time, until I read the source code (ParquetInputFormat.java, GroupReadSupport.java, and so on). ParquetInputFormat reads the data using the requested schema.
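
As a rough illustration of the schema string, here is a minimal sketch that builds a projection from a list of columns. The ProjectionSchema class, the message name, and the user_id and country columns are hypothetical; the names and types must match what your file's schema actually declares. The driver wiring is shown only in comments, since it needs Hadoop and parquet-hadoop on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: builds a schema string for the parquet.read.schema property. */
final class ProjectionSchema {

    // Property name quoted in the answer above.
    static final String READ_SCHEMA_KEY = "parquet.read.schema";

    /**
     * @param messageName name given to the projected message (illustrative)
     * @param columns     column name -> "repetition type" as declared in the file
     *                    schema, for example "optional int32"
     */
    static String build(String messageName, Map<String, String> columns) {
        StringBuilder sb = new StringBuilder("message ").append(messageName).append(" {");
        for (Map.Entry<String, String> column : columns.entrySet()) {
            // Each field is written as: <repetition> <type> <name>;
            sb.append(' ').append(column.getValue())
              .append(' ').append(column.getKey()).append(';');
        }
        return sb.append(" }").toString();
    }

    public static void main(String[] args) {
        // Assumed columns; replace with columns that exist in your Parquet files.
        Map<String, String> wanted = new LinkedHashMap<>();
        wanted.put("user_id", "required int64");
        wanted.put("country", "optional binary");

        String schema = build("projection", wanted);
        System.out.println(READ_SCHEMA_KEY + " = " + schema);

        // In the MapReduce driver, the string would then be passed along like this:
        //   job.getConfiguration().set(READ_SCHEMA_KEY, schema);
        //   job.setInputFormatClass(ExampleInputFormat.class);
    }
}
```

With a projection like this, the Group values handed to the mapper should carry only the listed columns, so the remaining columns do not need to be read.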
