I need to read Avro records serialized in Avro files in HDFS. To do that, I use AvroKeyInputFormat, so my mapper receives the records it reads as keys.

My question is: how can I control the split size? With the text input format, this consists of defining the size in bytes. Here I need to define how many records each split will contain.

I would like to treat all the files in my input directory as one big file. Do I have to use CombineFileInputFormat? Is it possible to use it with Avro?

1 Answer

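Splits honor logical record boundaries, and the minimum and maximum bounds are given in bytes. The text input format will not break a line in a text file even though the split boundaries are expressed in bytes.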

To keep each file in a single split, you can either set the maximum split size to Long.MAX_VALUE, or override the isSplitable method in your code and return false.
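To see how the byte bounds interact, here is a minimal sketch in plain Java with no Hadoop dependency. It assumes the split size is derived as max(minSize, min(maxSize, blockSize)), which is how the new-API FileInputFormat (the base class of AvroKeyInputFormat) computes it as far as I know. The class name, the 128 MiB block size, the 1 GiB file and the simplified split count are illustrative assumptions, not values from the question.

```java
// Sketch only: mirrors the assumed rule max(minSize, min(maxSize, blockSize)).
// It does not call Hadoop; names and sizes below are hypothetical.
final class SplitSizeSketch {

    /** Effective split size in bytes for the given bounds and block size. */
    static long computeSplitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    /**
     * Simplified split count for a non-empty file: plain ceiling division,
     * written to avoid overflow when splitSize is Long.MAX_VALUE. It ignores
     * how Hadoop treats a small trailing remainder.
     */
    static long splitCount(long fileLength, long splitSize) {
        return fileLength / splitSize + (fileLength % splitSize == 0 ? 0 : 1);
    }

    public static void main(String[] args) {
        long block = 128L * 1024 * 1024;   // assumed HDFS block size
        long file = 1024L * 1024 * 1024;   // hypothetical 1 GiB Avro file

        // Default-like bounds: the block size wins.
        long defaults = computeSplitSize(1L, Long.MAX_VALUE, block);
        // Lowering the maximum below the block size shrinks the splits.
        long smallMax = computeSplitSize(1L, 32L * 1024 * 1024, block);
        // Raising the minimum above the file size keeps the file in one split.
        long hugeMin = computeSplitSize(Long.MAX_VALUE, Long.MAX_VALUE, block);

        System.out.println("default bounds: " + splitCount(file, defaults) + " splits");
        System.out.println("32 MiB maximum: " + splitCount(file, smallMax) + " splits");
        System.out.println("huge minimum:   " + splitCount(file, hugeMin) + " splits");
    }
}
```

If that formula holds for your Hadoop version, a maximum of Long.MAX_VALUE on its own leaves the split at the block size, and it is the minimum (FileInputFormat.setMinInputSplitSize, or mapreduce.input.fileinputformat.split.minsize) that grows a split beyond one block. Overriding isSplitable to return false avoids depending on these bounds at all. Either way the unit is bytes, so this does not give you a fixed number of records per split.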

Answered 2013-06-12T06:54:22.560