Is it the HDFS block size of 64 MB? Is there a configuration parameter I can use to change it?
For a mapper reading gzip files, is it true that the number of gzip files must be equal to the number of mappers?
This is dependent on your:

- Input format - some input formats (`NLineInputFormat`, `WholeFileInputFormat`) work on boundaries other than the block size. In general though, anything extending `FileInputFormat` will use the block boundaries as guides.
- `FileInputFormat` configuration properties - `mapred.min.split.size` and `mapred.max.split.size` usually default to `1` and `Long.MAX_VALUE`, but if these are overridden in your system configuration or in your job, they change the amount of data processed by each mapper and the number of mapper tasks spawned (see the sketch after this list).
- Non-splittable compression - a gzip file cannot be split, so it is consumed by a single mapper, unless you use something like `CombineFileInputFormat` or `CompositeInputFormat`.
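Under the hood, a `FileInputFormat`-derived split size is, roughly, `max(minSplitSize, min(maxSplitSize, blockSize))`. Below is a minimal, self-contained sketch of that calculation; the class name and the 200 MB file length are illustrative values only, and the real implementation also allows the last split to be slightly larger than the computed size:

```java
// Sketch of FileInputFormat-style split sizing:
// splitSize = max(minSplitSize, min(maxSplitSize, blockSize)).
// The 64 MB block size and the property defaults mirror the values
// discussed above; adjust them to your cluster's settings.
public class SplitSizeSketch {

    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize  = 64L * 1024 * 1024;     // 64 MB HDFS block
        long minSize    = 1L;                    // mapred.min.split.size default
        long maxSize    = Long.MAX_VALUE;        // mapred.max.split.size default

        long fileLength = 200L * 1024 * 1024;    // hypothetical 200 MB splittable file
        long splitSize  = computeSplitSize(blockSize, minSize, maxSize);

        // Each full split gets its own map task; the remainder forms one more.
        long splits = (fileLength + splitSize - 1) / splitSize;
        System.out.println("split size = " + splitSize + ", map tasks ~ " + splits);
    }
}
```

With the defaults, the split size equals the 64 MB block size, so the 200 MB file above would be handled by roughly four map tasks.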
So if you have a file with a block size of 64m, but either want to process more or less data than this per map task, then you should just be able to set the following job configuration properties:
- `mapred.min.split.size` - larger than the default if you want to use fewer mappers, at the expense of (potentially) losing data locality (all the data processed by a single map task may now live on two or more data nodes)
- `mapred.max.split.size` - smaller than the default if you want to use more mappers (say you have a CPU-intensive mapper) to process each file

If you're using MR2 / YARN then the above properties are deprecated and replaced by the following (a configuration sketch is shown after this list):
- `mapreduce.input.fileinputformat.split.minsize`
- `mapreduce.input.fileinputformat.split.maxsize`
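Assuming the MR2 (new API) Java client, a minimal sketch of setting these on a job might look like the following; the class name, job name, and the 128 MB / 256 MB values are placeholders, not recommendations:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // MR2 / YARN property names; on MR1 you would set
        // mapred.min.split.size / mapred.max.split.size instead.
        conf.setLong("mapreduce.input.fileinputformat.split.minsize", 128L * 1024 * 1024);
        conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 256L * 1024 * 1024);

        Job job = Job.getInstance(conf, "split-size-demo");

        // The new-API FileInputFormat also exposes convenience setters
        // that write the same properties into the job configuration.
        FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);
        FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
    }
}
```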