mapreduce - Does the number of map tasks spawned depend on the number of jobnodes?

Question

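The number of map() tasks spawned is equal to the number of 64MB blocks of input data. Suppose we have 2 input files of 1MB each; both files will be stored in a single block. But when I run my MR program with 1 namenode and 2 jobnodes, I see 2 map() tasks spawned, one for each file. So is this because the system tried to split the job between the 2 nodes, i.e.,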

Number of map() spawned = number of 64MB blocks of input data * number of jobnodes ?

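Also, the MapReduce tutorial says that a 10TB file with a block size of 128MB will spawn 82,000 maps. However, by the logic that the number of maps depends only on the block size, 78,125 map tasks should be spawned (10TB/128MB). I don't understand where the extra maps come from. It would be great if anyone could share their thoughts on this. Thanks. :)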

score 0 · Accepted Answer

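Note that the input split size and block size are not always honored. If the input file is gzip-compressed, it is not splittable. So if one of the gzip files is 1500MB, it will not be split and a single mapper will read all of it. It is better to use block compression with Snappy or LZO together with the SequenceFile format.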
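To make the effect concrete, here is a minimal sketch of how a non-splittable file caps the mapper count. The function `estimate_mappers` and its `splittable` flag are hypothetical names used only for illustration; Hadoop itself decides splittability from the compression codec, not from a flag like this.

```python
import math

def estimate_mappers(file_size_mb: float, split_size_mb: float, splittable: bool) -> int:
    """Rough mapper estimate for a single file (illustrative, not Hadoop's code)."""
    if not splittable:
        # A gzip stream cannot be read from the middle,
        # so one mapper has to process the whole file.
        return 1
    # A splittable file gets one mapper per split, and at least one.
    return max(1, math.ceil(file_size_mb / split_size_mb))

gzip_mappers = estimate_mappers(1500, 64, splittable=False)
plain_mappers = estimate_mappers(1500, 64, splittable=True)
print(gzip_mappers, plain_mappers)
```

With a 64MB split size, the 1500MB gzip file gets a single mapper, while the same amount of splittable data would be spread over 24.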

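Also, if the input is an HBase table, the input split size is not used. For an HBase table, the only splitting that happens is the one that keeps the table's regions at the right size. If the table is not well distributed, manually split it into multiple regions.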

score 0 · Accepted Answer

The number of mappers depends on just one thing: the number of InputSplits created by the InputFormat you are using. (The default is TextInputFormat, which treats \n as the record delimiter.) It does not depend directly on the number of nodes, the file size, or the block size (64MB or whatever). Ideally each split equals one block, but this cannot always be guaranteed. The MapReduce framework tries its best to optimise the process, and as part of that a file smaller than the block size gets just 1 mapper for the entire file. Splits are also not created per line. For example, if your file has 20 lines and you are using TextInputFormat, you might think that you'll get 20 mappers (one per \n-delimited line). But this does not happen: creating 20 mappers for such a small file would be unwanted overhead, so the whole file becomes a single split and each line is just one record passed to map().

And if a split is larger than the block size, the remaining data is fetched from another block, possibly on a different machine, so that it can be processed.
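How a split can end up larger or smaller than a block can be sketched with the rule that, to my knowledge, Hadoop's FileInputFormat applies: the split size is the block size clamped between a configured minimum and maximum (the `mapreduce.input.fileinputformat.split.minsize` and `split.maxsize` settings). The Python below is only a paraphrase of that rule under this assumption, not Hadoop code; `compute_split_size` and its parameter names are illustrative.

```python
MB = 1024 * 1024
NO_LIMIT = 2**63 - 1  # stands in for "no maximum configured"

def compute_split_size(block_size: int, min_size: int, max_size: int) -> int:
    """Clamp the block size between the configured minimum and maximum."""
    return max(min_size, min(max_size, block_size))

# Defaults: the split size simply follows the block size.
default_split = compute_split_size(64 * MB, 1, NO_LIMIT)

# A minimum above the block size makes each split span more than one block,
# so part of its data may have to be read from a remote block.
larger_split = compute_split_size(64 * MB, 128 * MB, NO_LIMIT)

# A maximum below the block size yields several splits per block.
smaller_split = compute_split_size(64 * MB, 1, 32 * MB)

print(default_split // MB, larger_split // MB, smaller_split // MB)
```

The middle case is the one described above: once a split is bigger than a block, not all of its data can be local to the mapper.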

About the MapReduce tutorial:

If you have 10TB of data and a 128MB block size, then 10TB = 10*1024*1024 MB, so (10*1024*1024)/128 = 81,920 mappers, which is approximately 82,000.
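The gap between the tutorial's figure and the 78,125 from the question comes down to binary versus decimal units. A quick check of both readings (variable names are just for illustration):

```python
BLOCK_MB = 128

# Binary units: 1 TB = 1024 * 1024 MB
binary_mb = 10 * 1024 * 1024
# Decimal units: 1 TB = 1000 * 1000 MB
decimal_mb = 10 * 1000 * 1000

maps_binary = binary_mb // BLOCK_MB    # the tutorial's reading
maps_decimal = decimal_mb // BLOCK_MB  # the reading used in the question

print(maps_binary, maps_decimal, maps_binary - maps_decimal)
```

So the "extra" maps are not extra work; 78,125 is simply what you get when 10TB is taken as 10,000,000 MB instead of 10,485,760 MB.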

Hope this clears things up.

score 0 · Accepted Answer

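By default, one mapper is spawned per input file. If the input file is larger than the split size (which is usually kept the same as the block size), then the number of mappers for that file is ceil(file size / split size).

Now say you have 5 input files and the split size is kept at 64 MB: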

file1 - 10 MB
file2 - 30 MB
file3 - 50 MB
file4 - 100 MB
file5 - 1500 MB

Number of mappers launched:

file1 - 1
file2 - 1
file3 - 1
file4 - 2
file5 - 24

Total mappers - 29
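The worked example can be checked with a few lines of Python. This is a simplified sketch that uses the plain ceil rule stated in this answer; the dictionary and variable names are made up for the illustration.

```python
import math

SPLIT_MB = 64

# Input file sizes in MB, taken from the example.
files = {"file1": 10, "file2": 30, "file3": 50, "file4": 100, "file5": 1500}

# Splits never span files, so every file gets at least one mapper.
mappers = {name: max(1, math.ceil(size / SPLIT_MB)) for name, size in files.items()}
total = sum(mappers.values())

for name, count in mappers.items():
    print(f"{name} - {count}")
print(f"Total mappers - {total}")
```

Note that the node count does not appear anywhere in the calculation. One caveat: Hadoop's FileInputFormat lets the last split of a file run roughly 10% over the split size, so a file only slightly above a multiple of the split size can get one mapper fewer than plain ceil suggests.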
