input - What is the main reason for input splitting in MapReduce?

Question

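The MapReduce paper describes the input files being divided into M input splits. I know that HDFS in Hadoop automatically partitions files into 64 MB (default) blocks and then replicates these blocks to several other nodes in the cluster to provide fault tolerance. I'd like to know whether this partitioning of files in HDFS is the input splitting described in the MapReduce paper. Is fault tolerance the single reason for this splitting, or are there more important reasons?

And what if I run MapReduce over a cluster of nodes without a distributed file system (data only on local disks with a common file system)? Do I need to split the input files on local disk before the map phase?

Thanks for your answers.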

score 3 · Accepted Answer

I'd like to add some missing concepts (the other answers confused me).

HDFS

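Files are stored as blocks (for failure/node tolerance). The block size is typically 64 MB to 128 MB; 64 MB is assumed here. So a file is split into blocks, and the blocks are stored on different nodes in the cluster. Each block is replicated according to the replication factor (default = 3).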
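
To make the block arithmetic concrete, here is a minimal sketch that counts the blocks of a file and the raw storage its replicas take. The function name `hdfs_block_layout` is made up for this illustration and is not a Hadoop API; it also assumes the last block only occupies its actual size rather than a full block.

```python
MB = 1024 * 1024

def hdfs_block_layout(file_size, block_size=64 * MB, replication=3):
    """Return (number of blocks, size of last block, total raw bytes stored).

    Hypothetical helper for illustration only; not part of Hadoop.
    """
    if file_size == 0:
        return 0, 0, 0
    # Integer ceiling division: a partial trailing block still counts as a block.
    num_blocks = (file_size + block_size - 1) // block_size
    last_block = file_size - (num_blocks - 1) * block_size
    # Every byte of the file is stored once per replica.
    raw_bytes = file_size * replication
    return num_blocks, last_block, raw_bytes

blocks, last, raw = hdfs_block_layout(200 * MB)
print(blocks, last // MB, raw // MB)  # 4 8 600
```

So a 200 MB file with these settings would occupy four blocks (three full ones and one of 8 MB), and about 600 MB of raw disk across the cluster.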

Map-Reduce

Files already stored in HDFS are logically divided into INPUT-SPLITS. The split size can be set by the user:

Property name           Type   Default value
mapred.min.split.size   int    1
mapred.max.split.size   long   Long.MAX_VALUE

The split size is then calculated by the following formula:

max(minimumSize, min(maximumSize, blockSize))
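
As a quick check of the formula above, the sketch below evaluates it for a few settings. `compute_split_size` is a name chosen for this example, and the defaults simply mirror the two property defaults listed earlier.

```python
MB = 1024 * 1024
LONG_MAX = 2**63 - 1  # value of Java's Long.MAX_VALUE

def compute_split_size(block_size, min_size=1, max_size=LONG_MAX):
    """Apply max(minimumSize, min(maximumSize, blockSize))."""
    return max(min_size, min(max_size, block_size))

# With the defaults, the split size equals the block size.
print(compute_split_size(64 * MB) // MB)                     # 64
# Raising the minimum above the block size makes splits larger than a block.
print(compute_split_size(64 * MB, min_size=128 * MB) // MB)  # 128
# Lowering the maximum below the block size makes splits smaller than a block.
print(compute_split_size(64 * MB, max_size=32 * MB) // MB)   # 32
```

With the defaults the split size works out to the block size, which is probably why blocks and splits are so easily confused.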

Note: splits are logical.
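
One way to picture "logical" is that a split is just an (offset, length) description of a byte range, and no data is copied or moved when splits are computed. The sketch below is a simplified illustration with a made-up function name; Hadoop's own input formats handle further details that are ignored here.

```python
def logical_splits(file_size, split_size):
    """Return a list of (offset, length) pairs that cover the file.

    Only byte ranges are produced; the file itself is never read.
    """
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    splits = []
    offset = 0
    while offset < file_size:
        # The final split may be shorter than split_size.
        length = min(split_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits

print(logical_splits(200, 64))  # [(0, 64), (64, 64), (128, 64), (192, 8)]
```

Each pair could then be handed to one map task, which reads only its own range.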

Hope this answers your question now.

 I'd like to know if this partitioning of files in HDFS means the input splitting described in mentioned MapReduce papers.

No, not at all. HDFS blocks and Map-Reduce splits are not the same thing.

Is fault tolerance single reason of this splitting or are there more important reasons?

No, distributed computation would be the reason.

And what if I have MapReduce over cluster of nodes without distributed file system (data only on local disks with common file system)? Do I need to split input files on local disk before map phase?

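In your case, I guess, yes: you would have to split the input files for the Map phase, and you would also have to split the intermediate output (from the Mappers) for the Reduce phase. Other issues: data consistency, fault tolerance, data loss (in hadoop = 1%).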
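
If you do have to split a local file yourself, one possible approach (a sketch under my own assumptions, not something Hadoop provides for local disks) is to cut the file into byte ranges and push each cut forward to the next newline, so that no record is torn between two map tasks. The helper name `line_aligned_splits` is hypothetical.

```python
import os

def line_aligned_splits(path, split_size):
    """Return (start, end) byte ranges that each end right after a newline or at EOF."""
    file_size = os.path.getsize(path)
    splits = []
    start = 0
    with open(path, "rb") as f:
        while start < file_size:
            end = min(start + split_size, file_size)
            if end < file_size:
                # Move the cut forward to the end of a line. If the cut already
                # sits on a line boundary, this takes in one more whole line.
                f.seek(end)
                f.readline()
                end = f.tell()
            splits.append((start, end))
            start = end
    return splits

def read_split(path, start, end):
    """Read the bytes of one split, as a map task on that node might."""
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start)
```

Each node would then run its map function over `read_split(path, start, end)` for the ranges assigned to it. Distributing the ranges and recovering from failed nodes are left out, and those are exactly the parts a framework like Hadoop does for you.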

Map-Reduce is designed for distributed computing, so using Map-Reduce in a non-distributed environment is of no use.

Thanks

score 1 · Accepted Answer

I'd like to know if this partitioning of files in HDFS means the input splitting described in mentioned MapReduce papers.

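No, the input splitting in MapReduce is done to exploit the computing power of multiple processors during the reduce phase. A mapper takes a large amount of data and splits it into logical partitions (most of the time specified by the programmer's custom mapper implementation). This data then goes to individual nodes, where independent processes called reducers process the data, and the results are finally collated.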
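
The flow described above (map output partitioned by key, each partition handled by its own reducer, results collated) can be sketched as a toy word count. Everything here runs sequentially in one process purely for illustration; all function names are made up, and a real framework would run the mappers and reducers on separate nodes.

```python
import zlib
from collections import defaultdict

def map_words(split_text):
    """Map step: emit (word, 1) for every word in one input split."""
    return [(word, 1) for word in split_text.split()]

def partition(key, num_reducers):
    """Choose a reducer for a key using crc32, which is stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def run_word_count(splits, num_reducers=2):
    # One bucket of intermediate (word -> counts) data per reducer.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for split_text in splits:
        for word, count in map_words(split_text):
            buckets[partition(word, num_reducers)][word].append(count)
    # Reduce step: each reducer sums the counts of the keys in its bucket,
    # and the per-reducer results are collated into one dictionary.
    result = {}
    for bucket in buckets:
        for word, counts in bucket.items():
            result[word] = sum(counts)
    return result

print(sorted(run_word_count(["a b a", "b c"]).items()))
# [('a', 2), ('b', 2), ('c', 1)]
```

Because the same word always hashes to the same bucket, each reducer sees every count for its keys no matter which split they came from.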

Is fault tolerance single reason of this splitting or are there more important reasons?

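No, that is not the only reason for doing this. You can compare it to the block size at the file-system level, which ensures that data is transferred in blocks, compressed block by block, and that I/O buffers are allocated.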
