hadoop - Input split and block in Hadoop

Question

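I have a 100 MB file, and the default block size is 64 MB. If I do not set an input split size, the default split size equals the block size, so the split size is also 64 MB.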

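When I load this 100 MB file into HDFS, it is divided into 2 blocks: 64 MB and 36 MB. For example, suppose the poem lyrics below are 100 MB in size. If I load this data into HDFS, suppose that everything from line 1 to the middle of line 16 is exactly 64 MB and forms one split/block (up to "It made"), and the remaining half of line 16 ("the children laugh and play") through the end of the file forms the second block (36 MB). Two mappers will run.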

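My question is: how will the first mapper handle line 16 (the 16th line, at the end of block 1), given that the block contains only half of that line? And how will the second mapper handle the first line of block 2, since it also contains only half a line?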

Mary had a little lamb
Little lamb, little lamb
Mary had a little lamb
Its fleece was white as snow
And everywhere that Mary went
Mary went, Mary went
Everywhere that Mary went
The lamb was sure to go

He followed her to school one day
School one day, school one day
He followed her to school one day
Which was against the rule
It made the children laugh and play
Laugh and play, laugh and play
It made the children laugh and play
To see a lamb at school

And so the teacher turned him out
Turned him out, turned him out
And so the teacher turned him out
But still he lingered near
And waited patiently
Patiently, patiently
And wai-aited patiently
Til Mary did appear

Or, when cutting at 64 MB, will Hadoop keep line 16 whole instead of splitting a single line?

score 1 · Accepted Answer

In Hadoop, data is read according to the input split size and the block size.
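To make the sizes in the question concrete, here is a minimal sketch of the split arithmetic. It is modeled on the loop in Hadoop's `FileInputFormat`, including its 10% slop factor, but the class and method names (`SplitSketch`, `computeSplits`) are hypothetical and the sketch ignores block locations and minimum/maximum split settings.

```java
import java.util.ArrayList;
import java.util.List;

class SplitSketch {
    // The last split may be up to 10% larger than splitSize before another split is cut
    // (assumed to mirror FileInputFormat's slop factor).
    static final double SPLIT_SLOP = 1.1;

    /** Returns {startOffset, length} pairs for a file of the given length. */
    static List<long[]> computeSplits(long fileLength, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        long remaining = fileLength;
        // Cut full-size splits while clearly more than one split's worth of data remains.
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits.add(new long[] {fileLength - remaining, splitSize});
            remaining -= splitSize;
        }
        // Whatever is left becomes the final (possibly shorter) split.
        if (remaining != 0) {
            splits.add(new long[] {fileLength - remaining, remaining});
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        for (long[] s : computeSplits(100 * mb, 64 * mb)) {
            System.out.println("start=" + s[0] / mb + " MB, length=" + s[1] / mb + " MB");
        }
    }
}
```

For a 100 MB file and a 64 MB split size this yields two splits: one starting at offset 0 with length 64 MB and one starting at 64 MB with length 36 MB. Note that the cut is made purely by byte count, with no regard for where lines end.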

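A file is divided into several FileSplits based on size. Each input split is initialized with a start parameter that corresponds to its offset in the input.
When a LineRecordReader is initialized, it instantiates a LineReader that starts reading lines.
If a CompressionCodec is defined, the codec takes care of boundaries. Otherwise, if the start of the InputSplit is not 0, the reader backtracks 1 character and then skips the first line (up to the next \n or \r\n). The backtrack ensures that a valid line is not skipped when the split happens to begin exactly at the start of a line.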

Here is the code:

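// Excerpt from LineRecordReader initialization; start and end are the split's byte offsets.
boolean skipFirstLine = false;
if (codec != null) {
  // Compressed input: the codec's stream handles boundaries, so read to the end.
  in = new LineReader(codec.createInputStream(fileIn), job);
  end = Long.MAX_VALUE;
} else {
  if (start != 0) {
    // Not the first split: back up one byte so a line that begins exactly
    // at start is not lost when the first line is skipped below.
    skipFirstLine = true;
    --start;
    fileIn.seek(start);
  }
  in = new LineReader(fileIn, job);
}
if (skipFirstLine) {  // skip first line and re-establish "start".
  start += in.readLine(new Text(), 0,
                       (int)Math.min((long)Integer.MAX_VALUE, end - start));
}
this.pos = start;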

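Since the splits are computed on the client, the mappers do not need to run in sequence: every mapper already knows whether it needs to discard its first line.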

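So, in your case, the mapper for the first block B1 will read data from offset 0 through the end of the line "It made the children laugh and play".

The mapper for block B2 will read data from the line "To see a lamb at school" through the offset of the last line.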
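The behavior described above can be tried out without a cluster. The following is a simplified simulation over an in-memory byte array, not Hadoop's actual LineRecordReader: the class and method names (`SplitReaderSketch`, `readSplit`) are hypothetical, only '\n' line endings are handled, and the cut position of 30 bytes is an arbitrary stand-in for the 64 MB boundary.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class SplitReaderSketch {
    /** Returns the lines that a reader for the split [start, start + length) would emit. */
    static List<String> readSplit(byte[] data, int start, int length) {
        int end = start + length;
        int pos = start;
        if (start != 0) {
            // Back up one byte, then discard everything up to and including the next '\n'.
            // If the split begins exactly at a line start, data[start - 1] is '\n',
            // so only that newline is discarded and no valid line is lost.
            pos = start - 1;
            while (pos < data.length && data[pos] != '\n') pos++;
            pos++; // step past the '\n'
        }
        List<String> lines = new ArrayList<>();
        // A line is emitted if it STARTS before end, even if it finishes in the next split.
        while (pos < end && pos < data.length) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') pos++;
            lines.add(new String(data, lineStart, pos - lineStart, StandardCharsets.UTF_8));
            pos++; // step past the '\n'
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = ("Mary had a little lamb\n"
                + "Its fleece was white as snow\n"
                + "And everywhere that Mary went\n").getBytes(StandardCharsets.UTF_8);
        int cut = 30; // hypothetical block boundary, in the middle of the second line
        System.out.println("split 1: " + readSplit(data, 0, cut));
        System.out.println("split 2: " + readSplit(data, cut, data.length - cut));
    }
}
```

With the cut in the middle of the second line, the first split emits the first two lines in full and the second split emits only the third line, so every line is processed exactly once regardless of which reader runs first.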

For more detail, see:

https://hadoopabcd.wordpress.com/2015/03/10/hdfs-file-block-and-input-split/
How does Hadoop process records split across block boundaries?

score 0 · Accepted Answer

The first mapper will read the whole of line 16 (it keeps reading until it finds an end-of-line character).

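As you may recall, to apply MapReduce your input must be organized as key-value pairs. For TextInputFormat, which happens to be the default in Hadoop, those pairs are (offset_from_file_beginning, line_of_text). The text is broken into key-value pairs on the '\n' character. So if a line of text extends beyond the end of the input split, the mapper keeps reading until it finds a '\n'.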
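As an illustration of those (offset, line) pairs, here is a small plain-Java sketch that produces them for an in-memory string. It does not use the Hadoop API (where the key would be a LongWritable and the value a Text); the names `OffsetLinePairs` and `toRecords` are hypothetical, and only '\n' line endings are handled.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class OffsetLinePairs {
    /** Maps each line's starting byte offset to its text (without the trailing '\n'). */
    static Map<Long, String> toRecords(String text) {
        byte[] data = text.getBytes(StandardCharsets.UTF_8);
        Map<Long, String> records = new LinkedHashMap<>(); // keeps file order
        int pos = 0;
        while (pos < data.length) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') pos++;
            records.put((long) lineStart,
                    new String(data, lineStart, pos - lineStart, StandardCharsets.UTF_8));
            pos++; // step past the '\n'
        }
        return records;
    }

    public static void main(String[] args) {
        String text = "Mary had a little lamb\nLittle lamb, little lamb\n";
        toRecords(text).forEach((k, v) -> System.out.println("(" + k + ", " + v + ")"));
    }
}
```

For the two lines above, the keys are 0 and 23: the first line is 22 bytes long plus its newline, so the second line starts at byte 23.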
