My file is 100 MB and the default block size is 64 MB. If I do not set an input split size, the split size defaults to the block size, so here the split size is also 64 MB.
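(As an aside, the split size can also be set per job instead of being left at the block-size default. A minimal sketch using the new MapReduce API; the 32 MB cap is just an arbitrary example value:)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeConfig {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "split-size-demo");
            // Cap each input split at 32 MB instead of the 64 MB block size.
            // Equivalent to setting mapreduce.input.fileinputformat.split.maxsize.
            FileInputFormat.setMaxInputSplitSize(job, 32L * 1024 * 1024);
            FileInputFormat.setMinInputSplitSize(job, 1L);
        }
    }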
When I load this 100 MB file into HDFS, it will be divided into 2 blocks: 64 MB and 36 MB. For example, suppose the poem lyrics below make up a 100 MB file. If I load this data into HDFS, then the first split/block is exactly 64 MB and runs from line 1 to halfway through line 16 (up to "It made"), and the second block (36 MB) runs from the remaining half of line 16 ("the children laugh and play") to the end of the file. Two mappers will be at work.

My question is: how will the first mapper handle line 16 (line 16 of block 1), given that the block holds only half of that line? And how will the second mapper handle the first line of block 2, since it too is only half a line?
Mary had a little lamb
Little lamb, little lamb
Mary had a little lamb
Its fleece was white as snow

And everywhere that Mary went
Mary went, Mary went
Everywhere that Mary went
The lamb was sure to go

He followed her to school one day
School one day, school one day
He followed her to school one day
Which was against the rule

It made the children laugh and play
Laugh and play, laugh and play
It made the children laugh and play
To see a lamb at school

And so the teacher turned him out
Turned him out, turned him out
And so the teacher turned him out
But still he lingered near

And waited patiently
Patiently, patiently
And wai-aited patiently
Til Mary did appear
Or, when making the 64 MB split, will Hadoop consider the whole of line 16 instead of cutting the line in two?
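For concreteness, here is a minimal, self-contained simulation of the split-boundary convention that Hadoop's LineRecordReader is generally described as following: every reader except the first discards its (possibly partial) first line, and every reader may read past its split's end to finish the last line it started. This is an illustrative sketch, not the actual Hadoop source; all names below are made up for the demo:

    /**
     * Simulates one record reader per split over an in-memory "file".
     * Illustrative only -- not the real Hadoop LineRecordReader.
     */
    public class SplitBoundaryDemo {

        // Emit the records belonging to the split [splitStart, splitEnd).
        static void readSplit(String data, int splitStart, int splitEnd) {
            int pos = splitStart;
            if (splitStart != 0) {
                // Not the first split: the previous reader already emitted
                // the line that crosses into this split, so skip to the
                // character just after the next newline.
                while (pos < data.length() && data.charAt(pos++) != '\n') { }
            }
            // Read whole lines as long as the line STARTS at or before the
            // split's end; the last line may run past splitEnd into the
            // next block, so a line cut in half by the block boundary is
            // still emitted as one complete record.
            while (pos <= splitEnd && pos < data.length()) {
                int nl = data.indexOf('\n', pos);
                if (nl < 0) nl = data.length();
                System.out.println("  record @" + pos + ": " + data.substring(pos, nl));
                pos = nl + 1;
            }
        }

        public static void main(String[] args) {
            String lines = "It made the children laugh and play\n"
                         + "Laugh and play, laugh and play\n";
            int boundary = 7; // pretend the block boundary falls right after "It made"
            System.out.println("Split 1 [0, " + boundary + "):");
            readSplit(lines, 0, boundary);
            System.out.println("Split 2 [" + boundary + ", " + lines.length() + "):");
            readSplit(lines, boundary, lines.length());
        }
    }

If that convention holds, split 1 would emit the whole of "It made the children laugh and play" even though the boundary falls in its middle, while split 2 would skip the dangling half-line and begin at the next full line.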