hadoop - file storage, block size and input splits in Hadoop

Question

Consider this scenario:

I have 4 files of 6 MB each. The HDFS block size is 64 MB.

1 block will hold all these files. It has some extra space. If new files are added, they will be accommodated here.

Now, when the InputFormat calculates the input splits for a MapReduce job (the split size is usually the HDFS block size, so that each split can be loaded into memory for processing, thereby reducing seek time), how many input splits are made here?

Is it one, because all 4 files are contained within a block?
Or is it one input split per file?
How is this determined? What if I want all files to be processed as a single input split?

score 3 · Accepted Answer

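"1 block will hold all these files. It has some extra space. If new files are added, they will be accommodated here [...] Is it one, because all 4 files are contained within a block?"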

You actually have 4 blocks. It doesn't matter whether all the files could fit into a single block.

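EDIT: Blocks belong to a file, not the other way around. HDFS is designed to store large files that will almost certainly be larger than the block size. Storing multiple files per block would add unnecessary complexity to the namenode...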

Instead of a file being blk0001, it would now be blk0001 {file-start -> file-end}.
How do you append to a file?
What happens when you delete a file?
Etc...
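
To make the "blocks belong to a file" point concrete, here is a minimal sketch of the block arithmetic. It is not HDFS code: the file names and the blocks_for_file helper are made up for illustration, and it only assumes that a file's last block holds just the bytes written to it.

```python
MB = 1024 * 1024

def blocks_for_file(file_size, block_size=64 * MB):
    """Return the sizes of the blocks that one file occupies.

    Blocks are never shared between files, and the last block of a
    file is only as large as the data it holds.
    """
    sizes = []
    remaining = file_size
    while remaining > 0:
        chunk = min(remaining, block_size)
        sizes.append(chunk)
        remaining -= chunk
    return sizes

# Hypothetical file names; the sizes come from the question (4 files of 6 MB).
files = {"a.txt": 6 * MB, "b.txt": 6 * MB, "c.txt": 6 * MB, "d.txt": 6 * MB}
layout = {name: blocks_for_file(size) for name, size in files.items()}

total_blocks = sum(len(sizes) for sizes in layout.values())
total_bytes = sum(sum(sizes) for sizes in layout.values())
print(total_blocks, total_bytes // MB)
```

Under these assumptions the four files map to four blocks holding 24 MB in total, not to one shared 64 MB block.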

"Or is it one input split per file?"

There is still 1 split per file.

"How is this determined?"

This is how.
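
As a rough illustration of why the result is one split per file, the sketch below mimics, in simplified form, the way FileInputFormat derives splits: each file is handled on its own, so a split never spans two files. The function names are hypothetical, the 1.1 slop factor and the max/min formula are my reading of the Hadoop source rather than something stated in this thread, and zero-length files are ignored.

```python
MB = 1024 * 1024
SPLIT_SLOP = 1.1  # assumed: the last split may be up to 10% larger than split_size

def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    # Split size is the block size, clamped between the min and max settings.
    return max(min_size, min(max_size, block_size))

def splits_for_file(length, block_size=64 * MB, min_size=1, max_size=2**63 - 1):
    """Return (offset, length) pairs for a single file."""
    split_size = compute_split_size(block_size, min_size, max_size)
    splits = []
    remaining = length
    while remaining / split_size > SPLIT_SLOP:
        splits.append((length - remaining, split_size))
        remaining -= split_size
    if remaining > 0:
        # Whatever is left becomes the final (possibly short) split.
        splits.append((length - remaining, remaining))
    return splits

def splits_for_job(file_lengths, **kwargs):
    """Compute splits file by file; no split ever covers more than one file."""
    result = []
    for name, length in file_lengths.items():
        for offset, size in splits_for_file(length, **kwargs):
            result.append((name, offset, size))
    return result

# Hypothetical file names; sizes are taken from the question.
files = {"a.txt": 6 * MB, "b.txt": 6 * MB, "c.txt": 6 * MB, "d.txt": 6 * MB}
job_splits = splits_for_job(files)
print(len(job_splits))
```

With the question's input this yields four splits of 6 MB each, one per file.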

"What if I want all files to be processed as a single input split?"

Use a different input format, such as MultiFileInputFormat or CombineFileInputFormat.
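
The idea behind a combining input format can be sketched as a greedy packing of whole files into splits. This is only an illustration under simplifying assumptions: the combine_splits helper and the file names are hypothetical, and a real implementation such as Hadoop's CombineFileInputFormat also takes node and rack locality into account, which is ignored here.

```python
MB = 1024 * 1024

def combine_splits(file_lengths, max_split_size=64 * MB):
    """Greedily pack whole files into combined splits.

    A new split is started whenever adding the next file would push
    the current split past max_split_size.
    """
    combined = []
    current, current_size = [], 0
    for name, length in file_lengths.items():
        if current and current_size + length > max_split_size:
            combined.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += length
    if current:
        combined.append(current)
    return combined

# Hypothetical file names; sizes are taken from the question.
files = {"a.txt": 6 * MB, "b.txt": 6 * MB, "c.txt": 6 * MB, "d.txt": 6 * MB}
one_split = combine_splits(files)
two_splits = combine_splits(files, max_split_size=12 * MB)
print(len(one_split), len(two_splits))
```

With a 64 MB limit, all four 6 MB files land in a single split, so one mapper would process them together. Lowering the limit to 12 MB gives two splits of two files each.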

score 0 · Accepted Answer

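Each file will be stored in a separate block, but a file does not occupy the whole underlying storage block; it uses less physical storage.
HDFS is not meant for small files - look at this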
