
I am writing a program with Hadoop. The problematic code is below (it runs in the mapper):

byte[] tmp = new byte[2];
tmp[0] = 0x01;
tmp[1] = 0x02;
BytesWritable outputKey = new BytesWritable();
outputKey.set(tmp, 0, 2);

However, when I looked at the key the reducer received from the mapper, it surprised me:

byte[] reducerKey = key.getBytes();

reducerKey looks like this:

reducerKey[0] -> 0x01;
reducerKey[1] -> 0x02;
reducerKey[2] -> 0x00;

Why is the tmp I set only 2 bytes long, while the array I get back is 3 bytes long?
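For reference, getBytes() returns the whole backing buffer, whose capacity can exceed the logical size; only the first getLength() bytes are meaningful. A minimal sketch of trimming the key inside the reducer (copyBytes() assumes a reasonably recent Hadoop release):

import java.util.Arrays;

// Inside the reducer: keep only the logically valid part of the key.
byte[] raw = key.getBytes();                        // backing buffer, 3 bytes in this example
byte[] valid = Arrays.copyOf(raw, key.getLength()); // the 2 meaningful bytes

// Equivalent convenience method on recent Hadoop versions:
byte[] alsoValid = key.copyBytes();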

Then I read the source code of BytesWritable.setSize(size) and found this:

public void setSize(int size) {
  if (size > getCapacity()) {
    setCapacity(size * 3 / 2);
  }
  this.size = size;
}

So when bytes are written into a BytesWritable, why does it allocate a backing byte[] of 1.5 * size (in my case 2 * 3 / 2 = 3, which is where the third byte comes from)? It looks like a waste of space to me, since the extra 0.5 * size is never used.


1 Answer


This is a common programming practice to amortize the cost of dynamic array resizes.
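As a rough illustration (plain Java, not Hadoop code), growing a buffer geometrically keeps the total copying work linear in the number of bytes appended:

import java.util.Arrays;

// Toy demonstration: appending n bytes into a buffer that grows by a factor
// of 1.5 copies only a small constant multiple of n bytes in total, whereas
// growing by exactly one slot per append would copy on the order of n*n/2 bytes.
public class GrowthDemo {
  public static void main(String[] args) {
    int n = 1_000_000;
    byte[] buf = new byte[1];
    int size = 0;
    long copied = 0;
    for (int i = 0; i < n; i++) {
      if (size == buf.length) {
        int newCapacity = Math.max(buf.length + 1, buf.length * 3 / 2);
        copied += size;                       // bytes moved by this resize
        buf = Arrays.copyOf(buf, newCapacity);
      }
      buf[size++] = 0x01;
    }
    System.out.println("appended " + n + " bytes, copied " + copied + " during resizes");
  }
}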

Now, why isn't this an issue, and why is it a good default behavior for Hadoop Writables?

  1. Writable objects are usually singletons since they can be, and are, reused. You usually want to size them to fit your largest buffer. Creating a new Writable each time wastes time and can put pressure on the GC, so it makes sense to make them a little larger than the largest buffer used so far.
  2. If you want to avoid the extra room, you can use the BytesWritable(byte[] bytes) constructor or setCapacity. Note that the constructor is much more efficient than set(), since it does not have to copy the data: only two references have to be set. (See the sketch after this list.)
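A minimal sketch of both points, assuming a mapper that emits BytesWritable keys (the class, field, and value names here are made up for illustration):

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative mapper: one reused BytesWritable instead of allocating per record.
public class ExampleMapper extends Mapper<Object, BytesWritable, BytesWritable, NullWritable> {

  // Point 1: reuse a single instance; it keeps its capacity across records.
  private final BytesWritable outputKey = new BytesWritable();

  @Override
  protected void map(Object key, BytesWritable value, Context context)
      throws java.io.IOException, InterruptedException {
    byte[] tmp = new byte[] { 0x01, 0x02 };

    // set() copies the data and may over-allocate (1.5x) when it has to grow.
    outputKey.set(tmp, 0, tmp.length);
    context.write(outputKey, NullWritable.get());

    // Point 2: the byte[] constructor just wraps the array, so
    // capacity == size and getBytes() has no trailing padding.
    BytesWritable exactKey = new BytesWritable(tmp);
    context.write(exactKey, NullWritable.get());
  }
}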
Answered 2013-04-12T06:24:08.613