
I am writing a program with Hadoop. The problematic code is below (it runs in the mapper):

byte[] tmp = new byte[2];
tmp[0] = 0x01;
tmp[1] = 0x02;
BytesWritable outputKey = new BytesWritable();
outputKey.set(tmp, 0, 2);

However, when I looked at the key the reducer received from the mapper, it surprised me:

byte[] reducerKey = key.getBytes();

reducerKey looks like this:

reducerKey[0] -> 0x01;
reducerKey[1] -> 0x02;
reducerKey[2] -> 0x00;

Why is the tmp I set only 2 bytes long, while the array I get back is 3 bytes long?
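For reference, getBytes() returns the whole backing buffer, whose capacity can exceed the logical size; only the first getLength() bytes are meaningful. A minimal sketch of trimming the key inside the reducer (copyBytes() assumes a reasonably recent Hadoop release):

import java.util.Arrays;

// Inside the reducer: keep only the logically valid part of the key.
byte[] raw = key.getBytes();                        // backing buffer, 3 bytes in this example
byte[] valid = Arrays.copyOf(raw, key.getLength()); // the 2 meaningful bytes

// Equivalent convenience method on recent Hadoop versions:
byte[] alsoValid = key.copyBytes();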

Then I read the source code of BytesWritable.setSize(size) and found this:

public void setSize(int size) {
  if (size > getCapacity()) {
    setCapacity(size * 3 / 2);
  }
  this.size = size;
}

So when bytes are written into a BytesWritable, why does it allocate a backing byte[] of 1.5 * size (in my case 2 * 3 / 2 = 3, which is where the third byte comes from)? It looks like a waste of space to me, since the extra 0.5 * size is never used.


1 Answer


This is a common programming practice to amortize the cost of dynamic array resizes.
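As a rough illustration (plain Java, not Hadoop code), growing a buffer geometrically keeps the total copying work linear in the number of bytes appended:

import java.util.Arrays;

// Toy demonstration: appending n bytes into a buffer that grows by a factor
// of 1.5 copies only a small constant multiple of n bytes in total, whereas
// growing by exactly one slot per append would copy on the order of n*n/2 bytes.
public class GrowthDemo {
  public static void main(String[] args) {
    int n = 1_000_000;
    byte[] buf = new byte[1];
    int size = 0;
    long copied = 0;
    for (int i = 0; i < n; i++) {
      if (size == buf.length) {
        int newCapacity = Math.max(buf.length + 1, buf.length * 3 / 2);
        copied += size;                       // bytes moved by this resize
        buf = Arrays.copyOf(buf, newCapacity);
      }
      buf[size++] = 0x01;
    }
    System.out.println("appended " + n + " bytes, copied " + copied + " during resizes");
  }
}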

Now, why isn't this an issue, and why is it a good default behavior for Hadoop Writables?

  1. Writable objects are usually singletons since they can be, and are, reused. You usually want to size them to fit your largest buffer. Creating a new Writable each time wastes time and can put pressure on the GC, so it makes sense to make them a little larger than the largest buffer used so far.
  2. If you want to avoid the extra room, you can use the BytesWritable(byte[] bytes) constructor or setCapacity. Note that the constructor is much more efficient than set(), since it does not have to copy the data: only two references have to be set. (See the sketch after this list.)
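A minimal sketch of both points, assuming a mapper that emits BytesWritable keys (the class, field, and value names here are made up for illustration):

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative mapper: one reused BytesWritable instead of allocating per record.
public class ExampleMapper extends Mapper<Object, BytesWritable, BytesWritable, NullWritable> {

  // Point 1: reuse a single instance; it keeps its capacity across records.
  private final BytesWritable outputKey = new BytesWritable();

  @Override
  protected void map(Object key, BytesWritable value, Context context)
      throws java.io.IOException, InterruptedException {
    byte[] tmp = new byte[] { 0x01, 0x02 };

    // set() copies the data and may over-allocate (1.5x) when it has to grow.
    outputKey.set(tmp, 0, tmp.length);
    context.write(outputKey, NullWritable.get());

    // Point 2: the byte[] constructor just wraps the array, so
    // capacity == size and getBytes() has no trailing padding.
    BytesWritable exactKey = new BytesWritable(tmp);
    context.write(exactKey, NullWritable.get());
  }
}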
Answered 2013-04-12T06:24:08.613