Recently, I have been reading Hadoop: The Definitive Guide. I have two questions:

1. I came across this code for a custom Partitioner:

public class KeyPartitioner extends Partitioner<TextPair, Text> {

    @Override
    public int getPartition(TextPair key, Text value, int numPartitions) {
        return (key.getFirst().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

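What does the & Integer.MAX_VALUE do here? Why use the & operator?

2. I also want to write a custom partitioner for IntWritable. Is it acceptable, and is it the best approach, to simply use key.value % numPartitions?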

1 Answer

As I already wrote in the comments, it is used to keep the resulting integer non-negative.

Let's use a simple example using Strings:

String h = "Hello I'm negative!";
int hashCode = h.hashCode();

hashCode is negative with the value of -1937832979.

If you take this value modulo a positive number (> 0) that denotes the number of partitions, the result is never positive: in Java the remainder takes the sign of the dividend, so it is either negative or zero.

System.out.println(hashCode % 5); // yields -4

Since a partition index can never be negative, you need to make sure the number is non-negative. This is where a simple bit-twiddling trick comes into play: Integer.MAX_VALUE (0x7FFFFFFF) has every bit set except the sign bit, which is the most significant bit of Java's two's-complement int and is 1 only for negative numbers.

So if you have a negative number, its set sign bit is ANDed with the zero in that position of Integer.MAX_VALUE, and the sign bit of the result is always zero. All other bits pass through unchanged, so the result is non-negative.
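To make the masking step concrete, here is a minimal sketch in plain Java with no Hadoop dependency. The class name SignMaskDemo and the helpers maskedPartition and bits are made up for this illustration; the hash value is the one quoted above for the example string.

```java
// Illustrative sketch only: shows how ANDing with Integer.MAX_VALUE clears the sign bit.
final class SignMaskDemo {

    // Hypothetical helper that mirrors the expression from the question.
    static int maskedPartition(int hash, int numPartitions) {
        return (hash & Integer.MAX_VALUE) % numPartitions;
    }

    // Renders an int as a zero-padded 32-bit binary string.
    static String bits(int value) {
        return String.format("%32s", Integer.toBinaryString(value)).replace(' ', '0');
    }

    public static void main(String[] args) {
        int hash = -1937832979; // the negative hash value quoted in the answer

        System.out.println(bits(hash));                     // leading bit is 1 (negative)
        System.out.println(bits(Integer.MAX_VALUE));        // 0 followed by 31 ones
        System.out.println(bits(hash & Integer.MAX_VALUE)); // leading bit is now 0

        System.out.println(maskedPartition(hash, 5));              // 4
        System.out.println(maskedPartition(Integer.MIN_VALUE, 5)); // 0
    }
}
```

Note that the mask does not compute the absolute value; it only drops the sign bit, so a negative hash maps to a different non-negative number than Math.abs would give. For partitioning that does not matter, as long as the same key always lands in the same partition.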

You can make it more readable though:

return Math.abs(key.getFirst().hashCode() % numPartitions);

For example, I have done that in Apache Hama's partitioner for arbitrary objects:

 @Override
 public int getPartition(K key, V value, int numTasks) {
    return Math.abs(key.hashCode() % numTasks);
 }
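One detail worth spelling out is the order of operations in the Math.abs variant. The sketch below, in plain Java with made-up names (AbsPartitionDemo, absAfterMod, absBeforeMod, floorModPartition), compares taking the absolute value after the modulo, as in the code above, with taking it before. It also shows Math.floorMod (available since Java 8) as a possible alternative, which is not something the original code uses.

```java
// Illustrative sketch only: compares ways of turning a hash into a partition index.
final class AbsPartitionDemo {

    // Same shape as the code above: the remainder lies strictly between
    // -n and n, so Math.abs can never overflow here.
    static int absAfterMod(int hash, int n) {
        return Math.abs(hash % n);
    }

    // Pitfall: Math.abs(Integer.MIN_VALUE) is still Integer.MIN_VALUE,
    // so this ordering can return a negative index.
    static int absBeforeMod(int hash, int n) {
        return Math.abs(hash) % n;
    }

    // Alternative: for positive n, floorMod always returns a value in [0, n).
    static int floorModPartition(int hash, int n) {
        return Math.floorMod(hash, n);
    }

    public static void main(String[] args) {
        int worstCase = Integer.MIN_VALUE;
        System.out.println(absAfterMod(worstCase, 5));       // 3
        System.out.println(absBeforeMod(worstCase, 5));      // -3
        System.out.println(floorModPartition(worstCase, 5)); // 2
    }
}
```

The three approaches can assign the same key to different partitions, which is fine as long as one of them is used consistently. The same reasoning should carry over to a plain int key, such as the value returned by IntWritable.get() in the second question: a bare % is only safe if the keys are known to be non-negative.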
answered 2013-05-18T16:10:23.150