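In my application I need to use years as keys. I think Text is the better fit for the key, since we usually group some measure by year, whereas IntWritable is used for the values we sum or average. But I also think we could use IntWritable as the type for the year, since a year can be represented as an int, so nothing prevents that, right? I would like to understand which is better suited as the key for a year: Text or IntWritable?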

2 Answers

Both are suitable, but there is an important difference when it comes to efficiency.

First, if you have a small number of records, what I'm about to discuss is probably so insignificant that it's not worth worrying about. However, if you plan to process terabytes of data, the cycles saved could add up to minutes.

As Amar points out in his answer, a Text will serialize the year value as a series of UTF-8 encoded characters. It actually writes a VInt for the number of bytes, then the bytes themselves. Years are typically 4 characters long, so a year will be serialized to 5 bytes of data (1 byte length, 4 bytes content).

An IntWritable is always serialized as 4 bytes, but those 4 bytes can hold numbers in the range +/- 2 billion, which is clearly overkill for your year needs (a 2-byte short holds roughly +/- 32k and a single byte holds -128 to 127).

So using a Text is less efficient by 1 byte when it comes to serializing the data (compared to an IntWritable).
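To make the byte counts above concrete, here is a minimal sketch in plain Java that mimics the two layouts without depending on Hadoop. The class and method names (YearKeySizes, textLike, intLike) are made up for illustration, and the Text-style layout assumes the string is short enough that its VInt length fits in a single byte.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Hadoop: it only imitates the byte layouts described above.
final class YearKeySizes {

    // Text-style layout: length prefix followed by the UTF-8 bytes.
    // Assumption: the encoded string is at most 127 bytes, so the length takes exactly 1 byte.
    static byte[] textLike(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        if (utf8.length > 127) {
            throw new IllegalArgumentException("sketch only handles short strings");
        }
        return ByteBuffer.allocate(1 + utf8.length)
                .put((byte) utf8.length)
                .put(utf8)
                .array();
    }

    // IntWritable-style layout: always 4 bytes, big-endian.
    static byte[] intLike(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    public static void main(String[] args) {
        System.out.println("Text-style bytes for \"2013\": " + textLike("2013").length);
        System.out.println("Int-style bytes for 2013: " + intLike(2013).length);
    }
}
```

Under these assumptions the Text-style key for a 4-digit year comes out one byte longer than the int-style key.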

The other thing to consider is how the raw comparators work for each type:

Text.Comparator skips over the VInt bytes denoting the length and then compares the characters byte by byte, so you need to get to the 5th byte to tell the years 2000 and 2001 apart (1 length byte, plus the difference is in the 4th character). But if the difference is in the first character (say between 1999 and 2000), the raw comparator has an answer after the 2nd byte.

IntWritable.Comparator reads the 4 bytes of each key and then does an int comparison, so even if you're comparing the numbers 123456789 and 1, it has to process all 4 bytes from each key before it can do the comparison.

So in summary, Text is more expensive to serialize but cheaper to compare.
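The two comparison strategies can be sketched in plain Java as follows. This is not Hadoop's code, only an imitation of the behaviour described above; all names (RawCompareSketch, textDecisionByte, and so on) are hypothetical, and the Text-style keys again assume a 1-byte length prefix.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the two raw comparison strategies; not Hadoop's implementation.
final class RawCompareSketch {

    // Text-style key: 1-byte length (assumes a short string) + UTF-8 bytes.
    static byte[] textLike(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(1 + utf8.length).put((byte) utf8.length).put(utf8).array();
    }

    // Int-style key: 4 bytes, big-endian.
    static byte[] intLike(int v) {
        return ByteBuffer.allocate(4).putInt(v).array();
    }

    // Byte-by-byte comparison that skips the length prefix at index 0.
    static int compareTextLike(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 1; i < n; i++) {
            int x = a[i] & 0xff, y = b[i] & 0xff; // compare as unsigned bytes
            if (x != y) return x - y;
        }
        return a.length - b.length;
    }

    // 1-based position of the byte at which the Text-style comparison is decided
    // (the shorter key's length if no content byte differs).
    static int textDecisionByte(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 1; i < n; i++) {
            if (a[i] != b[i]) return i + 1;
        }
        return n;
    }

    // Int-style comparison: always decodes all 4 bytes of both keys first.
    static int compareIntLike(byte[] a, byte[] b) {
        return Integer.compare(ByteBuffer.wrap(a).getInt(), ByteBuffer.wrap(b).getInt());
    }

    public static void main(String[] args) {
        System.out.println(textDecisionByte(textLike("2000"), textLike("2001"))); // 5
        System.out.println(textDecisionByte(textLike("1999"), textLike("2000"))); // 2
        System.out.println(compareIntLike(intLike(123456789), intLike(1)) > 0);   // true
    }
}
```

The byte-wise version can stop as soon as two bytes differ, while the int version has a fixed cost per comparison regardless of the values.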

You do have another option depending on your data domain: if, for example, you only need to represent years from 1970 onwards, you can use a ByteWritable to hold the number of years since 1970 (allowing you to represent the years 1970 - 2097). This costs only a single byte to serialize and a single byte to compare.
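The offset idea can be sketched like this in plain Java. The class name YearOffsetKey, the BASE_YEAR constant, and the encode/decode helpers are all hypothetical; in a real job the resulting byte would be wrapped in a ByteWritable.

```java
// Hypothetical helper illustrating the "years since 1970 in one byte" idea.
final class YearOffsetKey {

    // Assumed base year, taken from the example above.
    static final int BASE_YEAR = 1970;

    // Maps a year in 1970..2097 to an offset in 0..127 that fits in a single byte.
    static byte encode(int year) {
        int offset = year - BASE_YEAR;
        if (offset < 0 || offset > Byte.MAX_VALUE) {
            throw new IllegalArgumentException("year out of range: " + year);
        }
        return (byte) offset;
    }

    // Recovers the original year from the stored offset.
    static int decode(byte offset) {
        return BASE_YEAR + offset;
    }

    public static void main(String[] args) {
        byte key = encode(2013);
        System.out.println("stored offset: " + key + ", decoded year: " + decode(key));
    }
}
```

The range check matters: silently casting an offset above 127 to a byte would wrap around to a negative value and break the sort order of the keys.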

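If you need to represent a larger range, you could also use VIntWritable, which will be more efficient than IntWritable: values from -112 to 127 take a single byte, and larger values take one marker byte plus their significant bytes. A year stored as an offset from 1970 therefore needs 1 or 2 bytes up to the year 2225, and a raw year in the range 1970-9999 needs 3.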
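As a rough way to estimate the cost of a variable-length int, the sketch below computes the encoded size following the format documented for Hadoop's VInt encoding (one byte for values from -112 to 127, otherwise one marker byte followed by the significant bytes of the value). It is a stand-alone approximation written for this answer, with a hypothetical class name, not a call into Hadoop itself.

```java
// Hypothetical stand-alone size calculator for the VInt format described above.
final class VIntSizeSketch {

    // Number of bytes the variable-length encoding would use for the given value.
    static int vintSize(long value) {
        if (value >= -112 && value <= 127) {
            return 1; // small values are stored directly in a single byte
        }
        if (value < 0) {
            value = ~value; // negative values are stored as their one's complement
        }
        int significantBits = Long.SIZE - Long.numberOfLeadingZeros(value);
        return (significantBits + 7) / 8 + 1; // value bytes plus one marker byte
    }

    public static void main(String[] args) {
        System.out.println("raw year 2013: " + vintSize(2013) + " bytes");
        System.out.println("offset 2013 - 1970: " + vintSize(2013 - 1970) + " bytes");
        System.out.println("offset 200: " + vintSize(200) + " bytes");
    }
}
```

Under this encoding a raw 4-digit year costs 3 bytes, so the saving over a fixed 4-byte int is largest when the year is stored as a small offset from a base year.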

answered 2013-02-21T01:06:41.637

I believe that if IntWritable does the job for you, you should go with it. IntWritable is more lightweight than Text.

What I mean is, if you look at the implementations of the two classes, you will see that IntWritable has only one attribute:

private int value;

In the implementation of Text, you will see that it has 2 attributes up front:

private int length;
private byte[] bytes;

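Moreover, the Text class stores text using standard UTF8 encoding. It provides methods to serialize, deserialize, and compare texts at the byte level. The length is an integer and is serialized using a zero-compressed format. It also provides methods for traversing the string without converting the byte array to a String, along with utilities for serializing/deserializing a string, encoding/decoding a string, checking whether a byte array contains valid UTF8 code, and calculating the length of an encoded string.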

So, if you don't need all of that, why use the Text class? Go with IntWritable.

answered 2013-02-20T18:26:42.820