java - Text.getBytes() returns unexpected results

Question

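I'm getting behavior out of the Text constructor that makes no sense to me. If I construct a Text object from a String, it is not equal to another Text object constructed from bytes, even though getBytes() returns the same value for both objects.

So we get weird results like this: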

//This succeeds
assertEquals(new Text("ACTACGACCA_0"), new Text("ACTACGACCA_0")); 
//This succeeds
assertEquals((new Text("ACTACGACCA_0")).getBytes(), (new Text("ACTACGACCA_0")).getBytes()); 
//This fails.  Why?
assertEquals(new Text((new Text("ACTACGACCA_0")).getBytes()), new Text("ACTACGACCA_0"));

This showed up when I tried to access a HashMap. Here I'm trying to look up an entry based on the value returned by org.apache.hadoop.hbase.KeyValue.getRow():

//This succeeds
assertEquals((new Text("ACTACGACCA_0")).getBytes(), keyValue.getRow()); 
//This returns a value
hashMap.get(new Text("ACTACGACCA_0"));  
//This returns null.  Why?
hashMap.get(new Text(keyValue.getRow()));

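So what is going on here, and how do I deal with it? Does this have something to do with encoding?

Update: Problem solved

Thanks to Chris for pointing me in the right direction. A bit of background: the keyValue object was captured (using a Mockito ArgumentCaptor) from a call to htable.put(). Basically, I had this code: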

byte[] keyBytes = matchRow.getKey().getBytes();
RowLock rowLock = hTable.lockRow(keyBytes);                         

Get get = new Get(keyBytes, rowLock);               
SetWritable<Text> toWrite = new SetWritable<Text>(Text.class);
toWrite.getValues().addAll(matchRow.getMatches(hTable, get));

Put put = new Put(keyBytes, rowLock);
put.add(Bytes.toBytes(MatchesByHaplotype.MATCHING_COLUMN_FAMILY), Bytes.toBytes(MatchesByHaplotype.UID_QUALIFIER), 
        SERIALIZATION_HELPER.serialize(toWrite));               
hTable.put(put);

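where matchRow.getKey() returns a Text object. Do you see the problem? I was adding all of the bytes, including the invalid ones past the end of the data. So I wrote a small helper function that returns only the valid bytes: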

public byte[] getValidBytes(Text text) {
    return Arrays.copyOf(text.getBytes(), text.getLength());
}

and changed the first line of that block to:

byte[] keyBytes = SERIALIZATION_HELPER.getValidBytes(matchRow.getKey());

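Problem solved! In retrospect: wow, what a nasty bug. I think it comes down to Text.getBytes() behaving in a very unfriendly way. Not only does it return something you might not expect (invalid bytes), but the Text object also has no function that returns only the valid bytes. You would think that is a common use case. Maybe they will add it in the future?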

score 3 · Accepted Answer

The following fails for the same reason:

Assert.assertEquals((new Text("ACTACGACCA_0")).getLength(), (new Text("ACTACGACCA_0")).getBytes().length);

getBytes() returns the backing byte array, but according to the API the bytes are only valid up to Text.getLength():

Text.getBytes()
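To make the failed lookup concrete, here is a minimal sketch that does not use Hadoop at all. It assumes a backing array with one unused trailing byte (simulated by the hypothetical paddedBacking helper) and uses ByteBuffer keys as a stand-in for Text keys, since both compare by content. The class and method names are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

class BackingArrayKeyDemo {
    // Simulates a backing array that is larger than the valid data:
    // the valid bytes followed by `slack` unused zero bytes. (Assumption:
    // this only mimics what Text.getBytes() can return.)
    static byte[] paddedBacking(String s, int slack) {
        byte[] valid = s.getBytes(StandardCharsets.UTF_8);
        return Arrays.copyOf(valid, valid.length + slack);
    }

    // Same idea as the getValidBytes helper in the question:
    // keep only the first validLength bytes.
    static byte[] validBytes(byte[] backing, int validLength) {
        return Arrays.copyOf(backing, validLength);
    }

    public static void main(String[] args) {
        String row = "ACTACGACCA_0";
        int validLength = row.getBytes(StandardCharsets.UTF_8).length;
        byte[] backing = paddedBacking(row, 1);

        Map<ByteBuffer, String> map = new HashMap<>();
        map.put(ByteBuffer.wrap(row.getBytes(StandardCharsets.UTF_8)), "match");

        // Key built from the whole backing array: one extra byte, so no hit.
        System.out.println(map.get(ByteBuffer.wrap(backing)));
        // Key built from only the valid bytes: the entry is found.
        System.out.println(map.get(ByteBuffer.wrap(validBytes(backing, validLength))));
    }
}
```

A key built from the full backing array carries an extra byte, so it is a different key. That matches what new Text(keyValue.getRow()) does when the row was written from the untrimmed getBytes() result.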

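And yes, this does have something to do with encoding: the CharsetEncoder.encode method uses a ByteBuffer whose size is initially allocated as 12 * 1.1 bytes (13), but the number of valid bytes is still only 12, since you are using ASCII characters only.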
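The encoder behavior can be observed with plain JDK classes. The sketch below assumes the buffer returned by CharsetEncoder.encode is array-backed, which is an implementation detail rather than a documented guarantee, and the exact capacity may vary between JDK versions. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class EncoderSlackDemo {
    // Encodes s as UTF-8. The returned buffer's limit() is the number of
    // valid bytes; its backing array may be longer than that.
    static ByteBuffer encode(String s) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        try {
            return encoder.encode(CharBuffer.wrap(s));
        } catch (CharacterCodingException e) {
            throw new IllegalStateException("input could not be encoded", e);
        }
    }

    // Copies only the valid region of the buffer's backing array.
    static byte[] validBytes(ByteBuffer bb) {
        int start = bb.arrayOffset() + bb.position();
        int end = bb.arrayOffset() + bb.limit();
        return Arrays.copyOfRange(bb.array(), start, end);
    }

    public static void main(String[] args) {
        ByteBuffer bb = encode("ACTACGACCA_0");
        System.out.println("valid bytes:   " + bb.limit());
        System.out.println("backing array: " + bb.array().length);
        System.out.println(new String(validBytes(bb), StandardCharsets.UTF_8));
    }
}
```

If the backing array comes out longer than the limit, that surplus is the slack described above. Trimming to the limit plays the same role as trimming Text.getBytes() to Text.getLength().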
