java - Efficient way to process a byte array containing mixed encodings

Question

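I have some data in a byte array, retrieved earlier from a network session using non-blocking IO (to facilitate multiple channels).

The format of the data is essentially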

varint: length of text
UTF-8: the text

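I'm trying to find an efficient way to extract the text, given that its starting position is not fixed (the varint is variable-length). I have something that is very close, except for one small problem. Here it is: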

import com.clearspring.analytics.util.Varint;
// Some fields for your info
private final byte replyBuffer[] = new byte[32768];
private static final Charset UTF8 = Charset.forName ("UTF-8");

// ...
  // Code which extracts the text
    ByteArrayInputStream byteInputStream = new ByteArrayInputStream(replyBuffer);
    DataInputStream inputStream = new DataInputStream(byteInputStream);
    int textLengthBytes;

    try {
      textLengthBytes = Varint.readSignedVarInt (inputStream);
    }
    catch (IOException e) {
     // I don't think we should ever get an IOException when using the
     // ByteArrayInputStream class
       throw new RuntimeException ("Unexpected IOException", e);
    }
    int offset = byteInputStream.pos(); // ** Here lies the problem **
    String textReceived = new String (replyBuffer, offset, textLengthBytes, UTF8);

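The idea is that the offset of the text in the buffer is given by byteInputStream.pos(). However, pos is a protected member, so this does not compile.

It seems to me that the only way to get at the "rest" of the data after decoding the varint is to use something that copies it all into another buffer, which strikes me as rather wasteful.

Constructing the string directly from the underlying buffer should be fine, because after this point I no longer care about the state of byteInputStream or inputStream. So I'm trying to work out a way to calculate the offset, or in other words, how many bytes Varint.readSignedVarInt consumed. Perhaps there is an efficient way to convert the integer value returned by Varint.readSignedVarInt into the number of bytes it occupies in the encoding?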

score 1 · Accepted Answer

There are a few ways to find the offset of the string within the byte array:

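You can create a subclass of ByteArrayInputStream that gives you access to the pos field. It has protected access precisely so that subclasses can use it.
If you want something more generally applicable, create a subclass of FilterInputStream that counts the number of bytes that have been read. This is more work, though, and probably not worth the effort.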
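
A minimal sketch of these two options might look like the following. The class names PositionByteArrayInputStream and CountingInputStream and their accessor methods are made up for illustration; only the java.io base classes are real. The counter deliberately ignores mark/reset, which would need extra bookkeeping.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical subclass: exposes the protected pos field of ByteArrayInputStream.
class PositionByteArrayInputStream extends ByteArrayInputStream {
    PositionByteArrayInputStream(byte[] buf) {
        super(buf);
    }

    // Index in the backing array of the next byte that read() would return.
    int position() {
        return pos;
    }
}

// Hypothetical wrapper: counts the bytes consumed from any InputStream.
// mark/reset are not accounted for in this sketch.
class CountingInputStream extends FilterInputStream {
    private long count;

    CountingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b >= 0) {
            count++; // -1 signals end of stream and is not counted
        }
        return b;
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        int n = super.read(b, off, len);
        if (n > 0) {
            count += n;
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = super.skip(n);
        count += skipped;
        return skipped;
    }

    // Total number of bytes read or skipped so far.
    long bytesRead() {
        return count;
    }
}
```

With the first class, the question's code would construct a PositionByteArrayInputStream instead of a ByteArrayInputStream, wrap it in the DataInputStream as before, and call position() after Varint.readSignedVarInt returns. This relies on DataInputStream reading single bytes straight through without buffering ahead.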

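Count the number of bytes that encode the varint. There are at most 5. Every byte except the last has its high bit set, so it is negative when read as a Java byte: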

int offset = 0; while (replyBuffer[offset++] < 0);
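
The loop above trusts the buffer contents: if the data is truncated or corrupt it can run past the end of the array. A slightly more defensive variant, sketched here with a hypothetical helper named varintLength, caps the scan at 5 bytes and at a caller-supplied limit (the number of valid bytes in the buffer).

```java
final class VarintScan {
    private VarintScan() {}

    // Returns how many bytes the varint starting at buf[start] occupies.
    // limit is the exclusive end of the valid data in buf.
    static int varintLength(byte[] buf, int start, int limit) {
        int maxEnd = Math.min(limit, start + 5); // a 32-bit varint uses at most 5 bytes
        for (int i = start; i < maxEnd; i++) {
            if (buf[i] >= 0) { // high bit clear: this is the last byte of the varint
                return i - start + 1;
            }
        }
        throw new IllegalArgumentException(
                "truncated or oversized varint at offset " + start);
    }
}
```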

Calculate the number of bytes needed to encode the varint. Each byte carries 7 bits of payload, so you can take the position of the highest 1 bit, divide by 7, and add 1.

// "zigzag" encoding required since you store the length as signed
int textLengthUnsigned = (textLengthBytes << 1) ^ (textLengthBytes >> 31);
int offset = (31 - Integer.numberOfLeadingZeros(textLengthUnsigned)) / 7 + 1;
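
To see the size calculation work end to end, here is a small self-contained sketch. The class VarintOffsetDemo and its methods are hypothetical, and the hand-rolled encoder in frame is only assumed to match the wire format that Varint.readSignedVarInt expects (zigzag followed by base-128 groups, least significant group first); it stands in for the library so the snippet has no external dependency.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class VarintOffsetDemo {
    private VarintOffsetDemo() {}

    // Number of bytes a zigzag-encoded signed 32-bit varint occupies,
    // computed from the decoded value alone.
    static int signedVarintSize(int value) {
        int zigzag = (value << 1) ^ (value >> 31);
        return (31 - Integer.numberOfLeadingZeros(zigzag)) / 7 + 1;
    }

    // Builds "varint length + UTF-8 text" with a hand-rolled encoder
    // (an assumption about the wire format, not the library's own code).
    static byte[] frame(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        int zigzag = (utf8.length << 1) ^ (utf8.length >> 31);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((zigzag & ~0x7F) != 0) {
            out.write((zigzag & 0x7F) | 0x80); // 7 payload bits plus continuation flag
            zigzag >>>= 7;
        }
        out.write(zigzag); // final byte, high bit clear
        out.write(utf8, 0, utf8.length);
        return out.toByteArray();
    }

    // Extracts the text straight from the buffer, given the decoded length.
    static String extract(byte[] buf, int textLengthBytes) {
        int offset = signedVarintSize(textLengthBytes);
        return new String(buf, offset, textLengthBytes, StandardCharsets.UTF_8);
    }
}
```

Because the length is recovered from the value already returned by the decoder, no second pass over the buffer and no copy into another array is needed.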
