java - Reading a UTF-8 string from a ByteBuffer where the length is an unsigned int

Question

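I am trying to read a UTF8 string via a java.nio.ByteBuffer. The size is an unsigned int, which, of course, Java does not have. I have read the value into a long so that I have the full value.

The next problem is that I cannot create a byte array with a long as the size, and casting the long back to an int makes it signed.

I also tried using limit() on the buffer, but that also takes an int, not a long.

The specific thing I am doing is reading UTF8 strings out of a class file, so the buffer holds more than just the UTF8 string.

Any ideas on how to read a UTF8 string whose length could be a full unsigned int from a ByteBuffer?

EDIT: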

SourceDebugExtension_attribute {
    u2 attribute_name_index;
    u4 attribute_length;
    u1 debug_extension[attribute_length];
}

attribute_name_index
    The value of the attribute_name_index item must be a valid index into the constant_pool table. The constant_pool entry at that index must be a CONSTANT_Utf8_info structure representing the string "SourceDebugExtension".

attribute_length
    The value of the attribute_length item indicates the length of the attribute, excluding the initial six bytes. The value of the attribute_length item is thus the number of bytes in the debug_extension[] item.

debug_extension[]
    The debug_extension array holds a string, which must be in UTF-8 format. There is no terminating zero byte.

    The string in the debug_extension item will be interpreted as extended debugging information. The content of this string has no semantic effect on the Java Virtual Machine.

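So, from a technical point of view, a class file can contain a string whose length is a full u4 (unsigned, 4 bytes).

None of this would be an issue if there were a limit on the size of a UTF8 string (I am no UTF8 expert, so perhaps there is such a limit).

I could just punt on it and accept the reality that there is never going to be a string that long...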

score 6 · Accepted Answer

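Unless your byte array is larger than 2GB (the largest positive value of a Java int), you won't have a problem casting the long back to a signed int.

If your byte array needs to be longer than 2GB, you are doing it wrong, not least because that is far more than the default maximum heap size of the JVM...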
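A minimal sketch of that cast, assuming the length is stored as a big-endian u4 (the default byte order of a ByteBuffer, and the order used in class files). The names `U4Length` and `readLength` are made up for illustration; `Integer.toUnsignedLong` and `Math.toIntExact` are standard JDK methods available since Java 8.

```java
import java.nio.ByteBuffer;

final class U4Length {
    /** Reads a u4 at the buffer's current position and returns it as a non-negative int. */
    static int readLength(ByteBuffer buf) {
        // getInt() returns the raw 32 bits as a signed int; widen them without sign extension.
        long unsigned = Integer.toUnsignedLong(buf.getInt());
        // Throws ArithmeticException if the value does not fit in an int (above 2^31 - 1).
        return Math.toIntExact(unsigned);
    }

    public static void main(String[] args) {
        ByteBuffer small = ByteBuffer.wrap(new byte[] {0, 0, 0, 5});
        System.out.println(readLength(small));

        ByteBuffer huge = ByteBuffer.wrap(
                new byte[] {(byte) 0xFF, (byte) 0xFF, (byte) 0xFF, (byte) 0xFF});
        try {
            readLength(huge);
        } catch (ArithmeticException e) {
            System.out.println("length does not fit in an int");
        }
    }
}
```

Failing loudly on an oversized length is a design choice here; silently truncating the value would hide a corrupt or hostile input.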

score 1 · Accepted Answer

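Having a signed int won't be your main problem. Say you had a String that was 4 billion in length. You would need a ByteBuffer of at least 4 GB and a byte[] of at least 4 GB. When you convert this to a String, you need at least 8 GB (2 bytes per character) plus a StringBuilder to build it (also at least 8 GB). All up, you need 24 GB to process one String. Even if you have a lot of memory, you won't fit many Strings of this size.

Another approach is to read the length as a signed int and treat a negative value (that is, an unsigned value above 2^31-1) as an error, since you won't have enough memory to process the String in any case. Even to handle a String of 2 billion (2^31-1) in length, you would need 12 GB to convert it to a String this way.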
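The signed-length approach could look roughly like the sketch below. It assumes a u4 length immediately followed by that many bytes, and that standard UTF-8 decoding is acceptable for the content; `SignedLengthReader` and `readString` are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class SignedLengthReader {
    /** Reads a u4 length followed by that many UTF-8 bytes, starting at the buffer's position. */
    static String readString(ByteBuffer buf) {
        int length = buf.getInt();
        if (length < 0) {
            // High bit set: the unsigned value is above 2^31 - 1, too large for a Java array.
            throw new IllegalArgumentException(
                    "string length too large: " + Integer.toUnsignedString(length));
        }
        if (length > buf.remaining()) {
            throw new IllegalArgumentException(
                    "length " + length + " exceeds the " + buf.remaining() + " remaining bytes");
        }
        byte[] bytes = new byte[length];
        buf.get(bytes); // advances the buffer past the string
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

Checking the length against `remaining()` before allocating also avoids allocating a huge array for a truncated or corrupt file.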

score 1 · Accepted Answer

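Java arrays are indexed with an int (a Java int, i.e. signed) as per the language spec, so it is impossible to have a String (which is backed by an array) longer than Integer.MAX_VALUE.

But even that much is far too much to process in one chunk: it will kill performance and make your program fail with an OutOfMemoryError on most machines if a sufficiently large string is ever encountered.

What you should do is process any string in chunks of a sensible size, say a few megabytes at a time. Then there is no practical limit on the size you can deal with.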
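One way to process the text in chunks is to drive a `CharsetDecoder` by hand with a small, reused output buffer. The sketch below assumes the bytes are already reachable through the ByteBuffer (for example a memory-mapped file) and that the byte count fits in an int; `ChunkedUtf8`, `decode`, and the `sink` callback are illustrative names, not part of any library.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

final class ChunkedUtf8 {
    /**
     * Decodes byteCount bytes of UTF-8 starting at the buffer's position and
     * hands the text to sink in pieces of at most chunkChars chars.
     */
    static void decode(ByteBuffer buf, int byteCount, int chunkChars,
                       Consumer<CharSequence> sink) {
        if (chunkChars < 2) {
            // A character outside the BMP decodes to two chars and must fit in one chunk.
            throw new IllegalArgumentException("chunkChars must be at least 2");
        }
        ByteBuffer in = buf.slice(); // shares content, independent position and limit
        in.limit(byteCount);
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        CharBuffer out = CharBuffer.allocate(chunkChars);
        CoderResult result;
        do {
            // Fills out until it is full (overflow) or the input is exhausted (underflow).
            result = decoder.decode(in, out, true);
            if (result.isError()) {
                throw new IllegalArgumentException("invalid UTF-8 input: " + result);
            }
            emit(out, sink);
        } while (result.isOverflow());
        decoder.flush(out); // completes the decoder protocol
        emit(out, sink);
        buf.position(buf.position() + byteCount); // skip past the decoded bytes
    }

    private static void emit(CharBuffer out, Consumer<CharSequence> sink) {
        out.flip();
        if (out.hasRemaining()) {
            sink.accept(out.toString());
        }
        out.clear();
    }
}
```

The decoder never splits a multi-byte sequence or a surrogate pair across chunks, which is the main thing a naive "cut the byte array every N bytes" approach gets wrong.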

score 0 · Accepted Answer

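I guess you could implement CharSequence on top of a ByteBuffer. That would let you keep your "String" off the heap, although most utilities that deal with characters actually expect a String. And even then, CharSequence has a limit as well: it expects the size to be returned as an int.

(You could theoretically create a new version of CharSequence that returns the size as a long, but then nothing in Java would help you deal with that CharSequence. It might be useful if you implemented subSequence(...) to return an ordinary CharSequence.)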
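A rough sketch of such a CharSequence view follows. It makes a strong simplifying assumption that is not in the answer above: every byte is plain ASCII, so char index and byte index coincide. Real UTF-8 has variable-width characters, so `charAt` would need decoding or an index to work in general. `AsciiBufferSequence` is a hypothetical name.

```java
import java.nio.ByteBuffer;

/** Read-only CharSequence view over ASCII bytes in a ByteBuffer (assumption: ASCII only). */
final class AsciiBufferSequence implements CharSequence {
    private final ByteBuffer view; // shares content with the source, copies nothing
    private final int length;

    AsciiBufferSequence(ByteBuffer source, int offset, int length) {
        ByteBuffer dup = source.duplicate(); // leave the caller's position and limit alone
        dup.limit(offset + length);
        dup.position(offset);
        this.view = dup.slice();
        this.length = length;
    }

    @Override
    public int length() {
        return length; // still an int, as CharSequence requires
    }

    @Override
    public char charAt(int index) {
        byte b = view.get(index); // absolute get; throws IndexOutOfBoundsException if out of range
        if (b < 0) {
            throw new IllegalStateException("non-ASCII byte at index " + index);
        }
        return (char) b;
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        if (start < 0 || end > length || start > end) {
            throw new IndexOutOfBoundsException("start=" + start + ", end=" + end);
        }
        return new AsciiBufferSequence(view, start, end - start);
    }

    @Override
    public String toString() {
        // Only here is the text copied onto the heap as a String.
        return new StringBuilder(length).append(this, 0, length).toString();
    }
}
```

For example, `new AsciiBufferSequence(buffer, offset, len)` can be passed to APIs that accept a CharSequence, such as `java.util.regex.Pattern.matcher`, without first building a String.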
