I have a lot of objects with a language code as a key field. Since both Java and MongoDB use UTF-8 natively, and since the language codes are ASCII, it seems to me that they should take 1 byte per character plus the \0 terminator. So the language code "en" should take only 3 bytes in the BSON object and in the index.

Is this correct? I am wondering whether I save anything by converting my fields to a byte array like:

byte[] lcBytes = langCode.getBytes("ISO-8859-1");

before saving them to MongoDB with the Java driver?

1 Answer

According to the BSON spec, it makes no difference:

string  ::= int32 (byte*) "\x00"
binary  ::= int32 subtype (byte*)

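In other words, strings are null-terminated (thereby wasting one byte), while binary data needs a one-byte subtype field. Both carry the same int32 length prefix, so the overhead is identical.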
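To make the arithmetic concrete, here is a minimal sketch that applies the two grammar rules above by hand. It does not use the MongoDB driver; the class and method names are hypothetical, and it only counts the value part of an element (the type byte and field name are the same in both cases).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: computes BSON value sizes from the grammar, not from the driver.
final class BsonSizeSketch {

    // string ::= int32 (byte*) "\x00"
    static int stringValueSize(String s) {
        int payload = s.getBytes(StandardCharsets.UTF_8).length;
        return 4 + payload + 1; // int32 length prefix + UTF-8 bytes + null terminator
    }

    // binary ::= int32 subtype (byte*)
    static int binaryValueSize(byte[] data) {
        return 4 + 1 + data.length; // int32 length prefix + subtype byte + raw bytes
    }

    public static void main(String[] args) {
        String langCode = "en";
        byte[] lcBytes = langCode.getBytes(StandardCharsets.ISO_8859_1);

        // For pure ASCII both encodings yield the same payload, so the sizes match.
        System.out.println("string: " + stringValueSize(langCode));
        System.out.println("binary: " + binaryValueSize(lcBytes));
    }
}
```

For "en" both methods return 7 bytes, so the value costs more than the 3 bytes assumed in the question, and converting to a byte array saves nothing.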

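Of course, a perfectly matching character set might be more efficient because the byte array itself might be smaller (for example, characters you need often would take one byte instead of three). Then again, I don't think it's worth the trouble, because it makes it impossible to use regular expressions, map/reduce, JS functions, and so on. Maybe for very archaic character sets, but 8859-1 isn't too special.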

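As a side note, keep in mind that the index key size is limited to about 1k, so you can't put very long strings in an index (which wouldn't be a good idea performance-wise anyway).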

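If you only need to query by equality, maybe you could use a hash instead? If you need to store very large strings (non-indexed), a compression algorithm might be a good idea.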
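Both suggestions can be sketched with the JDK alone. The choice of SHA-256 and GZIP here is an assumption for illustration, not something the answer prescribes, and the class and method names are hypothetical. The resulting byte arrays would then be stored as binary fields.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for long strings: hash for indexed equality lookups, gzip for storage.
final class LongStringSketch {

    // Fixed-size (32-byte) digest; index this instead of the long string itself.
    static byte[] sha256(String s) {
        try {
            return MessageDigest.getInstance("SHA-256")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Compress the UTF-8 bytes of a large, non-indexed string.
    static byte[] gzip(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(s.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Reverse of gzip(): decompress and decode back to a String.
    static String gunzip(byte[] compressed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = gz.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

An equality query would then compare the stored digest against `sha256(candidate)`. The trade-off is the same as with byte arrays in general: range queries and regular expressions no longer work on the hashed field.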

answered 2012-06-19T00:27:45.963