mongodb - Can I save space in my Mongodb indexes by converting ASCII strings to bytes?

Question

I have a lot of objects with a language code as a key field. Since both Java and MongoDB use UTF-8 natively, and since the language codes are ASCII, it seems to me that they should take 1 byte per character plus the \0 terminator. So the language code "en" should take only 3 bytes in the BSON object and in the index.

Is this correct? I am wondering whether I save anything by converting my fields to a byte array like:

byte[] lcBytes = langCode.getBytes("ISO-8859-1");

before saving them to MongoDB with the Java driver?

score 3 · Accepted Answer

According to the BSON spec, it makes no difference:

string  ::= int32 (byte*) "\x00"
binary  ::= int32 subtype (byte*)

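In other words, a string is zero-terminated (so one byte is spent on the terminator), while a binary value needs a one-byte subtype field. Both also carry a four-byte int32 length prefix, so the overhead is the same.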
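As a rough check, the two grammar rules can be turned into a byte count. This is a minimal sketch: `BsonValueSizes` and its methods are hypothetical helpers written for illustration, not part of the MongoDB Java driver. They count only the value part of a BSON element, leaving out the one-byte type tag and the field name, which are the same for both encodings.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not a driver API): applies the two BSON grammar
// rules quoted in the answer to count the bytes of an encoded value.
final class BsonValueSizes {

    // string ::= int32 (byte*) "\x00"
    // 4-byte length prefix + UTF-8 payload + 1-byte terminator.
    static int stringValueSize(String s) {
        return 4 + s.getBytes(StandardCharsets.UTF_8).length + 1;
    }

    // binary ::= int32 subtype (byte*)
    // 4-byte length prefix + 1-byte subtype + raw payload.
    static int binaryValueSize(byte[] payload) {
        return 4 + 1 + payload.length;
    }

    public static void main(String[] args) {
        String langCode = "en";
        byte[] lcBytes = langCode.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("string: " + stringValueSize(langCode) + " bytes");
        System.out.println("binary: " + binaryValueSize(lcBytes) + " bytes");
    }
}
```

Under these rules "en" comes to 7 bytes as a value in either form, not 3, because of the length prefix. The sketch covers the BSON document encoding only; how the index stores keys internally is not modeled here.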

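Of course, a perfectly matching character set might be more efficient, because the byte array itself can be smaller (for example, characters you need often might take one byte instead of three). Then again, I don't think it is worth the trouble, because it makes it impossible to use regular expressions, map/reduce, JS functions, and so on. Maybe for some very old character set, but ISO-8859-1 is nothing special.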
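To see when the choice of character set changes the payload size, the encoded lengths can be compared directly. This is a small sketch with a hypothetical `CharsetSizes` class. It uses only charsets that every Java platform must provide, so it shows a two-byte versus one-byte case rather than the three-byte case mentioned in the answer.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: compares encoded lengths of the same string.
final class CharsetSizes {

    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int latin1Length(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1).length;
    }

    public static void main(String[] args) {
        // Pure ASCII: both encodings produce the same bytes.
        System.out.println("en   utf8=" + utf8Length("en")
                + " latin1=" + latin1Length("en"));
        // U+00E9 (e with acute) takes two bytes in UTF-8, one in ISO-8859-1.
        String cafe = "caf\u00e9";
        System.out.println("cafe utf8=" + utf8Length(cafe)
                + " latin1=" + latin1Length(cafe));
    }
}
```

For ASCII language codes such as "en" the two lengths are identical, so converting to ISO-8859-1 bytes saves nothing.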

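As a side note, keep in mind that the size of an indexed value is limited to about 1 KB, so you cannot put very long strings into an index (which would not be a good idea performance-wise anyway).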

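If you only need to query by equality, maybe you could index a hash instead? If you need to store very large strings (not indexed), a compression algorithm might be a good idea.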
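The hash idea could look roughly like this. It is a sketch under assumptions: the answer does not name a hash function, so SHA-256 is chosen here only because every Java platform provides it, and `EqualityHash` is a hypothetical helper. The digest would be stored in its own indexed field next to the original string, and equality queries would match on the digest.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: derives a fixed-size key from an arbitrarily long string.
final class EqualityHash {

    // Returns the 32-byte SHA-256 digest of the UTF-8 encoded value.
    static byte[] sha256(String value) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            return md.digest(value.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is mandatory for Java platform implementations.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    public static void main(String[] args) {
        // Build a 5000-character string, far too long for an index key.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 5000; i++) {
            sb.append('x');
        }
        byte[] key = sha256(sb.toString());
        System.out.println("indexed key length: " + key.length + " bytes");
    }
}
```

The digest has the same length regardless of the input, so it stays well under the index limit. A hash only supports equality lookups, not range or prefix queries, and comparing the full string after a match guards against the remote possibility of a collision.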
