c++ - Very fast serialization with random access in C++

Question

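In C++11, what can I use to serialize data quickly into multiple files? (To avoid redundant data, I assume I will split the data into multiple tables and join them on their id numbers.)

I am considering:

Plain binary files accessed with fstream.read() and fstream.write().
Plain binary files accessed with mmap.
Google protobuf (if I can access random elements without iterating over all of them).

All tables will consist of columns with the following data types: uint8, uint16, uint32, uint64, string.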

score 1 · Accepted Answer

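Fast random access will be the challenge here. The simplest way to achieve it is to keep every row the same size. There is no easy way to do that with protobufs unless you assume a conservative maximum size. With either of your first two options it should be relatively easy (assuming you can put a reasonable limit on the size of strings).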
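As a minimal sketch of the fixed-size-row idea with fstream: every row is packed into the same number of bytes, so row i lives at byte offset i * row size. The `Row` layout, the field names, and the 32-byte string cap are assumptions made for illustration, and values are stored in host byte order, so the file is not portable across architectures.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <fstream>
#include <string>

// Hypothetical row layout: the field names and the 32-byte string cap
// are assumptions made for this sketch.
constexpr std::size_t kNameCap = 32;
constexpr std::size_t kRowSize = 8 + 4 + 2 + 1 + 1 + kNameCap;  // 48 bytes

struct Row {
    std::uint64_t id;
    std::uint32_t count;
    std::uint16_t kind;
    std::uint8_t flag;
    std::string name;  // truncated to kNameCap bytes on write
};

// Packs a row into exactly kRowSize bytes (host byte order, zero padded).
void encode(const Row& r, char* buf) {
    std::memset(buf, 0, kRowSize);
    std::memcpy(buf, &r.id, 8);
    std::memcpy(buf + 8, &r.count, 4);
    std::memcpy(buf + 12, &r.kind, 2);
    buf[14] = static_cast<char>(r.flag);
    const std::size_t len = std::min(r.name.size(), kNameCap);
    buf[15] = static_cast<char>(len);  // string length fits in one byte
    std::memcpy(buf + 16, r.name.data(), len);
}

// Inverse of encode().
Row decode(const char* buf) {
    Row r;
    std::memcpy(&r.id, buf, 8);
    std::memcpy(&r.count, buf + 8, 4);
    std::memcpy(&r.kind, buf + 12, 2);
    r.flag = static_cast<std::uint8_t>(buf[14]);
    const std::size_t len = static_cast<unsigned char>(buf[15]);
    r.name.assign(buf + 16, std::min(len, kNameCap));
    return r;
}

// Writes the row at slot `index`; the file offset is index * kRowSize.
void write_row(std::fstream& f, std::uint64_t index, const Row& r) {
    char buf[kRowSize];
    encode(r, buf);
    f.seekp(static_cast<std::streamoff>(index * kRowSize), std::ios::beg);
    f.write(buf, kRowSize);
}

// Reads the row at slot `index`; returns false if it is past the end.
bool read_row(std::fstream& f, std::uint64_t index, Row& out) {
    char buf[kRowSize];
    f.seekg(static_cast<std::streamoff>(index * kRowSize), std::ios::beg);
    if (!f.read(buf, kRowSize)) {
        f.clear();  // reset eof/fail so the stream stays usable
        return false;
    }
    out = decode(buf);
    return true;
}
```

The stream should be opened with std::ios::binary. The cost of this layout is the padding: a short string still occupies the full 32 bytes.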

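However, you can get arbitrarily more sophisticated. Protobufs will likely use less space than naive serialization, so you will have memory to spare for building an index. Even a relatively small index (for example, one mapping table row numbers to file offsets for every 100th row) would give you fast random access while still using less space overall. Of course, this is considerably more complex than the simple same-size-per-row approach.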
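The sparse index could look roughly like the sketch below. It assumes each record is an opaque byte string (for example an already-serialized protobuf message) stored with a 4-byte length prefix in host byte order, and that writing starts at offset 0 of the stream. The function names and the length-prefix framing are illustrative choices, not part of protobuf.

```cpp
#include <cstdint>
#include <fstream>
#include <istream>
#include <ostream>
#include <string>
#include <vector>

// One index entry per 100 rows, as in the example above.
constexpr std::uint64_t kStride = 100;

// Writes length-prefixed records starting at stream offset 0 and returns
// the sparse index: the file offsets of rows 0, 100, 200, ...
std::vector<std::uint64_t> write_records(std::ostream& out,
                                         const std::vector<std::string>& records) {
    std::vector<std::uint64_t> index;
    std::uint64_t offset = 0;
    for (std::size_t i = 0; i < records.size(); ++i) {
        if (i % kStride == 0) index.push_back(offset);
        const std::uint32_t len = static_cast<std::uint32_t>(records[i].size());
        out.write(reinterpret_cast<const char*>(&len), sizeof len);
        out.write(records[i].data(), len);
        offset += sizeof len + len;
    }
    return index;
}

// Random access: jump to the nearest indexed row at or before `row`,
// then skip at most kStride - 1 records by hopping over length prefixes.
bool read_record(std::istream& in, const std::vector<std::uint64_t>& index,
                 std::uint64_t row, std::string& out) {
    if (row / kStride >= index.size()) return false;
    in.clear();
    in.seekg(static_cast<std::streamoff>(index[row / kStride]), std::ios::beg);
    std::uint32_t len = 0;
    for (std::uint64_t skip = row % kStride;; --skip) {
        if (!in.read(reinterpret_cast<char*>(&len), sizeof len)) return false;
        if (skip == 0) break;
        in.seekg(static_cast<std::streamoff>(len), std::ios::cur);
    }
    out.resize(len);
    return len == 0 || static_cast<bool>(in.read(&out[0], len));
}
```

Any seekable stream opened in binary mode works, such as std::ofstream for writing and std::ifstream for reading. A smaller stride means a larger index but fewer records to skip per lookup.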

score 1 · Accepted Answer

Separate your numeric and string storage.

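Sparse tables, numeric data

Use a column store for the numeric columns. A column store does not store NULL values and provides join logic that lets you reconstruct table rows.

This is not single-seek random access, but the space trade-off may win, especially if the column store's indexes are kept in memory.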
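To make the sparse column idea concrete, here is a small in-memory sketch. It is not the API of any particular column store: `SparseColumn`, `Cell`, and `reconstruct_row` are hypothetical names, and all values are widened to uint64 for brevity. Each column keeps only its non-NULL cells as (row id, value) pairs sorted by row id, and a row is rebuilt by looking the row id up in every column.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// One numeric column that stores only non-NULL cells.
struct SparseColumn {
    // (row_id, value) pairs, kept sorted by row_id.
    std::vector<std::pair<std::uint64_t, std::uint64_t>> cells;

    // Assumption: callers append in strictly increasing row_id order.
    void append(std::uint64_t row, std::uint64_t value) {
        cells.emplace_back(row, value);
    }

    // Binary search for the row; returns false when the cell is NULL.
    bool get(std::uint64_t row, std::uint64_t& out) const {
        auto it = std::lower_bound(cells.begin(), cells.end(),
                                   std::make_pair(row, std::uint64_t(0)));
        if (it == cells.end() || it->first != row) return false;
        out = it->second;
        return true;
    }
};

// A reconstructed cell: `present` is false for NULL.
struct Cell {
    bool present;
    std::uint64_t value;
};

// "Join" the columns on row id to rebuild one table row.
std::vector<Cell> reconstruct_row(const std::vector<SparseColumn>& columns,
                                  std::uint64_t row) {
    std::vector<Cell> result;
    result.reserve(columns.size());
    for (const SparseColumn& col : columns) {
        Cell c{false, 0};
        c.present = col.get(row, c.value);
        result.push_back(c);
    }
    return result;
}
```

Reading one row costs one binary search per column rather than a single seek, which is the trade-off described above.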

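Dense tables, numeric data

MMAP the file for reading. Store your data row by row at a constant width. You may need to tune the file-open parameters to get the caching and read-ahead behavior you want.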
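A minimal POSIX sketch of reading constant-width rows through mmap follows. `MappedTable` and `read_u32` are hypothetical names, and the posix_madvise call is just one possible knob for the caching and read-ahead tuning mentioned above; whether it helps depends on the platform and access pattern.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <fcntl.h>
#include <stdexcept>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Read-only view over a file of constant-width rows (POSIX only).
class MappedTable {
public:
    MappedTable(const char* path, std::size_t row_size) : row_size_(row_size) {
        const int fd = ::open(path, O_RDONLY);
        if (fd < 0) throw std::runtime_error("open failed");
        struct stat st;
        if (::fstat(fd, &st) != 0 || st.st_size <= 0) {
            ::close(fd);
            throw std::runtime_error("fstat failed or file is empty");
        }
        size_ = static_cast<std::size_t>(st.st_size);
        void* p = ::mmap(nullptr, size_, PROT_READ, MAP_PRIVATE, fd, 0);
        ::close(fd);  // the mapping stays valid after the descriptor is closed
        if (p == MAP_FAILED) throw std::runtime_error("mmap failed");
        data_ = static_cast<const char*>(p);
        // Hint that access is random, which may reduce useless read-ahead.
        ::posix_madvise(p, size_, POSIX_MADV_RANDOM);
    }
    ~MappedTable() { ::munmap(const_cast<char*>(data_), size_); }
    MappedTable(const MappedTable&) = delete;
    MappedTable& operator=(const MappedTable&) = delete;

    std::size_t rows() const { return size_ / row_size_; }

    // Pointer to the start of row i, or nullptr if i is out of range.
    const char* row(std::size_t i) const {
        return i < rows() ? data_ + i * row_size_ : nullptr;
    }

private:
    const char* data_ = nullptr;
    std::size_t size_ = 0;
    std::size_t row_size_;
};

// Reads a uint32 field at byte offset `off` inside a row. memcpy avoids
// unaligned access; values are assumed to be in host byte order.
std::uint32_t read_u32(const char* row, std::size_t off) {
    std::uint32_t v;
    std::memcpy(&v, row + off, sizeof v);
    return v;
}
```

For example, with 8-byte rows, `read_u32(table.row(2), 4)` reads the second 4-byte field of the third row without touching the rest of the file.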

Using fstream.write() for writing may be faster.

String data

Based on your suggestions, it sounds like your design allows writing the table all at once and then performing read-only random access from that point forward. If so, look at Google's SSTable. It is a storage layer that provides efficient random access for variable-length data.

score 1 · Accepted Answer

The MIT-licensed serialization library Cap'n Proto might provide the functionality you need. It was written by Kenton Varda, the main author of Google Protobuf.

A quote from him: "Cap'n Proto uses pointers to support fully random access. This means you can do things like mmap() a giant file and pull out one inner object without actually processing the whole thing, or access sub-objects in a different order than they were written."

The data types you mention (uint8, uint16, uint32, uint64, string) are all supported by the Cap'n Proto Schema Language.
