1

I'm creating a DBMS (basically a software handling SQL queries) strictly for fun and as a learning experience. And I need to know the best way to separate values and rows.

For the table configuration I use XML as it's a good way to store information. Although this can not be done with all inserted rows as all the xml tags will take up a LOT of memory. I also thought about serializing all the objects representing a database (as I use Java) to store the data but my guess is that that too would take up a lot of memory.

So the only thing I could think of was using some value separator and row separator to take up minimum amount of memory. Although the problem with separators as single-characters (if I use multicharacter I might as well use XML) is that problems will occur if that separator is in one of the values. So I thought about if I could use a hexadecimal character with no attached symbol. Does that exist? And if so, is it a good approach? One problem is if I, in the future, starts allowing BLOBs. Those contain binary data and might contain my value separator. What is the best solution to this?

Tell me what you think! I'm open for discussion. Also, if anyone knows how MySQL (or some other widely used SQL engine) stores data, that could be interesting.

A new idea I got

What if you can read the entire table into a TreeSet loaded with different comparators based on what you are searching on/order by. Then the search would be equally fast what ever you are searching on. The downside of this is of course that the whole file will have to be written into objects that are placed in the TreeSet, could be a lot of RAM. What do you think?

4

1 回答 1

3

我首先想到的是索引。如果您继续开发 DBMS,无论如何都会遇到对各种类型索引的需求(二叉树、哈希映射等)。索引需要内容的直接映射才能有效。不会按顺序扫描文件中的行。

  • 如果您的行具有固定长度(取决于表数据定义),则您可以在记录之间以及列之间具有固定的偏移量。

  • 如果记录的长度不同,您可以按照上述相同的方式处理固定长度的列。对于动态大小的字段,文件中的另一个部分可能有一个固定大小的引用(偏移值),其中包含动态大小的值。零引用可以被视为 NULL,因为您的文件很可能有一个标题。

  • 另一种选择是维护一个行索引,其中包含行数据的单独偏移量,可能具有 2^N 粒度(分页)。偏移量应该与实际数据的对齐方式相匹配,特别是如果确实将文件映射到内存中。首先,该索引可以是用于二进制搜索的简单有序列表,可能位于单独的文件中。但是,正如您所说,这将需要一些列分隔符。我会使用某种字段长度编码,因为它不需要对实际字段内容进行特殊处理(例如转义)。在另一个结构中维护字段长度可能是有效的,该结构由该索引映射或直接嵌入该索引(因为动态列的数量是固定的)。负字段长度也可以指定 NULL 值。

  • 您可以查看 sqlite 的实现以获得想法,因为它具有非常紧凑的存储布局。

于 2013-11-07T18:42:01.093 回答