I am looking for a way to store a large amount of data in a file or set of files. An additional requirement: the data must be indexed, with two indexes on integer fields that allow a specific set of records to be selected very quickly.

Details: each data record is a fixed-length set of 3 integers, like this:

A (int) | B (int) | N (int)

A and B are indexable columns while N is just a data value.

This data set may contain anywhere from millions to billions of records (for example, 30M), and there should be a way to select all records with A= as fast as possible. Or records with B= as fast as possible.

I cannot use any technologies other than MySQL and PHP. You might say: "Wow, you can use MySQL!" Sure, I am already using it, but because of MySQL's extra data, my database takes 10 times more space than it should, plus index data.

So I am looking for a file-based solution.

Are there any existing algorithms for implementing this? Or a source-code solution?

Thank you!

Update 1:

CREATE TABLE `w_vectors` (
    `wid` int(11) NOT NULL,
    `did` int(11) NOT NULL,
    `wn` int(11) NOT NULL DEFAULT '0',
    UNIQUE KEY `did_wn` (`did`,`wn`),
    KEY `wid` (`wid`),
    KEY `did` (`did`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_520_ci

Update 2:

The goal of this table is to store document-vs-word vectors for a word-based search application. The table stores all the words from all the documents in compact form (wid is the word ID from the word vocabulary, did is the document ID, and wn is the number of the word within the document). This works pretty well. However, if you have, say, 1,000,000 documents averaging 10k words each, the table becomes very large: about 10 billion rows. At a row size of 34 bytes, that is a 340 GB structure for just 1 million documents... not good, right?
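
The estimate above can be reproduced with a quick calculation. This is only a sketch using the figures quoted in this update; the 12-byte row is an assumption added for comparison (three 4-byte INT values with no per-row overhead and no indexes).

```sql
-- Back-of-the-envelope sizing with the figures from Update 2.
-- Assumption: a "bare" row is 3 x 4-byte INT = 12 bytes, no overhead, no indexes.
SELECT
    1000000 * 10000                     AS total_rows,
    1000000 * 10000 * 34 / POW(1000, 3) AS gb_at_34_bytes_per_row,
    1000000 * 10000 * 12 / POW(1000, 3) AS gb_at_12_bytes_per_row;
```

With these inputs the row payload alone would be about 120 GB, so even a perfectly packed file stays large at this scale.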

I am looking for a way to optimize this.

2 Answers

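If you must use MySQL, you can try:

  • Converting the table to MyISAM, which takes less space than InnoDB and still allows multiple indexes per table. I rarely recommend MyISAM, because it does not support ACID properties. But if you go with a file-based solution, that will not support ACID either.

  • Using one of the various solutions for compressing data in MySQL. There is a good comparison here: https://www.percona.com/blog/2018/11/23/compression-options-in-mysql-part-1/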
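
As a rough sketch, the two options above might look like this for the w_vectors table from the question. The two ALTER statements are alternatives, not steps to run in sequence. The KEY_BLOCK_SIZE value is an illustrative assumption, and the compressed row format assumes innodb_file_per_table is enabled.

```sql
-- Measure the current footprint first (the figures are approximate for InnoDB).
SELECT table_name, engine, row_format, data_length, index_length
FROM information_schema.TABLES
WHERE table_schema = DATABASE()
  AND table_name = 'w_vectors';

-- Alternative 1: convert to MyISAM (smaller rows, but no transactions).
ALTER TABLE w_vectors ENGINE = MyISAM;

-- Alternative 2: stay on InnoDB and use its compressed row format.
-- KEY_BLOCK_SIZE = 8 is an example value, not a recommendation.
ALTER TABLE w_vectors ROW_FORMAT = COMPRESSED KEY_BLOCK_SIZE = 8;
```

Running the size query again after either change shows how much space was actually saved. On a table with billions of rows, each ALTER rebuilds the table and can take a long time.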

Answered 2021-02-11T22:07:48.753

You may as well change

UNIQUE KEY `did_wn` (`did`,`wn`)

to

PRIMARY KEY(did, wn)

and get rid of

INDEX(did)

because that composite index already takes care of lookups by did.
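
Put together, the change could be applied in a single statement. This is a sketch against the table definition from Update 1, using the index names shown there. On InnoDB, adding a primary key rebuilds the table, so expect it to take a while on a large data set.

```sql
-- Replace the unique key with a primary key on the same columns,
-- and drop the index on did that the composite key makes redundant.
ALTER TABLE w_vectors
    DROP INDEX did_wn,
    DROP INDEX did,
    ADD PRIMARY KEY (did, wn);

-- Confirm the resulting definition.
SHOW CREATE TABLE w_vectors;
```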

With that PK, these will be very efficient:

... WHERE did = 123
... WHERE did = 123 AND wn = 456
... WHERE wn = 456 AND did = 123

Meanwhile, any WHERE clause that tests a single wid value or a range of wids can benefit from INDEX(wid).
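
One way to check which index a query uses is EXPLAIN. The queries below are only a sketch: they assume the primary key (did, wn) suggested in this answer is in place, and the literal values are made up. The optimizer may still choose differently depending on table statistics.

```sql
-- Expected to be resolved through the primary key (did, wn).
EXPLAIN SELECT wid, wn FROM w_vectors WHERE did = 123;
EXPLAIN SELECT wid FROM w_vectors WHERE did = 123 AND wn = 456;

-- Expected to be able to use the secondary index on wid.
EXPLAIN SELECT did, wn FROM w_vectors WHERE wid = 789;
EXPLAIN SELECT did, wn FROM w_vectors WHERE wid BETWEEN 100 AND 200;
```

The key column of the EXPLAIN output names the index that was chosen.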

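Since I don't know which columns your original A and B are, I can't answer your question in terms of the real column names. Anyway:

"there should be a way to select all records with A= as fast as possible. Or records with B= as fast as possible."

For those, you need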

INDEX(A)  -- or any index _starting with_ A
INDEX(B)  -- or any index _starting with_ B

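But if either of them is did, don't add it. (The PK will take care of making it fast.)

Also, use InnoDB, not MyISAM. Alas, in your case that leads to the "10 times more space than it should". If you choose to use MyISAM, I will need to start over on the index advice.

Once you map A and B to the column names, I'll give you another tip.

More discussion of indexing: http://mysql.rjweb.org/doc.php/index_cookbook_mysql

Answered 2021-02-14T05:28:17.203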