
We have an application with a table containing 20+ columns, all of which are searchable. Indexing all of these columns would make write queries very slow; and any genuinely useful index usually has to span multiple columns, which multiplies the number of indexes required.

However, 95% of these searches only need to touch a small subset of the rows — a fairly small number, say 50,000 rows.

So we have been considering MySQL partitioned tables — essentially having an isActive column on which we split the table into two partitions. Most search queries would use isActive=1, so they would run against the small 50,000-row partition and be fast even without other indexes.

The only issue is that isActive=1 is not fixed for a row; i.e. it is not based on the row's date or anything similarly static; we will need to update isActive depending on how the data in that row is used. As I understand it, that is fine; during the UPDATE query the data will simply be moved from one partition to the other.

We do, however, have a PK on the row's id; I'm not sure whether that is a problem; the manual seems to suggest that the partitioning must be based on any primary key. That would be a huge problem for us, since the primary key id bears no relation to whether the row isActive.
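For reference, MySQL's actual rule is that every unique key on a partitioned table (the primary key included) must contain all columns used in the partitioning expression, so a composite primary key is one way around this. A minimal sketch, with table and column names assumed:

```sql
-- Hypothetical sketch: LIST-partition on isActive.
-- MySQL requires every unique key to include the partitioning
-- column, so the PK becomes (id, isActive) instead of (id).
CREATE TABLE items (
    id       INT NOT NULL AUTO_INCREMENT,
    isActive TINYINT NOT NULL DEFAULT 1,
    payload  VARCHAR(255),
    PRIMARY KEY (id, isActive)        -- composite, not id alone
)
PARTITION BY LIST (isActive) (
    PARTITION p_active   VALUES IN (1),
    PARTITION p_inactive VALUES IN (0)
);
```

In practice id stays unique anyway, since AUTO_INCREMENT keeps issuing distinct values; the composite key exists only to satisfy the partitioning rule.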


4 Answers


I am not a MySQL specialist. My focus is Oracle, but I've worked with partitioning for years, and I find your proposed use quite appropriate, though outside the mainstream understanding of partitioning.

Indexes on low-cardinality columns

Set index merges aside for the moment. Assume your active rows are somewhat scattered, with a ratio of about 1:20 to the inactive rows. Assume your page size is 8Kb, giving roughly 20 rows per block. With a very even distribution of isactive records, you would have nearly one per block. A full table scan, reading every block/page in the table, would then be much faster than using an index to find those same rows.

So assume instead that they are clustered rather than evenly scattered. Even if they are concentrated in 20% of the pages, or even 10%, a full table scan can still outperform the index in those cases.

Now bring index merges in. If, after scanning the index on isactive, you do not visit the table but instead join those results to the results of another index, and the final result set would require reading, say, fewer than 5% of the blocks — then yes, an index on isactive plus an index merge could be a solution.

The caveat is that MySQL's implementation of index joins has a lot of restrictions. Make sure this will work in your situation. But you said you have another 20 fields that can be searched. If you don't index all of them, so that a second usable index is available to join the IsActive index against, you won't get the index merge/join at all.
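To make the condition concrete (index and column names here are assumed, not from the question): an intersection-style merge is only available when each predicate has its own usable index, which EXPLAIN will confirm or deny.

```sql
-- Hypothetical: both columns carry single-column indexes, so the
-- optimizer *may* choose an index_merge (intersect) plan.
ALTER TABLE items ADD INDEX idx_active (isActive),
                  ADD INDEX idx_owner  (owner_id);

EXPLAIN SELECT * FROM items
WHERE isActive = 1 AND owner_id = 42;
-- If the merge applies, EXPLAIN reports type = index_merge and
-- Extra: Using intersect(idx_active,idx_owner).
```

If only one of the two predicates is indexed, the plan degrades to a single-index lookup plus a filter, which is the failure mode described above.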

Partitioning on a low-cardinality column

Now, if you partition on that column, you will have 5% of the blocks holding IsActive = True, and they will be densely packed. A full partition scan will rapidly produce the list of active records, and allows all the other predicates to be applied as filters rather than index lookups.

But that flag changes, right.

In Oracle we have a command that lets us enable row movement. It means that when Is_Active changes from True to False, the row is moved to the partition where it now belongs. That is quite expensive, but only a little more so than the index maintenance that would occur if you indexed that column instead of partitioning by it. In the partitioned case, Oracle first changes the row via the update, then performs a delete, then an insert. If you indexed that column instead, you would update the row, then delete the index entry for True, then create the index entry for False.

If MySQL has no row movement, then you'll have to program your CRUD package to do it. A procedure UPDATE_ROW_ISACTIVE(pk IN number) <---- something like that) would perform the delete and insert for you.
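For what it's worth, MySQL's native partitioning does relocate a row automatically when an UPDATE changes the partitioning column, so with a genuinely partitioned table a plain UPDATE suffices. The hand-rolled wrapper only becomes necessary if partitions are simulated with two separate tables; a minimal sketch under that assumption (all table names hypothetical):

```sql
-- Hypothetical: "partitions" simulated as two physical tables,
-- items_active and items_inactive, with identical columns.
DELIMITER $$
CREATE PROCEDURE UPDATE_ROW_ISACTIVE(IN p_id INT)
BEGIN
    -- Emulate row migration: copy the row across, then
    -- remove the original, mirroring the delete+insert
    -- the answer describes.
    INSERT INTO items_inactive
        SELECT * FROM items_active WHERE id = p_id;
    DELETE FROM items_active WHERE id = p_id;
END$$
DELIMITER ;
```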

Regarding Konerak's answer

While I agree that parallel access is one use of partitioning, it is not the only one. But if you follow the link he provides, the user comment at the very bottom of the page reads:

Beware of low-selectivity indexes on your table. Complex AND/OR WHERE clauses can definitely make your queries very slow if the Index_Merge optimization is used with the intersect() algorithm.

That seems pertinent to your situation, so take that comment FWIW.

Answered 2011-01-06T17:31:10.320

If you're going to index that many "columns", you may want to rethink your data structure. For example, turn each column into a row/record instead. Then have a "group id" that links the individual records together, and a "name" field to indicate what the data is. Then you only need one index for all of your data.

This name/value pair setup is actually quite common nowadays, and is what some noSQL databases are based on. That's something else you might want to look into. Something like MongoDB is very good at indexing "all" of the data.
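A minimal sketch of that name/value (entity-attribute-value) layout, with all names assumed for illustration:

```sql
-- Hypothetical EAV layout: one narrow row per searchable field
-- instead of one wide row per record.
CREATE TABLE item_attributes (
    group_id INT          NOT NULL,   -- links the fields of one record
    name     VARCHAR(64)  NOT NULL,   -- which field this row holds
    value    VARCHAR(255) NOT NULL,
    PRIMARY KEY (group_id, name),
    INDEX idx_search (name, value)    -- the single index serving
                                      -- searches on every field
);

-- One record's 20+ fields become 20+ rows, e.g.
-- (42, 'category', 'billing'), (42, 'priority', 'high'), ...
```

The trade-off is that reassembling a full record requires a pivot or multiple self-joins, which is why this design suits search-heavy rather than read-whole-record workloads.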

Answered 2011-01-08T15:10:05.360

You don't need partitioning for this — an index on the isActive column would be enough. Note that MySQL can use an Index Merge operation to combine that index with the others.

Partitions become useful when they allow searches to be executed in parallel: for example, if you partition by date, you can search 5 partitions simultaneously to find results spanning 5 years.
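The date-based scheme mentioned above can be sketched as follows (table and partition names assumed):

```sql
-- Hypothetical RANGE partitioning by year: a query spanning
-- five years only has to touch the five matching partitions
-- (partition pruning), skipping the rest of the table.
CREATE TABLE events (
    id      INT  NOT NULL,
    created DATE NOT NULL,
    PRIMARY KEY (id, created)          -- must include the
                                       -- partitioning column
)
PARTITION BY RANGE (YEAR(created)) (
    PARTITION p2006 VALUES LESS THAN (2007),
    PARTITION p2007 VALUES LESS THAN (2008),
    PARTITION p2008 VALUES LESS THAN (2009),
    PARTITION p2009 VALUES LESS THAN (2010),
    PARTITION p2010 VALUES LESS THAN (2011),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);
```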

Answered 2010-12-13T13:25:54.440

Your description of the "table" and the "database" are classic symptoms of a lack of Normalisation. A "table" with 20 searchable columns is not 3NF and probably not even 1NF. The best advice is to go back to first principles and normalise the data. That will result in much narrower tables, and also fewer rows per table — but sure, more tables. However, the result also has fewer indices, per table and overall.

And a much faster database. Fat-wide "tables" are a disaster for performance, at every level.

Partitions do not apply here, they will not ease your problem.

An id PK is an additional column and index, a surrogate, a substitute (but not a replacement) for the real Primary Key. If you used Relational modelling techniques, that can be eliminated, at least getting down to 19 searchable indices. Any serious work on the "table" will be centred around the real PK, not the surrogate, for example, as you have seen from the restrictions re Partitions.

If you wish to discuss it, please post your DDL for the "table", plus every connected "table".

Response to Comments

The table is best thought of as "emails" but with a lot of extra fields (category/department/priority/workflow/owner) which are all properly normalised. There are a range of other variables as well including quite a lot of timestamps.

That's the very definition of a flat file, at 0NF. Unless you are using some unwritten definition of "Normalisation", it is, by your own description, not Normalised at all. It is the starting point one has before any Normalisation is commenced.

  • No doubt the indices will be fat-wide as well, in order to be useful for queries.

  • and you may not have realised yet, there is massive data duplication in that file, and Update Anomalies (when you update a column in one row, you have to update the duplicated value in the other rows), which makes your application unnecessarily complex.

You need to understand that all the Relational DBMS vendors write Relational database engines that are optimised to handle Relational databases. That means they are optimised for Normalised, not Unnormalised or Denormalised, structures.

I will not be drawn into academic arguments, and SO is a question-and-answer site, not a debating site. As requested, post your DDL for the file, and all connected files, and we can definitely (a) give it some speed and (b) avoid 20+ indices (which is another common symptom of the condition). That will deal with a specific real-world issue and solve it, and avoid debate.

Second, you seem to have the roles mixed up. It is you, with the problem, posting the question on SO, and it is me who has fixed hundreds of performance problems, answering. By definition the solution is outside your domain, otherwise you would have solved it, and thus you would not be posting a question; so it does not work when you tell me how to fix your problem. That would be tying me up in the same limitations that you have, and thus ensuring that I do not fix the problem.

Also from our tests, having lots of tables to JOIN against that we need to include in the WHERE clause only makes the query slower.

Actually I tune databases for a living, and I have hundreds of tests that demonstrate joining many, smaller, tables is faster. It would be interesting to look into the test and the coding capability of the coder, but that would start a debate, so let's not do that; let's stick to the question. If you want examples of (a) serious testing which (b) proves what I have stated before being challenged, here's just one example fully documented and under scrutiny of, and corresponding test with, stalwarts in the Oracle world.

You may also be interested in this question/answer, which killed the same debate you are approaching.

Joins cost nothing. The files you join to; the number of records joined on either side; the usefulness of the indices — that is where the cost lies. If it is another Unnormalised file (fat, wide, many optional columns), sure it will be slow.

Anyway, if you are genuinely interested in fixing your posted problem, post all your DDL and we can make it faster for you. If all you want is a yes/no answer re partitions (and to not address the causative problem), that's fine too; you already have that.

Answered 2011-01-08T14:44:39.237