mysql - Optimizing queries based on clustered and non-clustered indexes in SQL?

Question

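I have recently been reading about how clustered and non-clustered indexes work. My understanding, in simple terms (please correct me if I am wrong):

The data structure backing both clustered and non-clustered indexes is a B-tree.

Clustered index: physically sorts the data based on the index column (or key). You can have only one clustered index per table. If no index is specified when the table is created, SQL Server automatically creates a clustered index on the primary key column.

Q1: Since the data is physically sorted based on the index, no extra space is needed here. Is this correct? And what happens when I drop the index I created?

Non-clustered index: in a non-clustered index, the leaf nodes of the tree contain the column values and a pointer (row locator) to the actual row in the database. Extra space is needed to physically store this non-clustered index on disk. However, there is no limit on the number of non-clustered indexes.

Q2: Does this mean that a query on a non-clustered index column will not return sorted data?

Q3: There is an extra lookup involved here to locate the actual row data using the pointer at the leaf node. How much of a performance difference does this make compared to a clustered index?

Exercise:

Consider an Employee table: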

CREATE TABLE Employee
(
PersonID int PRIMARY KEY,
Name varchar(255),
age int,
salary int
);

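Now I have created an Employee table (a default clustered index on PersonID is created).

The two frequent queries on this table touch only the age and salary columns. For simplicity, assume the table is not updated frequently.

For example: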

select * from employee where age > XXX;

select * from employee where salary > XXXX and salary < YYYY;

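Q4: What is the best way to build indexes so that queries on both of these columns have similar performance? If I have a clustered index on the age column, queries on age will be fast, but queries on salary will be slower.

Q5: On a related note, I have repeatedly seen that indexes (both clustered and non-clustered) should be created on columns with unique constraints. Why is that? What happens if this is not done?

Thanks a lot. The posts I read are here: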

http://javarevisited.blogspot.com/2013/08/difference-between-clustered-index-and-nonclustered-index-sql-server-database.html

http://msdn.microsoft.com/en-us/library/ms190457.aspx

score 5 · Accepted Answer

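I don't know the internals of Microsoft SQL Server, but I can answer for MySQL, which you tagged your question with. Details may vary for other implementations.

Q1. Right, the clustered index needs no extra space.

What happens if you drop the clustered index? MySQL's InnoDB engine always uses the primary key (or the first non-null unique key) as the clustered index. If you define a table without a primary key, or you drop the primary key of an existing table, and there is no non-null unique key to fall back on, InnoDB generates an internal artificial key for the clustered index. This internal key has no logical column you can use to reference it.

Q2. The order of rows returned by a query that uses a non-clustered index is not guaranteed. In practice, it is the order in which the rows were accessed. If you need rows returned in a specific order, use ORDER BY in your query. If the optimizer can infer that the order you want is the same as the order in which it accesses the rows (index order, whether through the clustered index or a non-clustered index), it can skip the sorting step.

Q3. An InnoDB non-clustered index does not store a pointer to the corresponding row at the leaf of the index; it stores the primary key value. So a lookup through a non-clustered index is really two B-tree searches: the first finds the leaf of the non-clustered index, and the second searches the clustered index.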
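As a rough illustration of the two searches, here is a toy Python model (not InnoDB code). Sorted lists stand in for the B-tree leaf levels; the names `clustered`, `age_index`, `find_by_pk`, and `find_by_age`, and the sample rows, are made up for this sketch.

```python
from bisect import bisect_left

# Toy model: sorted lists stand in for B-tree leaf levels.
# Clustered index: full rows kept in primary-key order.
clustered = [
    (1, {"name": "Ann", "age": 41, "salary": 5200}),
    (2, {"name": "Bob", "age": 29, "salary": 3100}),
    (3, {"name": "Cid", "age": 35, "salary": 4700}),
]
clustered_keys = [pk for pk, _ in clustered]

# Secondary index on age: each entry holds (age, primary key), not a row pointer.
age_index = sorted((row["age"], pk) for pk, row in clustered)


def find_by_pk(pk):
    """Search the clustered index for one primary-key value."""
    i = bisect_left(clustered_keys, pk)
    if i < len(clustered_keys) and clustered_keys[i] == pk:
        return clustered[i][1]
    return None


def find_by_age(age):
    """Search 1: the secondary index. Search 2: the clustered index, once per match."""
    i = bisect_left(age_index, (age,))
    rows = []
    while i < len(age_index) and age_index[i][0] == age:
        rows.append(find_by_pk(age_index[i][1]))
        i += 1
    return rows


print(find_by_age(35))
```

A lookup by primary key needs only `find_by_pk`, which is why a primary-key search does roughly half the work of a secondary-index search in this model.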

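This is double the cost of a single B-tree search (more or less), so InnoDB has an extra feature called the Adaptive Hash Index. Frequently searched values are cached in the AHI, and the next time a query searches for a cached value, it can do an O(1) lookup. In the AHI cache it finds a pointer directly to the leaf of the clustered index, so it eliminates the B-tree searches part of the time.

How much this improves overall performance depends on how often you search for the same values that were searched before. In my experience, the ratio of hash searches to non-hash searches is typically about 1:2.

Q4. Build indexes to serve the queries you need to optimize. Typically the clustered index is a primary key or unique key, and at least in the case of InnoDB, this is required. Neither age nor salary is likely to be unique.

You might like my presentation, How to Design Indexes, Really.

Q5. When you declare a unique constraint, InnoDB automatically creates an index. You cannot have the constraint without an index existing for it. Without an index, how would the engine ensure uniqueness when you insert a value? It would need to search the whole table for a duplicate value in that column. The index makes the uniqueness check much more efficient.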
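The difference between the two ways of checking uniqueness can be sketched with a toy Python class. This is only an illustration under my own assumptions: `UniqueColumn` and its methods are hypothetical names, and a sorted list searched with `bisect` stands in for the B-tree index.

```python
from bisect import bisect_left, insort


class UniqueColumn:
    """Toy unique constraint: unordered stored values plus a sorted 'index'."""

    def __init__(self):
        self.rows = []   # table data in insertion order
        self.index = []  # sorted values, standing in for a B-tree index

    def exists_by_scan(self, value):
        # Without an index: compare against every stored value, O(n).
        return any(v == value for v in self.rows)

    def exists_by_index(self, value):
        # With an index: binary search, O(log n).
        i = bisect_left(self.index, value)
        return i < len(self.index) and self.index[i] == value

    def insert(self, value):
        # The uniqueness check goes through the index, not a table scan.
        if self.exists_by_index(value):
            raise ValueError(f"duplicate value: {value!r}")
        self.rows.append(value)
        insort(self.index, value)


col = UniqueColumn()
col.insert(30)
col.insert(10)
print(col.exists_by_index(10), col.exists_by_scan(10))
```

Both checks give the same answer; the point is that the scan grows linearly with the table while the index search grows logarithmically.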

score 3 · Accepted Answer

For SQL Server

Q1 Extra space is only needed for the clustered index if it is not unique. SQL Server will add a 4-byte uniquifier internally to a non-unique clustered index. This is because it uses the cluster key as the row ID in non-clustered indexes.

Q2 A non-clustered index can be read in order. That may aid queries where you specify an order. It may also make merge joins attractive. It will also help with range queries (x < col and y > col).

Q3 SQL Server does an extra "bookmark lookup" when using a non-clustered index, but only if it needs a column that isn't in the index. Note also that you can include extra columns in the leaf level of indexes. If an index can be used without the additional lookup, it is called a covering index.
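To make the covering idea concrete, here is a toy Python model that counts lookups. It is not how SQL Server is implemented: the dict `clustered`, the list `age_index`, the function `query`, and the sample rows are all assumptions made for the sketch.

```python
# Toy model: count how many lookups into the clustered index a query needs.
clustered = {  # PersonID -> full row
    1: {"name": "Ann", "age": 41, "salary": 5200},
    2: {"name": "Bob", "age": 29, "salary": 3100},
}

# Non-clustered index on age that also carries salary at the leaf level.
# Each entry: (age, PersonID, included columns).
age_index = sorted(
    ((row["age"], pk, {"salary": row["salary"]}) for pk, row in clustered.items()),
    key=lambda entry: entry[:2],
)

INDEX_COLUMNS = {"age", "salary"}  # columns the index can supply by itself


def query(min_age, columns):
    """Return (results, lookups) for: SELECT columns ... WHERE age > min_age."""
    results, lookups = [], 0
    covered = set(columns) <= INDEX_COLUMNS
    for age, pk, included in age_index:
        if age <= min_age:
            continue
        if covered:
            row = {"age": age, **included}  # answered from the index alone
        else:
            row = clustered[pk]             # extra lookup for the missing column
            lookups += 1
        results.append({c: row[c] for c in columns})
    return results, lookups


print(query(30, ["age", "salary"]))
print(query(30, ["name"]))
```

The first query is covered and reports zero lookups; the second needs `name`, which the index does not carry, so each matching row costs one lookup.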

If a bookmark lookup is required, it does not take a high percentage of matching rows before it becomes quicker just to scan the whole clustered index. The exact level depends on row size, key size, etc., but 5% of rows is a typical cutoff.
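Why the cutoff depends on row size can be seen from a deliberately crude cost model. The assumptions here are mine, not SQL Server's actual costing: a full scan reads every data page once, and every bookmark lookup costs a fixed number of page reads (`pages_per_lookup`, a hypothetical parameter). Caching and the difference between random and sequential I/O are ignored.

```python
def break_even_fraction(rows_per_page, pages_per_lookup=1.0):
    """Fraction of matching rows at which lookups cost as much as a full scan.

    scan cost   = total_rows / rows_per_page               (pages)
    lookup cost = total_rows * fraction * pages_per_lookup (pages)
    Setting the two equal and solving for fraction gives the value returned.
    """
    return 1.0 / (rows_per_page * pages_per_lookup)


for rows_per_page in (10, 20, 100):
    print(rows_per_page, break_even_fraction(rows_per_page))
```

Under this model, 20 rows per page puts the break-even point at 5% of rows. Wider rows (fewer per page) raise it and narrower rows lower it, which is the row-size dependence mentioned above.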

Q4 If the most important thing in your application is making both of these queries as fast as possible, you could create a covering index for each of them:

create index IX_1 on employee (age) include (name, salary);
create index IX_2 on employee (salary) include (name, age);

Note you don't have to specifically include the cluster key, as the non-clustered index has it as the row pointer.

Q5 This is more important for cluster keys than non-cluster keys due to the uniquifier. The real issue though is whether an index is selective or not for your queries. Imagine an index on a bit value. Unless the distribution of data is very skewed, such an index is unlikely to be used for anything.

More info about the uniquifier. Imagine you had a non-unique clustered index on age, and a non-clustered index on salary. Say you had the following rows:

age | salary | uniquifier
20  | 1000   | 1
20  | 2000   | 2

Then the salary index would locate rows like so

1000 -> 20, 1
2000 -> 20, 2

Say you ran the query select * from employee where salary = 1000, and the optimizer chose to use the salary index. It would then find the pair (20, 1) from the index lookup, then look up this value in the main data.
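The same two rows can be walked through in a toy Python model, with dicts standing in for the B-trees. This is an illustrative sketch only; `clustered`, `salary_index`, and `find_by_salary` are names I made up.

```python
# Clustered index on the non-unique age column: the key is (age, uniquifier).
clustered = {
    (20, 1): {"age": 20, "salary": 1000},
    (20, 2): {"age": 20, "salary": 2000},
}

# Non-clustered index on salary: salary -> list of cluster keys.
salary_index = {}
for key, row in clustered.items():
    salary_index.setdefault(row["salary"], []).append(key)


def find_by_salary(salary):
    """Step 1: the salary index yields cluster keys. Step 2: look each key up in the main data."""
    return [clustered[key] for key in salary_index.get(salary, [])]


print(salary_index[1000])
print(find_by_salary(1000))
```

Without the uniquifier both rows would share the cluster key 20, and the salary index could not tell which of the two rows an entry refers to.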
