sql - How do you measure the cost of database indexes?

Question

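Is there a good way to judge whether the costs of creating database indexes in Postgres (slower INSERTs, time to build the index, time to re-index) are worth the performance gains (faster SELECTs)?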

score 5 · Accepted Answer

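I actually disagree with Hexist. PostgreSQL's planner is very good, and it supports good sequential access to table files based on physical-order scans, so indexes are not necessarily going to help. There are also many cases where the planner has to pick an index. And you are already creating indexes for unique constraints and primary keys.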

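I think a good default position with PostgreSQL (MySQL, by the way, is totally different!) is to wait until you need an index before adding one, and then to add only the indexes you most clearly need. This is only a starting point, however, and it assumes either little experience reading query plans or little knowledge of where the application is likely to go. Experience in these areas matters.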

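In general, if your tables are likely to span more than 10 pages (roughly 80 kB of data and headers at the default 8 kB page size), it is a good idea to index foreign keys. These can be assumed to be clearly needed. Small lookup tables that span a single page should not have non-unique indexes, because those indexes will never be used for selection (no query plan beats a sequential scan of a single page).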
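A rough way to check whether a table has crossed that size threshold is to look at its page count in pg_class and, if it has, to index the referencing column. This is only a sketch: general_journal is the table used in the example later in this answer, while account_id is a hypothetical foreign key column assumed for illustration.

```sql
-- relpages is the planner's estimate of the table size in pages;
-- it is refreshed by VACUUM and ANALYZE, so it can lag behind reality.
SELECT relname,
       relpages,
       pg_size_pretty(pg_relation_size(oid)) AS table_size
FROM pg_class
WHERE relname = 'general_journal'
  AND relkind = 'r';

-- Hypothetical foreign key column: account_id is an assumption,
-- not something named in the original answer.
CREATE INDEX gj_account_id_idx ON general_journal (account_id);
```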

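Beyond that, you also need to look at data distribution. Indexing boolean columns is usually a bad idea, and there are better ways to index things related to boolean searches (partial indexes being a good example). Similarly, indexing the output of a commonly used function may sometimes seem like a good idea, but that is not always the case. Consider: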

CREATE INDEX gj_transdate_year_idx ON general_journal (extract('YEAR' FROM transdate));

This will not do much. However, an index on transdate might be useful if paired with a sparse index scan via a recursive CTE.
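One way to picture such a sparse index scan is the recursive CTE pattern below, which hops from one year to the next through a plain b-tree index on transdate instead of reading every row. It is a sketch under assumptions: transdate is assumed to be of type date, the index name gj_transdate_idx is made up for illustration, and make_date requires PostgreSQL 9.4 or later.

```sql
-- Plain b-tree index on the raw column (index name is illustrative).
CREATE INDEX gj_transdate_idx ON general_journal (transdate);

-- List the distinct years present, touching roughly one index entry per year.
WITH RECURSIVE years AS (
    -- Anchor: the earliest transaction date.
    (SELECT transdate
     FROM general_journal
     ORDER BY transdate
     LIMIT 1)
    UNION ALL
    -- Step: jump to the first date on or after 1 January of the following year.
    -- The scalar subquery yields NULL once no later date exists.
    SELECT (SELECT gj.transdate
            FROM general_journal gj
            WHERE gj.transdate >= make_date(extract(year FROM y.transdate)::int + 1, 1, 1)
            ORDER BY gj.transdate
            LIMIT 1)
    FROM years y
    WHERE y.transdate IS NOT NULL
)
SELECT extract(year FROM transdate) AS transaction_year
FROM years
WHERE transdate IS NOT NULL;
```

Running the final query under EXPLAIN ANALYZE is the way to confirm that each step is served from the index on your data.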

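Once the basic indexes are in place, the question becomes which other indexes you need to add. This is usually better left to a later review of use cases than designed in up front. It is not uncommon for people to find that performance on PostgreSQL improves significantly with fewer indexes.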
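For that later review, PostgreSQL's statistics views give a rough picture of which indexes are actually being scanned and how much space they take. A minimal sketch, assuming the usage statistics have been collected over a representative workload:

```sql
-- Indexes ordered by how rarely they are scanned, with their on-disk size.
-- idx_scan counts index scans since the statistics were last reset.
SELECT schemaname,
       relname      AS table_name,
       indexrelname AS index_name,
       idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC, pg_relation_size(indexrelid) DESC;
```

An index with idx_scan = 0 may still be needed to enforce a unique or primary key constraint, so a low count is a prompt to investigate rather than proof that the index can be dropped.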

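Another major thing to consider is what type of index you create, and this is often use-case specific. For example, a b-tree index may make sense if ordinality matters to the domain and you frequently search on the initial elements; but if ordinality does not matter, I would recommend a GIN index, because a b-tree will do very little good (this is of course an atomicity red flag, but sometimes it makes sense in Pg). Even when ordinality matters, you sometimes still need a GIN index, because you have to be able to do commutative scans as if ordinality did not matter. The same need arises if you use ip4r, for example, to store CIDR blocks with an EXCLUDE constraint (which PostgreSQL backs with a GiST index rather than GIN) to ensure that no block contains any other block (the actual scan has to use an overlap operator rather than a containment operator, since you don't ...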
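The use-case-specific index choices discussed in this answer can be sketched as follows. The documents table and all of its columns are hypothetical and exist only to illustrate the syntax, and the multi-valued column is assumed to be an array; whether each index pays off still depends on the data and the queries.

```sql
-- Hypothetical table used only for illustration.
CREATE TABLE documents (
    id         bigserial   PRIMARY KEY,
    tags       text[]      NOT NULL,
    archived   boolean     NOT NULL DEFAULT false,
    created_at timestamptz NOT NULL DEFAULT now()
);

-- GIN index: supports order-independent array searches such as
-- overlap (&&) and containment (@>).
CREATE INDEX documents_tags_gin_idx ON documents USING gin (tags);

SELECT id FROM documents WHERE tags && ARRAY['invoice', 'draft'];

-- Partial index: instead of indexing the boolean column itself,
-- index only the rows that the common query cares about.
CREATE INDEX documents_active_created_idx
    ON documents (created_at)
    WHERE NOT archived;

SELECT id
FROM documents
WHERE NOT archived
ORDER BY created_at DESC
LIMIT 50;
```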

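Again, this is somewhat database-specific. On MySQL, for example, Hexist's recommendations would be correct. On PostgreSQL, though, it is better to watch for problems and then act on them.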

As far as measuring goes, the best tool is EXPLAIN ANALYZE.
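EXPLAIN ANALYZE can also be pointed at the write side of the trade-off. The sketch below loads the same rows into a scratch copy of the table without and with an index and reports the index size. The scratch table and index names are made up for illustration, and everything is rolled back at the end. EXPLAIN ANALYZE really executes the statement, which is why it runs against a temporary copy inside a transaction.

```sql
BEGIN;

-- Empty temporary copy with the same columns but no indexes.
CREATE TEMP TABLE gj_scratch (LIKE general_journal) ON COMMIT DROP;

-- Baseline: bulk insert with no index to maintain.
EXPLAIN ANALYZE
INSERT INTO gj_scratch SELECT * FROM general_journal;

-- Build the candidate index on the populated copy;
-- with \timing enabled in psql this shows the build time.
CREATE INDEX gj_scratch_transdate_idx ON gj_scratch (transdate);

-- Disk space taken by the index.
SELECT pg_size_pretty(pg_relation_size('gj_scratch_transdate_idx')) AS index_size;

-- Same insert again, now paying for index maintenance on every row.
TRUNCATE gj_scratch;

EXPLAIN ANALYZE
INSERT INTO gj_scratch SELECT * FROM general_journal;

ROLLBACK;
```

Temporary tables are not WAL-logged, so the absolute timings will differ from those of a regular table; the relative difference between the two inserts is the figure of interest.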

score 3 · Accepted Answer

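Generally speaking, unless you have a log or archive table that you will not be selecting from very often (or where it is fine if selects take a while to run), you should index anything your SELECT/UPDATE/DELETE statements will use in a WHERE clause.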

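This is not always as simple as it seems, however: just because a column is used in a WHERE clause and is indexed does not mean the SQL engine will be able to use the index. With PostgreSQL's EXPLAIN and EXPLAIN ANALYZE you can examine which indexes were used in a select, which helps you work out whether an index on a column will help you at all.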
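As an illustration of an indexed column that the planner still cannot use, compare the two queries below. The sketch assumes a plain b-tree index on general_journal (transdate), the table used in the answer above, and that transdate is a date; the plans you actually get depend on table size and statistics.

```sql
-- Assumed to exist for this sketch:
-- CREATE INDEX gj_transdate_idx ON general_journal (transdate);

-- The column is wrapped in a function, so an index on transdate itself
-- cannot be used to satisfy this predicate.
EXPLAIN
SELECT *
FROM general_journal
WHERE extract(year FROM transdate) = 2012;

-- The same condition written as a range on the bare column can use the index.
-- ANALYZE runs the query and adds actual row counts and timings to the plan.
EXPLAIN ANALYZE
SELECT *
FROM general_journal
WHERE transdate >= DATE '2012-01-01'
  AND transdate <  DATE '2013-01-01';
```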

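This is generally true because without an index your select speed drops from an O(log n) lookup to O(n), while your insert speed only improves from c·O(log n) to d·O(log n), where d is usually less than c. In other words, you may speed up inserts a little by not having an index, but you will kill your select speed, so it is almost always worth having an index on data you are going to select against.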

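Now, if you have some small table on which you do a lot of inserts and updates, frequently remove all the entries, and only occasionally run selects, it could turn out to be faster to have no indexes at all. That would be a fairly special case, though, so you would have to run some benchmarks and decide whether it makes sense in your specific situation.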

score 0 · Accepted Answer

Nice question. I'd like to add a bit to what @hexist has already mentioned and to the information in @ypercube's link.

By design, a database doesn't know in which part of a table it will find the rows that satisfy the given predicates. Without an index, it therefore performs a full (sequential) scan of all the table's data and filters out the rows it needs.

An index is a special data structure that, for a given key, can specify precisely which rows of the table hold that value. The main differences when an index is involved:

there is a cost for the index scan itself, i.e. the DB has to find the value in the index first;
there is an extra cost for reading the specific rows from the table itself.

Working with an index leads to a random I/O pattern, compared with the sequential one used in a full scan. You can look up comparison figures for random and sequential disk access; they can differ by up to an order of magnitude (random being slower, of course).

Still, it's clear that in some cases index access will be cheaper and in others a full scan should be preferred. This depends on how many rows (out of all of them) the specified predicate returns, that is, on its selectivity:

1. if the predicate returns a relatively small number of rows, say less than 10% of the total, it pays to pick those rows directly via the index. This is the typical case for primary/unique keys or for queries like: I need the address information for the customer with internal number = XXX;
2. if the predicate has no big impact on selectivity, i.e. if 30% (or more) of the rows are returned, it's cheaper to do a full scan, because sequential disk access beats random access and the data is delivered faster. All reports covering big ranges (like a month, or all customers) fall here;
3. if you need an ordered list of values and there's an index on them, an index scan is the fastest option. This is a special case of #2, when you need report data ordered by some column;
4. if the number of distinct values in the column is relatively small compared with the total number of values, an index is a good choice. This case is called a Loose Index Scan, and typical queries look like: I need the 20 most recent purchases for each of the top 5 categories by number of goods.

How does the DB decide what to do, index scan or full scan? This is a runtime decision based on statistics, so make sure to keep those up to date. In fact, the numbers given above have no real-life value; you have to evaluate each query independently.
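Keeping statistics current and inspecting what the planner knows about a column can be sketched like this. The table name general_journal is borrowed from another answer on this page purely for illustration; autovacuum normally runs ANALYZE on its own, so a manual run matters mostly after bulk loads.

```sql
-- Refresh the planner statistics for one table.
ANALYZE general_journal;

-- Per-column statistics the planner uses to estimate selectivity.
-- A negative n_distinct is a fraction of the row count rather than an
-- absolute number; a correlation near +1 or -1 means the physical row
-- order closely follows the column order.
SELECT attname, null_frac, n_distinct, correlation
FROM pg_stats
WHERE tablename = 'general_journal';
```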

All this is a very rough description of what happens. I would very much recommend looking into How PostgreSQL Planner Uses Statistics; it is the best I've seen on the subject.
