sql - faster way to use sets in MySQL

Question

I have a MySQL 5.1 InnoDB table (customers) with the following structure:

int         record_id (PRIMARY KEY)
int         user_id (ALLOW NULL)
varchar[11] postcode (ALLOW NULL)
varchar[30] region (ALLOW NULL)
..
..
..

There are roughly 7 million rows in the table. Currently, the table is being queried like this:

SELECT * FROM customers WHERE user_id IN (32343, 45676, 12345, 98765, 66010, ...

In the actual query, there are currently over 560 user_ids in the IN clause. With several million records in the table, this query is slow!

There are secondary indexes on the table, the first of which is on user_id itself, which I thought would help.

I know that SELECT * is A Bad Thing and this will be expanded to the full list of fields required. However, the fields not listed above are more ints and doubles. There are another 50 of those being returned, but they are needed for the report.

I imagine there's a much better way to access the data for the user_ids, but I can't think how to do it. My initial reaction is to remove the ALLOW NULL on the user_id field, as I understand NULL handling slows down queries?

I'd be very grateful if you could point me in a more efficient direction than using the IN ( ) method.

EDIT: I ran EXPLAIN, which said:

select_type = SIMPLE 
table = customers 
type = range 
possible_keys = userid_idx 
key = userid_idx 
key_len = 5 
ref = (NULL) 
rows = 637640 
Extra = Using where

Does that help?

score 3 · Accepted Answer

First, check if there is an index on USER_ID and make sure it's used.

You can do this by running EXPLAIN on the query.

Second, create a temporary table and use it in a JOIN:

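CREATE TEMPORARY TABLE temptable (user_id INT NOT NULL);

-- load the IDs that used to be in the IN list (only the five shown in the question)
INSERT INTO temptable (user_id) VALUES (32343), (45676), (12345), (98765), (66010);

SELECT  *
FROM    temptable t
JOIN    customers c
ON      c.user_id = t.user_id;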

Third, how many rows does your query return?

If it returns almost all of the rows, it will simply be slow, since it has to pump all those millions of rows over the connection to begin with.

NULL will not slow your query down, since an IN condition can only be satisfied by non-NULL values, which are indexed.

Update:

The index is used and the plan is fine, except that it returns more than half a million rows.

Do you really need to put all these 638,000 rows into the report?

Hope it's not printed: bad for rainforests, global warming and stuff.

Speaking seriously, you seem to need either aggregation or pagination on your query.
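
As a rough illustration of those two options, here is a minimal sketch against the customers table from the question. The short ID list, the row_count alias and the page size of 1000 are assumptions made for the example, not anything the asker specified, and whether either form is fast enough should be checked with EXPLAIN on real data.

```sql
-- Aggregation: return one summary row per user instead of every matching row.
SELECT   c.user_id, COUNT(*) AS row_count
FROM     customers c
WHERE    c.user_id IN (32343, 45676, 12345)
GROUP BY c.user_id;

-- Pagination: fetch the matching rows 1000 at a time, ordered by the primary key.
-- For the next page, replace 0 with the last record_id seen on the previous page.
SELECT   c.*
FROM     customers c
WHERE    c.user_id IN (32343, 45676, 12345)
AND      c.record_id > 0
ORDER BY c.record_id
LIMIT    1000;
```

Paging on record_id rather than using OFFSET avoids re-reading the rows that earlier pages already returned.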

score 2 · Accepted Answer

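"SELECT *" is not as bad as some people think. Row-based databases fetch the entire row if they fetch any of it, so in situations where you are not using a covering index, "SELECT *" is essentially no slower than "SELECT a,b,c" (NB: there is sometimes an exception when you have large BLOBs, but that is an edge case).

First things first: does your database fit in RAM? If not, get more RAM. No, seriously. Now, supposing your database is too big to reasonably fit in RAM (say, > 32 GB), you should try to reduce the number of random I/Os, as they are probably what is holding things up.

From here on I will assume that you are running proper server-grade hardware with a RAID controller in RAID1 (or RAID10 etc.) and at least two spindles. If you are not, go away and get that.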

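You could definitely consider using a clustered index. In MySQL InnoDB you can only cluster on the primary key, which means that if something else is currently the primary key, you will have to change it. Composite primary keys are fine, and if you are doing a lot of queries on one criterion (say user_id), it is a definite benefit to make it the first part of the primary key (you will need to add something else to make it unique).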
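
A sketch of what re-clustering the question's table on user_id might look like. It assumes that record_id is unique and that rows with a NULL user_id have already been dealt with, since primary key columns cannot be NULL in MySQL. The key name record_id_idx is made up for the example. On a table of this size the statement rebuilds the whole table, so try it on a copy first.

```sql
ALTER TABLE customers
    -- every primary key column must be NOT NULL
    MODIFY user_id INT NOT NULL,
    DROP PRIMARY KEY,
    -- user_id first, so rows for the same user are stored next to each other;
    -- record_id is appended only to make the key unique
    ADD PRIMARY KEY (user_id, record_id),
    -- keep record_id unique and indexed (needed if it is an AUTO_INCREMENT column)
    ADD UNIQUE KEY record_id_idx (record_id);
```

After a change like this, the existing userid_idx secondary index would probably be redundant, because the primary key already begins with user_id.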

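Alternatively, you might be able to make your query use a covering index, in which case you do not need user_id to be the primary key (in fact, it must not be). This will only happen if all of the columns you need are in an index that begins with user_id.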
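
A minimal sketch of a covering index, using only the three columns named in the question. The index name user_cover_idx and the short ID list are assumptions for the example. MySQL allows at most 16 columns in one index, so with the 50-odd extra columns the report needs, this approach may not be practical for the asker's real query.

```sql
-- The index must contain every column the query reads, with user_id first.
CREATE INDEX user_cover_idx ON customers (user_id, postcode, region);

-- Only indexed columns are selected, so the query can be answered from the
-- index without visiting the table rows. EXPLAIN should then report
-- "Using index" in the Extra column.
SELECT user_id, postcode, region
FROM   customers
WHERE  user_id IN (32343, 45676, 12345);
```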

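As far as query efficiency is concerned, WHERE user_id IN (big list of IDs) is almost certainly the most efficient way of doing it from SQL.

But my biggest tips are:

Have a goal in mind, work out what it is, and when you reach it, stop.
Don't take anybody's word for it: try it and see.
Ensure that your performance test system has the same hardware spec as production.
Ensure that your performance test system has the same data size and kind as production (the same schema is not good enough!).
Use synthetic data if it is not possible to use production data (copying production data may be logistically difficult, remember your database is > 32 GB, and it may also violate security policies).
If your query is optimal (as it probably already is), try tuning the schema, then the database itself.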

score 1 · Accepted Answer

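Are they the same ~560 ids every time? Or is it a different ~500 ids on different runs of the query?

You could just insert your 560 user IDs into a separate table (or even a temporary table), put an index on that table and inner join it to your original table.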

score 1 · Accepted Answer

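Is this your most important query? Is this a transactional table?

If so, try creating a clustered index on user_id. Your query might be slow because it still has to make random disk reads to retrieve the columns (key lookups), even after finding the matching records (index seek on the user_id index).

If you cannot change the clustered index, then you might want to consider an ETL process (the simplest is a trigger that inserts into another table with the best indexing). This should yield faster results.

Also note that such a large query may take some time to parse, so help it out by putting the queried ids into a temporary table if possible.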
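
The trigger idea could be sketched roughly as follows. The table name customers_by_user and the trigger name customers_after_insert are hypothetical, and only the columns listed in the question are copied. A real version would need every column the report reads, plus matching UPDATE and DELETE triggers and a one-off backfill of existing rows.

```sql
-- Hypothetical reporting copy, clustered on user_id.
CREATE TABLE customers_by_user (
    user_id   INT NOT NULL,
    record_id INT NOT NULL,
    postcode  VARCHAR(11),
    region    VARCHAR(30),
    PRIMARY KEY (user_id, record_id)
) ENGINE=InnoDB;

-- DELIMITER is a mysql command-line client directive, not server SQL.
DELIMITER //
CREATE TRIGGER customers_after_insert
AFTER INSERT ON customers
FOR EACH ROW
BEGIN
    -- Rows without a user_id can never match the IN list, so skip them.
    IF NEW.user_id IS NOT NULL THEN
        INSERT INTO customers_by_user (user_id, record_id, postcode, region)
        VALUES (NEW.user_id, NEW.record_id, NEW.postcode, NEW.region);
    END IF;
END//
DELIMITER ;
```

The report query would then read from customers_by_user, at the cost of slower inserts on customers and extra disk space.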

score 0 · Accepted Answer

You can try inserting the ids you need to query on into a temporary table and inner joining the two tables. I don't know if that would help.
