caching - Subqueries and MySQL cache for an 18M+ row table

Question

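As this is my first post, it seems I can only post 1 hyperlink, so I have listed the sites I'm referring to at the bottom. In short, my goal is to make the database return results faster. I have tried to include as much relevant information as I could to help frame the questions at the bottom of the post.

Machine Info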

8 processors
model name      : Intel(R) Xeon(R) CPU           E5440  @ 2.83GHz
cache size      : 6144 KB
cpu cores       : 4 

top - 17:11:48 up 35 days, 22:22, 10 users,  load average: 1.35, 4.89, 7.80
Tasks: 329 total,   1 running, 328 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.0%sy,  0.0%ni, 87.4%id, 12.5%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8173980k total,  5374348k used,  2799632k free,    30148k buffers
Swap: 16777208k total,  6385312k used, 10391896k free,  2615836k cached

However, we are looking at moving the MySQL installation to a different machine in the cluster that has 256 GB of RAM.

Table Info

My MySQL table looks like

CREATE TABLE ClusterMatches 
(
    id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    cluster_index INT, 
    matches LONGTEXT,
    tfidf FLOAT,
    INDEX(cluster_index)   
);

It has about 18M rows, with 1M unique cluster_index values and 6K unique matches values. The SQL query I am generating in PHP looks like this.

SQL Query

$sql_query="SELECT `matches`,sum(`tfidf`) FROM 
(SELECT * FROM Test2_ClusterMatches WHERE `cluster_index` in (".$clusters.")) 
AS result GROUP BY `matches` ORDER BY sum(`tfidf`) DESC LIMIT 0, 10;";

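where $clusters contains a string of about 3,000 comma-separated cluster_index values. This query touches about 50,000 rows and takes about 15 seconds to run; when the same query is run again, it takes about 1 second.

Usage

The content of the table can be assumed to be static.
Low number of concurrent users
The query above is currently the only query that will be run on the table

Subquery

Based on this post [stackoverflow: Cache/Re-Use a Subquery in MySQL][1] and the improvement in query time, I believe my subquery can be indexed.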
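The PHP snippet above splices the $clusters string directly into the SQL text. As a minimal sketch (in Python rather than PHP, so it can be run standalone), the comma-separated list could be assembled from integer ids like this. The helper name `build_in_list` is hypothetical and not part of the original post; the table name is copied from the PHP query.

```python
def build_in_list(cluster_ids):
    """Return a comma-separated string of integer ids for an IN (...) clause.

    Hypothetical helper: int() raises ValueError on non-numeric text,
    so arbitrary strings cannot be spliced into the SQL.
    """
    ids = sorted({int(c) for c in cluster_ids})  # de-duplicate and order the ids
    if not ids:
        raise ValueError("IN () needs at least one id")
    return ",".join(str(i) for i in ids)


clusters = build_in_list(["3", 1, 2, 2])
sql_query = (
    "SELECT `matches`, SUM(`tfidf`) FROM "
    "(SELECT * FROM Test2_ClusterMatches WHERE `cluster_index` IN (" + clusters + ")) "
    "AS result GROUP BY `matches` ORDER BY SUM(`tfidf`) DESC LIMIT 0, 10;"
)
print(sql_query)
```

De-duplicating the ids also keeps the IN list as short as possible, which matters a little when it holds about 3,000 entries.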

mysql> EXPLAIN EXTENDED SELECT `matches`,sum(`tfidf`) FROM 
(SELECT * FROM ClusterMatches WHERE `cluster_index` in (1,2,...,3000)) 
AS result GROUP BY `matches` ORDER BY sum(`tfidf`) ASC LIMIT 0, 10;

+----+-------------+----------------------+-------+---------------+---------------+---------+------+-------+---------------------------------+
| id | select_type | table                | type  | possible_keys | key           | key_len | ref  | rows  | Extra                           |
+----+-------------+----------------------+-------+---------------+---------------+---------+------+-------+---------------------------------+
|  1 | PRIMARY     | <derived2>           | ALL   | NULL          | NULL          | NULL    | NULL | 48528 | Using temporary; Using filesort | 
|  2 | DERIVED     | ClusterMatches       | range | cluster_index | cluster_index | 5       | NULL | 53689 | Using where                     | 
+----+-------------+----------------------+-------+---------------+---------------+---------+------+-------+---------------------------------+

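According to this older article [Optimizing MySQL: Queries and Indexes][2], in the Extra info the bad ones to see here are "Using temporary" and "Using filesort".

MySQL Configuration Info

Query cache is available, but effectively turned off, as the size is currently set to zero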


mysqladmin variables;
+---------------------------------+----------------------+
| Variable_name                   | Value                |
+---------------------------------+----------------------+
| bdb_cache_size                  | 8384512              | 
| binlog_cache_size               | 32768                | 
| expire_logs_days                | 0                    |
| have_query_cache                | YES                  | 
| flush                           | OFF                  |
| flush_time                      | 0                    |
| innodb_additional_mem_pool_size | 1048576              |
| innodb_autoextend_increment     | 8                    |
| innodb_buffer_pool_awe_mem_mb   | 0                    |
| innodb_buffer_pool_size         | 8388608              |
| join_buffer_size                | 131072               |
| key_buffer_size                 | 8384512              |
| key_cache_age_threshold         | 300                  |
| key_cache_block_size            | 1024                 |
| key_cache_division_limit        | 100                  |
| max_binlog_cache_size           | 18446744073709547520 | 
| sort_buffer_size                | 2097144              |
| table_cache                     | 64                   | 
| thread_cache_size               | 0                    | 
| query_cache_limit               | 1048576              |
| query_cache_min_res_unit        | 4096                 |
| query_cache_size                | 0                    |
| query_cache_type                | ON                   |
| query_cache_wlock_invalidate    | OFF                  |
| read_rnd_buffer_size            | 262144               |
+---------------------------------+----------------------+

Based on this article on [MySQL database performance tuning][3], I think the values I need to tweak are

table_cache
key_buffer
sort_buffer
read_buffer_size
record_rnd_buffer (for GROUP BY and ORDER BY terms)

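Areas identified for improvement - MySQL query tweaks

Changing the datatype for matches to an int index that points to another table ["MySQL will indeed use a dynamic row format if it contains variable length fields like TEXT or BLOB, which, in this case, means sorting needs to be done on disk. The solution is not to eschew these datatypes, but rather to split off such fields into an associated table."][4]
Indexing the new match_index field so that the GROUP BY matches occurs faster, based on the statement ["You should probably create indices for any field on which you are selecting, grouping, ordering, or joining."][5]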
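The two changes listed above can be sketched on toy data. The following uses Python's built-in sqlite3 module only so that it runs standalone; SQLite's storage and planner differ from MySQL, so it illustrates the table layout, not the performance effect. The names `MatchLookup`, `ClusterMatchesNarrow`, and `idx_match` are illustrative assumptions, as are the sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Original layout: the text value is repeated on every row.
cur.execute(
    "CREATE TABLE ClusterMatches "
    "(id INTEGER PRIMARY KEY, cluster_index INT, matches TEXT, tfidf FLOAT)"
)
cur.executemany(
    "INSERT INTO ClusterMatches (cluster_index, matches, tfidf) VALUES (?, ?, ?)",
    [(1, "a.jpg", 0.5), (1, "b.jpg", 0.25), (2, "a.jpg", 0.125)],
)

# Lookup table: one row per distinct matches value, keyed by an integer.
cur.execute(
    "CREATE TABLE MatchLookup (match_index INTEGER PRIMARY KEY, matches TEXT UNIQUE)"
)
cur.execute(
    "INSERT INTO MatchLookup (matches) SELECT DISTINCT matches FROM ClusterMatches"
)

# Narrow table: fixed-width columns only, referencing the lookup by match_index.
cur.execute(
    "CREATE TABLE ClusterMatchesNarrow "
    "(id INTEGER PRIMARY KEY, cluster_index INT, match_index INT, tfidf FLOAT)"
)
cur.execute(
    "INSERT INTO ClusterMatchesNarrow (id, cluster_index, match_index, tfidf) "
    "SELECT c.id, c.cluster_index, m.match_index, c.tfidf "
    "FROM ClusterMatches c JOIN MatchLookup m ON m.matches = c.matches"
)
cur.execute("CREATE INDEX idx_match ON ClusterMatchesNarrow (match_index)")

rows = cur.execute(
    "SELECT cluster_index, match_index, tfidf FROM ClusterMatchesNarrow ORDER BY id"
).fetchall()
lookup_count = cur.execute("SELECT COUNT(*) FROM MatchLookup").fetchone()[0]
print(rows, lookup_count)
```

Grouping then happens on a 4-byte integer, and the text is fetched from the lookup table only for the handful of rows that are finally returned.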

Tools

To tweak performance I plan to use

[EXPLAIN][6], making reference to [the output format][7]
[ab - Apache HTTP server benchmarking tool][8]
[Profiling][9] with [log data][10]

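Future database size

The goal is to build a system that can have 1M unique cluster_index values, 1M unique match values, and approximately 3,000,000,000 table rows, with a response time to the query of around 0.5 s (we can add more RAM as necessary and distribute the database across the cluster).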
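A rough back-of-the-envelope estimate of what the target row count implies for memory, as a sketch under stated assumptions: it counts only the three fixed-width columns of the table above (MySQL INT and FLOAT are 4 bytes each) and ignores the matches text, indexes, and per-row storage overhead, so the real footprint will be larger.

```python
ROWS = 3_000_000_000
INT_BYTES = 4    # MySQL INT
FLOAT_BYTES = 4  # MySQL FLOAT (single precision)

# id + cluster_index are INT, tfidf is FLOAT; matches is deliberately excluded.
fixed_bytes_per_row = 2 * INT_BYTES + FLOAT_BYTES
total_bytes = ROWS * fixed_bytes_per_row
total_gib = total_bytes / 2**30

print(f"{fixed_bytes_per_row} bytes/row, {total_bytes:,} bytes, {total_gib:.1f} GiB")
```

At 12 bytes per row this comes to 36,000,000,000 bytes for the fixed-width data alone, which is the lower bound to compare against the 256 GB machine mentioned earlier.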

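Questions

I think we want to keep the entire recordset in RAM so that the query doesn't touch the disk. If we keep the entire database in the MySQL cache, does that eliminate the need for memcachedb?
Is trying to keep the entire database in the MySQL cache a bad strategy, as it is not designed to be persistent? Would something like memcachedb or redis be a better approach, and if so, why?
Is the temporary table "result" that is created by the query automatically destroyed when the query finishes?
Should we switch from InnoDB to MyISAM, [as it is good for read-heavy data whereas InnoDB is good for write-heavy data][11]?
My query cache does not appear to be on, as its size is zero in my [query cache configuration][12], so why does the query currently run faster the second time I run it?
Can I restructure my query to eliminate "Using temporary" and "Using filesort", and should I use a join instead of a subquery?
How do you view the size of the MySQL [Data Cache][13]?
What sizes for the values table_cache, key_buffer, sort_buffer, read_buffer_size, record_rnd_buffer would you suggest as a starting point?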

Links

1: stackoverflow.com/questions/658937/cache-re-use-a-subquery-in-mysql
2: databasejournal.com/features/mysql/article.php/10897_1382791_4/Optimizing-MySQL-Queries-and-Indexes.htm
3: debianhelp.co.uk/mysqlperformance.htm
4: 20bits.com/articles/10-tips-for-optimizing-mysql-queries-that-dont-suck/
5: 20bits.com/articles/10-tips-for-optimizing-mysql-queries-that-dont-suck/
6: dev.mysql.com/doc/refman/5.0/en/explain.html
7: dev.mysql.com/doc/refman/5.0/en/explain-output.html
8: httpd.apache.org/docs/2.2/programs/ab.html
9: mtop.sourceforge.net/
10: dev.mysql.com/doc/refman/5.0/en/slow-query-log.html
11: 20bits.com/articles/10-tips-for-optimizing-mysql-queries-that-dont-suck/
12: dev.mysql.com/doc/refman/5.0/en/query-cache-configuration.html
13: dev.mysql.com/tech-resources/articles/mysql-query-cache.html

score 1 · Accepted Answer

Changing the table

Based on the advice in this post on How to pick indexes for order by and group by queries, the table now looks like

CREATE TABLE ClusterMatches 
(
    cluster_index INT UNSIGNED, 
    match_index INT UNSIGNED,
    id INT NOT NULL AUTO_INCREMENT,
    tfidf FLOAT,
    PRIMARY KEY (match_index,cluster_index,id,tfidf)
);
CREATE TABLE MatchLookup 
(
    match_index INT UNSIGNED NOT NULL PRIMARY KEY,
    image_match TINYTEXT
);

Eliminating Subquery

The query without sorting the results by the SUM(tfidf) looks like

SELECT match_index, SUM(tfidf) FROM ClusterMatches 
WHERE cluster_index in (1,2,3 ... 3000) GROUP BY match_index LIMIT 10;

This eliminates "Using temporary" and "Using filesort":

explain extended SELECT match_index, SUM(tfidf) FROM ClusterMatches 
WHERE cluster_index in (1,2,3 ... 3000) GROUP BY match_index LIMIT 10;
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+--------------------------+
| id | select_type | table                | type  | possible_keys | key     | key_len | ref  | rows  | Extra                    |
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+--------------------------+
|  1 | SIMPLE      | ClusterMatches       | range | PRIMARY       | PRIMARY | 4       | NULL | 14938 | Using where; Using index | 
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+--------------------------+
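To check that dropping the derived table does not change the result, here is a small sketch on toy data using Python's built-in sqlite3 module. The rows are made up for illustration, and SQLite's planner differs from MySQL's, so this only compares result sets; it says nothing about "Using temporary" or "Using filesort".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ClusterMatches (cluster_index INT, match_index INT, tfidf FLOAT)"
)
conn.executemany(
    "INSERT INTO ClusterMatches VALUES (?, ?, ?)",
    [(1, 10, 0.5), (2, 10, 0.25), (2, 20, 0.125), (9, 20, 4.0)],
)

# Form from the question: filter in a derived table, then group.
derived = conn.execute(
    "SELECT match_index, SUM(tfidf) FROM "
    "(SELECT * FROM ClusterMatches WHERE cluster_index IN (1, 2)) AS result "
    "GROUP BY match_index ORDER BY match_index"
).fetchall()

# Form from this answer: filter and group in a single SELECT.
flat = conn.execute(
    "SELECT match_index, SUM(tfidf) FROM ClusterMatches "
    "WHERE cluster_index IN (1, 2) GROUP BY match_index ORDER BY match_index"
).fetchall()

print(derived == flat, flat)
```

The row with cluster_index 9 is excluded by the WHERE clause in both forms, so both return the same per-match sums.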

Sorting Problem

However, if I add the ORDER BY SUM(tfidf) in

SELECT match_index, SUM(tfidf) AS total FROM ClusterMatches
WHERE cluster_index in (1,2,3 ... 3000) GROUP BY match_index 
ORDER BY total DESC LIMIT 0,10;
+-------------+--------------------+
| match_index | total              |
+-------------+--------------------+
|         868 |   0.11126546561718 | 
|        4182 | 0.0238558370620012 | 
|        2162 | 0.0216601379215717 | 
|        1406 | 0.0191618576645851 | 
|        4239 | 0.0168981291353703 | 
|        1437 | 0.0160425212234259 | 
|        2599 | 0.0156466849148273 | 
|         394 | 0.0155945559963584 | 
|        3116 | 0.0151005545631051 | 
|        4028 | 0.0149106932803988 | 
+-------------+--------------------+
10 rows in set (0.03 sec)

The result is suitably fast at this scale, but having the ORDER BY SUM(tfidf) means it uses temporary and filesort

explain extended SELECT match_index, SUM(tfidf) AS total FROM ClusterMatches 
WHERE cluster_index IN (1,2,3 ... 3000) GROUP BY match_index 
ORDER BY total DESC LIMIT 0,10;
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+-----------------------------------------------------------+
| id | select_type | table                | type  | possible_keys | key     | key_len | ref  | rows  | Extra                                                     |
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+-----------------------------------------------------------+
|  1 | SIMPLE      | ClusterMatches       | range | PRIMARY       | PRIMARY | 4       | NULL | 65369 | Using where; Using index; Using temporary; Using filesort | 
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+-----------------------------------------------------------+

Possible Solutions?

I'm looking for a solution that doesn't use temporary or filesort, along the lines of

SELECT match_index, SUM(tfidf) AS total FROM ClusterMatches 
WHERE cluster_index IN (1,2,3 ... 3000) GROUP BY cluster_index, match_index 
HAVING total>0.01 ORDER BY cluster_index;

where I don't need to hardcode a threshold for total. Any ideas?
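One possible direction, offered as an assumption rather than a tested solution: run the GROUP BY query without the ORDER BY (the form whose EXPLAIN showed neither temporary nor filesort) and pick the top 10 in the application. This moves the sort to the client instead of removing it, and every grouped row has to be transferred, so it only makes sense while the number of distinct match_index values stays small. The function name `top_matches` is hypothetical, and the sample totals are rounded from the result shown earlier.

```python
import heapq


def top_matches(grouped_rows, k=10):
    """Return the k (match_index, total) pairs with the largest totals.

    grouped_rows is any iterable of (match_index, total) pairs, for example
    the rows fetched from the GROUP BY query without ORDER BY.
    """
    return heapq.nlargest(k, grouped_rows, key=lambda row: row[1])


# Toy input in match_index order, as the grouped query would return it.
grouped = [(394, 0.0156), (868, 0.1113), (1406, 0.0192), (2162, 0.0217), (4182, 0.0239)]
best = top_matches(grouped, k=3)
print(best)
```

heapq.nlargest keeps only k candidates in memory while scanning, so no threshold on total has to be chosen in advance.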