I have the following query:
select
    t.Chunk as LeftChunk,
    t.ChunkHash as LeftChunkHash,
    q.Chunk as RightChunk,
    q.ChunkHash as RightChunkHash,
    count(t.ChunkHash) as ChunkCount
from chunks as t
join chunks as q
    on t.ID = q.ID
group by LeftChunkHash, RightChunkHash
and the following EXPLAIN output:
id  select_type  table  type  possible_keys  key      key_len  ref          rows      Extra
1   SIMPLE       t      ALL   IDIndex        NULL     NULL     NULL         17796190  Using temporary; Using filesort
1   SIMPLE       q      ref   IDIndex        IDIndex  4        sotero.t.Id  12
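Note the "Using temporary; Using filesort".
When I run this query, I quickly run out of RAM (probably because of the temporary table), then the hard disk kicks in and the query slows to a halt.
I thought it might be an index issue, so I started adding indexes that seemed to make sense: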
Table   Non_unique  Key_name                   Seq_in_index  Column_name  Collation  Cardinality  Sub_part  Packed  Null  Index_type  Comment  Index_comment
chunks  0           PRIMARY                    1             ChunkId      A          17796190     NULL      NULL          BTREE
chunks  1           ChunkHashIndex             1             ChunkHash    A          243783       NULL      NULL          BTREE
chunks  1           IDIndex                    1             Id           A          1483015      NULL      NULL          BTREE
chunks  1           ChunkIndex                 1             Chunk        A          243783       NULL      NULL          BTREE
chunks  1           ChunkTypeIndex             1             ChunkType    A          2            NULL      NULL          BTREE
chunks  1           chunkHashByChunkIDIndex    1             ChunkHash    A          243783       NULL      NULL          BTREE
chunks  1           chunkHashByChunkIDIndex    2             ChunkId      A          17796190     NULL      NULL          BTREE
chunks  1           chunkHashByChunkTypeIndex  1             ChunkHash    A          243783       NULL      NULL          BTREE
chunks  1           chunkHashByChunkTypeIndex  2             ChunkType    A          261708       NULL      NULL          BTREE
chunks  1           chunkHashByIDIndex         1             ChunkHash    A          243783       NULL      NULL          BTREE
chunks  1           chunkHashByIDIndex         2             Id           A          17796190     NULL      NULL          BTREE
But the query still uses a temporary table.
The database engine is MyISAM.
How can I get rid of the "Using temporary; Using filesort" in this query?
Just changing to InnoDB without explaining the underlying cause is not a particularly satisfying answer. Besides, if the solution is just to add the proper index, that is much easier than migrating to another database engine.
I am new to relational databases, so I'm hoping that the solution is something obvious to the experts.
EDIT1:
ID is not the primary key. ChunkID is. There are approximately 40 ChunkIDs for each ID. So adding an additional ID to the table adds about 40 rows. Each unique chunk has a unique chunkHash associated with it.
EDIT2:
Here's the schema:
Field      Type          Null  Key  Default  Extra
ChunkId    int(11)       NO    PRI  NULL
ChunkHash  int(11)       NO    MUL  NULL
Id         int(11)       NO    MUL  NULL
Chunk      varchar(255)  NO    MUL  NULL
ChunkType  varchar(255)  NO    MUL  NULL
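EDIT3:
The ultimate goal of the query is to build a table of word co-occurrences across documents. A ChunkID is a word instance. Each instance is a word associated with a specific document (ID). There are about 40 words per document and about 1 million documents. So the resulting co-occurrence table is highly compressed compared to the full cross-product temporary table that is (apparently) being created. That is, the full cross-product temporary table is 1 million * 40 * 40 = 1.6 billion rows, while the compressed result table is estimated at around 40 million rows.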
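As a rough sanity check on that cross-product estimate, something like the following should count the rows the self-join produces before grouping, without materializing them. This is only a sketch: it assumes the table and column names from the schema above, and PerDoc, n, Documents, ChunkRows and JoinedRows are aliases made up for illustration.

```sql
-- Sketch: size of the self-join on Id, computed per document.
-- A document with n chunk rows contributes n * n joined rows.
select
    count(*)   as Documents,   -- number of distinct Id values
    sum(n)     as ChunkRows,   -- total rows in chunks
    sum(n * n) as JoinedRows   -- rows produced by joining chunks to itself on Id
from (
    select Id, count(*) as n
    from chunks
    group by Id
) as PerDoc;
```

If there really are about 1 million documents with about 40 chunks each, JoinedRows should come out somewhere near 1.6 billion. The inner group by can be answered from the index on Id alone, so this should be much cheaper than the full join.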
EDIT4:
Adding the postgresql tag to see whether any PostgreSQL users can get a better execution plan on that SQL implementation. If so, I'll switch.