TL;DR: the best query and index turned out to be:
create index uniqueFiles on resume_points (scan_file_id);
select * from (select distinct scan_file_id from resume_points) d join files on d.scan_file_id = files.id and files.dirty = 1;
Since I normally work with SQL Server, at first I assumed the query optimizer would find the best execution plan for such a simple query no matter which of these equivalent SQL statements I wrote. So I downloaded SQLite and started playing with it. To my surprise, the performance differences were huge.
Here is the setup code:
CREATE TABLE files (
id INTEGER PRIMARY KEY AUTOINCREMENT,
dirty INTEGER NOT NULL);
CREATE TABLE resume_points (
id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
scan_file_id INTEGER NOT NULL);
insert into files (dirty) values (0);
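-- each of the next 23 inserts doubles the files table, ending with 2^23 = 8,388,608 rows;
-- random() returns a signed 64-bit integer, so random() < 0 marks roughly half the rows dirty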
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
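-- note: the uncorrelated scalar subquery below yields a single value per statement,
-- so each insert adds 5000 rows sharing one random scan_file_id; resume_points ends up
-- with 10000 rows but only two distinct values (the 2-row reduction discussed below)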
insert into resume_points (scan_file_id) select (select abs(random() % 8000000)) from files limit 5000;
insert into resume_points (scan_file_id) select (select abs(random() % 8000000)) from files limit 5000;
I considered the following indexes:
create index dirtyFiles on files (dirty, id);
create index uniqueFiles on resume_points (scan_file_id);
create index fileLookup on files (id);
Here are the queries I tried and the execution times on my i5 laptop. The database file is only about 200MB, since it doesn't contain any other data.
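One convenient way to take this kind of measurement is the sqlite3 command-line shell's built-in timer (shown here only as a sketch; your numbers will vary with hardware and cache state):

.timer on
-- run the query of interest a few times; the shell prints a
-- "Run Time: real ... user ... sys ..." line after each statement
select * from (select distinct scan_file_id from resume_points) d join files on d.scan_file_id = files.id and files.dirty = 1;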
select distinct files.* from resume_points inner join files on resume_points.scan_file_id=files.id where files.dirty = 1;
4.3 - 4.5 ms with and without index
select distinct files.* from files inner join resume_points on files.id=resume_points.scan_file_id where files.dirty = 1;
4.4 - 4.7 ms with and without index
select * from (select distinct scan_file_id from resume_points) d join files on d.scan_file_id = files.id and files.dirty = 1;
2.0 - 2.5ms with uniqueFiles
2.6 - 2.9 ms without uniqueFiles
select * from files where id in (select distinct scan_file_id from resume_points) and dirty = 1;
2.1 - 2.5ms with uniqueFiles
2.6 - 3 ms without uniqueFiles
SELECT f.* FROM resume_points rp INNER JOIN files f on rp.scan_file_id = f.id
WHERE f.dirty = 1 GROUP BY f.id;
4500 - 6190 ms with uniqueFiles
8.8 - 9.5 ms without uniqueFiles
14000 ms with uniqueFiles and fileLookup
select * from files where exists (
select * from resume_points where files.id = resume_points.scan_file_id) and dirty = 1;
8400 ms with uniqueFiles
7400 ms without uniqueFiles
It looks like SQLite's query optimizer isn't all that advanced. The best queries first reduce resume_points to a small number of rows (two in the test case; the OP said it would be 1-2), and only then look up each file to see whether it is dirty.

The dirtyFiles index didn't make much of a difference for any of the queries. I think that may be because of the way the data is arranged in the test tables; it might matter on production tables, but even there the difference shouldn't be large, since there will be fewer than a handful of lookups.

uniqueFiles is the index that does make a difference, because it lets the 10000 rows of resume_points be reduced to 2 rows without scanning most of them.

fileLookup did make some queries slightly faster, but not enough to change the results significantly. Notably, it made the group by very slow.

In summary, reduce the result set as early as possible to make the biggest difference.
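If you want to verify what the planner actually does for the winning query, SQLite's EXPLAIN QUERY PLAN is a quick sanity check (a sketch; the exact wording of the output differs between SQLite versions):

explain query plan
select * from (select distinct scan_file_id from resume_points) d
join files on d.scan_file_id = files.id and files.dirty = 1;
-- expected shape: the subquery is run as a co-routine or materialized first,
-- scanning resume_points through the covering index uniqueFiles for the DISTINCT,
-- and files is then probed by its INTEGER PRIMARY KEY (rowid)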