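I have asked two related questions ("How can I speed up fetching the results after running an sqlite query?" and "Is it normal that sqlite.fetchall() is so slow?"). I have changed some things around and got some speed-up, but the select statement still takes over an hour to finish.
I have a table feature which contains rtMin, rtMax, mzMin and mzMax values. Together these values are the corners of a rectangle (if you read my old questions: I now save these values separately instead of getting min() and max() from the convexhull table, which works faster).
I also have a table spectrum with an rt and an mz value. A third table links features to spectra when the rt and mz values of the spectrum fall inside the rectangle of the feature.
To do this I use the following SQL and Python code to retrieve the ids of the spectrum and feature: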
self.cursor.execute("SELECT spectrum_id, feature_table_id "+
                    "FROM `spectrum` "+
                    "INNER JOIN `feature` "+
                    "ON feature.msrun_msrun_id = spectrum.msrun_msrun_id "+
                    "WHERE spectrum.scan_start_time >= feature.rtMin "+
                    "AND spectrum.scan_start_time <= feature.rtMax "+
                    "AND spectrum.base_peak_mz >= feature.mzMin "+
                    "AND spectrum.base_peak_mz <= feature.mzMax")
spectrumAndFeature_ids = self.cursor.fetchall()

for spectrumAndFeature_id in spectrumAndFeature_ids:
    spectrum_has_feature_inputValues = (spectrumAndFeature_id[0], spectrumAndFeature_id[1])
    self.cursor.execute("INSERT INTO `spectrum_has_feature` VALUES (?,?)", spectrum_has_feature_inputValues)
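One variation that avoids moving the rows through Python at all is to let SQLite write the matches itself with INSERT ... SELECT. The following is only a minimal sketch under assumptions: the tables are cut down to the columns the join uses, the rows are made up, and I have not measured whether this changes the runtime on the real database.

```python
import sqlite3

# Toy in-memory schema: only the columns the join touches (assumption).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE feature (feature_table_id INT PRIMARY KEY NOT NULL,
                          mzMin DOUBLE, mzMax DOUBLE,
                          rtMin DOUBLE, rtMax DOUBLE,
                          msrun_msrun_id INT NOT NULL);
    CREATE TABLE spectrum (spectrum_id INT PRIMARY KEY NOT NULL,
                           base_peak_mz DOUBLE NOT NULL,
                           scan_start_time DOUBLE NOT NULL,
                           msrun_msrun_id INT NOT NULL);
    CREATE TABLE spectrum_has_feature (spectrum_spectrum_id INT NOT NULL,
                                       feature_feature_table_id INT NOT NULL);
""")

# Made-up rows: (id, mzMin, mzMax, rtMin, rtMax, msrun) for features.
cur.executemany("INSERT INTO feature VALUES (?,?,?,?,?,?)",
                [(1, 100.0, 200.0, 10.0, 20.0, 1),
                 (2, 300.0, 400.0, 10.0, 20.0, 1)])
# Made-up rows: (id, base_peak_mz, scan_start_time, msrun) for spectra.
cur.executemany("INSERT INTO spectrum VALUES (?,?,?,?)",
                [(1, 150.0, 15.0, 1),    # inside the rectangle of feature 1
                 (2, 350.0, 25.0, 1)])   # rt lies outside every feature

# Same join and conditions as the original query, but the result rows
# are inserted by SQLite directly instead of going through fetchall().
cur.execute("""
    INSERT INTO spectrum_has_feature
    SELECT spectrum_id, feature_table_id
    FROM spectrum
    INNER JOIN feature
        ON feature.msrun_msrun_id = spectrum.msrun_msrun_id
    WHERE spectrum.scan_start_time >= feature.rtMin
      AND spectrum.scan_start_time <= feature.rtMax
      AND spectrum.base_peak_mz >= feature.mzMin
      AND spectrum.base_peak_mz <= feature.mzMax
""")
conn.commit()

links = cur.execute("SELECT * FROM spectrum_has_feature").fetchall()
print(links)
```

This removes the Python loop and the per-row INSERT, but the join itself still has to be evaluated, so on its own it may not address the slow part.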
I timed the execute, fetchall and insert steps and got the following:
query took: 74.7989799976 seconds
5888.845541 seconds since fetchall
returned a length of: 10822
inserting all values took: 3.29669690132 seconds
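For reference, the three numbers above come from timing each phase separately, roughly as in the sketch below. The helper name timed_phases and the toy src/dst tables are hypothetical and only there to make the example runnable. One caveat, as far as I understand it: SQLite produces result rows incrementally while they are being fetched, so the time attributed to fetchall() can include most of the work of evaluating the join, not just copying rows into Python.

```python
import sqlite3
import time

def timed_phases(conn, select_sql, insert_sql):
    """Run select_sql, insert its rows with insert_sql, time each phase."""
    cur = conn.cursor()
    t0 = time.perf_counter()
    cur.execute(select_sql)          # prepares the statement, starts the query
    t1 = time.perf_counter()
    rows = cur.fetchall()            # pulls every remaining result row
    t2 = time.perf_counter()
    cur.executemany(insert_sql, rows)
    t3 = time.perf_counter()
    return {"execute": t1 - t0,
            "fetchall": t2 - t1,
            "insert": t3 - t2,
            "rows": len(rows)}

# Hypothetical toy tables, just to exercise the helper.
conn = sqlite3.connect(":memory:")
conn.executescript("CREATE TABLE src (a INT, b INT);"
                   "CREATE TABLE dst (a INT, b INT);")
conn.executemany("INSERT INTO src VALUES (?,?)",
                 [(i, i * 2) for i in range(1000)])

timings = timed_phases(conn,
                       "SELECT a, b FROM src",
                       "INSERT INTO dst VALUES (?,?)")
print(timings)
```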
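So this query takes about an hour and a half, and most of that time is spent in fetchall(). How can I speed this up? Should I do the rt and mz comparison in Python code instead?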
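To make the "do it in Python" option concrete, here is a rough sketch of what I have in mind. It is not something I have benchmarked. It assumes both tables fit in memory as plain tuples, and the function name match_in_python and the tuple layouts are made up for the example. Spectra are sorted by rt per msrun so that bisect can narrow down the candidates for each feature before the mz check.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

def match_in_python(spectra, features):
    """Return sorted (spectrum_id, feature_id) pairs.

    spectra:  iterable of (spectrum_id, rt, mz, msrun_id)
    features: iterable of (feature_id, rtMin, rtMax, mzMin, mzMax, msrun_id)
    """
    # Group spectra per msrun and sort each group by rt.
    by_run = defaultdict(list)
    for spectrum_id, rt, mz, msrun_id in spectra:
        by_run[msrun_id].append((rt, mz, spectrum_id))
    rts = {}
    for msrun_id, rows in by_run.items():
        rows.sort()
        rts[msrun_id] = [row[0] for row in rows]

    matches = []
    for feature_id, rt_min, rt_max, mz_min, mz_max, msrun_id in features:
        # The bounds are nullable in the schema; a NULL bound never matches in SQL.
        if None in (rt_min, rt_max, mz_min, mz_max):
            continue
        if msrun_id not in by_run:
            continue
        rows = by_run[msrun_id]
        # Binary search for the slice of spectra with rtMin <= rt <= rtMax.
        lo = bisect_left(rts[msrun_id], rt_min)
        hi = bisect_right(rts[msrun_id], rt_max)
        for rt, mz, spectrum_id in rows[lo:hi]:
            if mz_min <= mz <= mz_max:
                matches.append((spectrum_id, feature_id))
    return sorted(matches)

# Made-up data: one spectrum inside a feature, one outside, one in another run.
spectra = [(1, 15.0, 150.0, 1), (2, 25.0, 350.0, 1), (3, 15.0, 150.0, 2)]
features = [(10, 10.0, 20.0, 100.0, 200.0, 1),
            (11, 10.0, 20.0, 300.0, 400.0, 1),
            (12, None, None, None, None, 2)]
result = match_in_python(spectra, features)
print(result)
```

The returned pairs could then be written with a single executemany() call on the INSERT statement.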
Update:
To show which indexes I have, here are the create statements for the tables:
CREATE TABLE IF NOT EXISTS `feature` (
`feature_table_id` INT PRIMARY KEY NOT NULL ,
`feature_id` VARCHAR(40) NOT NULL ,
`intensity` DOUBLE NOT NULL ,
`overallquality` DOUBLE NOT NULL ,
`charge` INT NOT NULL ,
`content` VARCHAR(45) NOT NULL ,
`intensity_cutoff` DOUBLE NOT NULL,
`mzMin` DOUBLE NULL ,
`mzMax` DOUBLE NULL ,
`rtMin` DOUBLE NULL ,
`rtMax` DOUBLE NULL ,
`msrun_msrun_id` INT NOT NULL ,
CONSTRAINT `fk_feature_msrun1`
FOREIGN KEY (`msrun_msrun_id` )
REFERENCES `msrun` (`msrun_id` )
ON DELETE NO ACTION
ON UPDATE NO ACTION);
CREATE UNIQUE INDEX `id_UNIQUE` ON `feature` (`feature_table_id` ASC);
CREATE INDEX `fk_feature_msrun1` ON `feature` (`msrun_msrun_id` ASC);
CREATE TABLE IF NOT EXISTS `spectrum` (
`spectrum_id` INT PRIMARY KEY NOT NULL ,
`spectrum_index` INT NOT NULL ,
`ms_level` INT NOT NULL ,
`base_peak_mz` DOUBLE NOT NULL ,
`base_peak_intensity` DOUBLE NOT NULL ,
`total_ion_current` DOUBLE NOT NULL ,
`lowest_observes_mz` DOUBLE NOT NULL ,
`highest_observed_mz` DOUBLE NOT NULL ,
`scan_start_time` DOUBLE NOT NULL ,
`ion_injection_time` DOUBLE,
`binary_data_mz` BLOB NOT NULL,
`binaray_data_rt` BLOB NOT NULL,
`msrun_msrun_id` INT NOT NULL ,
CONSTRAINT `fk_spectrum_msrun1`
FOREIGN KEY (`msrun_msrun_id` )
REFERENCES `msrun` (`msrun_id` )
ON DELETE NO ACTION
ON UPDATE NO ACTION);
CREATE INDEX `fk_spectrum_msrun1` ON `spectrum` (`msrun_msrun_id` ASC);
CREATE TABLE IF NOT EXISTS `spectrum_has_feature` (
`spectrum_spectrum_id` INT NOT NULL ,
`feature_feature_table_id` INT NOT NULL ,
CONSTRAINT `fk_spectrum_has_feature_spectrum1`
FOREIGN KEY (`spectrum_spectrum_id` )
REFERENCES `spectrum` (`spectrum_id` )
ON DELETE NO ACTION
ON UPDATE NO ACTION,
CONSTRAINT `fk_spectrum_has_feature_feature1`
FOREIGN KEY (`feature_feature_table_id` )
REFERENCES `feature` (`feature_table_id` )
ON DELETE NO ACTION
ON UPDATE NO ACTION);
CREATE INDEX `fk_spectrum_has_feature_feature1` ON `spectrum_has_feature` (`feature_feature_table_id` ASC);
CREATE INDEX `fk_spectrum_has_feature_spectrum1` ON `spectrum_has_feature` (`spectrum_spectrum_id` ASC);
Update 2:
I have 20938 spectra, 305742 features and 2 msruns. The query results in 10822 matches.
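To put those counts in perspective, the arithmetic below estimates how many spectrum/feature pairs the join may have to compare if nothing narrows the rt or mz range. The even split of rows over the two msruns is purely an assumption. I have not checked the real distribution.

```python
n_spectra = 20938
n_features = 305742
n_msruns = 2
n_matches = 10822

# Every spectrum against every feature, ignoring the msrun condition.
all_pairs = n_spectra * n_features

# Assumption: spectra and features are split evenly over the msruns, so
# each run contributes (n_spectra / n_msruns) * (n_features / n_msruns)
# pairs, which sums to all_pairs / n_msruns.
pairs_same_run = all_pairs // n_msruns

# Fraction of same-run pairs that actually end up as matches.
match_fraction = n_matches / pairs_same_run

print(all_pairs, pairs_same_run, match_fraction)
```

Under that assumption the join looks at roughly 3.2 billion candidate pairs to find about eleven thousand matches.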
Update 3:
Using a new index (CREATE INDEX fk_spectrum_msrun1_2 ON spectrum (msrun_msrun_id, base_peak_mz);) saves about 20 seconds:
query took: 76.4599349499 seconds
5864.15418601 seconds since fetchall
Update 4:
Output of EXPLAIN QUERY PLAN:
(0, 0, 0, u'SCAN TABLE spectrum (~1000000 rows)'), (0, 1, 1, u'SEARCH TABLE feature USING INDEX fk_feature_msrun1 (msrun_msrun_id=?) (~2 rows)')
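The plan above can be reproduced, and candidate indexes compared, with a small script like the one below. This is a sketch under assumptions: the tables are reduced to the columns the join uses and are empty, and the index name idx_feature_msrun_rt and its column choice are hypothetical. Whether SQLite actually picks such an index, and whether it helps on the real data, is something I would still have to measure. The exact wording of the plan rows also differs between SQLite versions.

```python
import sqlite3

QUERY = """
    SELECT spectrum_id, feature_table_id
    FROM spectrum
    INNER JOIN feature
        ON feature.msrun_msrun_id = spectrum.msrun_msrun_id
    WHERE spectrum.scan_start_time >= feature.rtMin
      AND spectrum.scan_start_time <= feature.rtMax
      AND spectrum.base_peak_mz >= feature.mzMin
      AND spectrum.base_peak_mz <= feature.mzMax
"""

def query_plan(conn, sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# Reduced copy of the schema (assumption: only the joined columns matter here).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE feature (feature_table_id INT PRIMARY KEY NOT NULL,
                          mzMin DOUBLE, mzMax DOUBLE,
                          rtMin DOUBLE, rtMax DOUBLE,
                          msrun_msrun_id INT NOT NULL);
    CREATE INDEX fk_feature_msrun1 ON feature (msrun_msrun_id ASC);
    CREATE TABLE spectrum (spectrum_id INT PRIMARY KEY NOT NULL,
                           base_peak_mz DOUBLE NOT NULL,
                           scan_start_time DOUBLE NOT NULL,
                           msrun_msrun_id INT NOT NULL);
    CREATE INDEX fk_spectrum_msrun1 ON spectrum (msrun_msrun_id ASC);
""")

plan_before = query_plan(conn, QUERY)

# Hypothetical candidate: equality column first, then one range column.
conn.execute("CREATE INDEX idx_feature_msrun_rt "
             "ON feature (msrun_msrun_id, rtMin)")
plan_after = query_plan(conn, QUERY)

for line in plan_before:
    print("before:", line)
for line in plan_after:
    print("after: ", line)
```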