1

我有一个在 MySQL 中运行的查询。如您所见,查询的每个部分都在索引字段上。然而,查询需要很长时间(几十分钟,比我愿意等待的时间长)。Connect 表由两个整数和两个索引组成(一个字段一,字段二,另一个字段二,字段一)。源和目标是具有单个索引 int 字段的表。鉴于所有索引,我希望此查询在几秒钟内完成。关于 1 的任何建议:为什么需要这么长时间,以及 2:如何让它更快?

谢谢!

mysql> explain 
SELECT DISTINCT geneConnect.geneSymbolID FROM SNPEffectGeneConnector AS geneConnect 
  JOIN IndelSNPEffectConnector AS snpEConnect ON geneConnect.snpEffectID = snpEConnect.snpEffectID 
  JOIN InDels2 AS source ON source.id = snpEConnect.indelID 
  WHERE geneConnect.geneSymbolID NOT IN (
    SELECT geneConnect.geneSymbolID FROM SNPEffectGeneConnector AS geneConnect 
    JOIN IndelSNPEffectConnector AS snpEConnect ON geneConnect.snpEffectID = snpEConnect.snpEffectID 
    JOIN InDels3 AS target ON target.id = snpEConnect.indelID);
+----+--------------------+-------------+-------+-------------------+----------+---------+-----------------------------------------------------------------------+------+--------------------------------+
| id | select_type        | table       | type  | possible_keys     | key      | key_len | ref                                                                   | rows | Extra                          |
+----+--------------------+-------------+-------+-------------------+----------+---------+-----------------------------------------------------------------------+------+--------------------------------+
|  1 | PRIMARY            | source      | index | id                | id       | 4       | NULL                                                                  | 5771 | Using index; Using temporary   |
|  1 | PRIMARY            | snpEConnect | ref   | snpEList          | snpEList | 4       | treattablebrowser.source.id                                           |    2 | Using index                    |
|  1 | PRIMARY            | geneConnect | ref   | snpEList          | snpEList | 4       | treattablebrowser.snpEConnect.snpEffectID                             |    1 | Using where; Using index       |
|  2 | DEPENDENT SUBQUERY | geneConnect | ref   | snpEList,geneList | geneList | 4       | func                                                                  |    1 | Using index                    |
|  2 | DEPENDENT SUBQUERY | target      | index | id                | id       | 4       | NULL                                                                  | 6297 | Using index; Using join buffer |
|  2 | DEPENDENT SUBQUERY | snpEConnect | ref   | snpEList          | snpEList | 8       | treattablebrowser.target.id,treattablebrowser.geneConnect.snpEffectID |    1 | Using index                    |
+----+--------------------+-------------+-------+-------------------+----------+---------+-----------------------------------------------------------------------+------+--------------------------------+

6 行(0.01 秒)

4

4 回答 4

3

我想这在很大程度上是学术兴趣,现在格雷格自己解决了。很高兴知道我对这些事情的直觉可以完全打破。我仍然可以用三种方式重写它。我认为第一个可以简化,但正如 Greg 指出的那样,简化不起作用。不确定这是否会比原来的更快,尽管它在我在 sql server 上的测试中确实产生了不同的查询计划。

Select Distinct
    g1.geneSymbolID 
From
    SNPEffectGeneConnector AS g1 
        Inner Join
    IndelSNPEffectConnector AS s1 
        ON g1.snpEffectID = s1.snpEffectID 
        Inner Join
    InDels2 AS i2 ON i2.id = s1.indelID 
Where Not Exists (
    Select 'x'
        From
            SNPEffectGeneConnector As g2
                Inner Join
            IndelSNPEffectConnector AS s2 
                On g2.snpEffectID = s2.snpEffectID 
                Inner Join
            InDels3 As i3
                On i3.id = s2.indelID
        Where
            g2.geneSymbolID = g1.geneSymbolID
    );

我不是 100% 确定第二种方式,但它适用于我非常少量的测试数据。如果它有效,它的查询计划要短得多(不一定更快,但一个很好的指示):

Select
  geneSymbolID
From
  SNPEffectGeneConnector As g
    Inner Join 
  IndelSNPEffectConnector As s
    ON g.snpEffectID = s.snpEffectID 
    Left Outer Join
  InDels2 As i2 
    On i2.id = s.indelID 
    Left Outer Join
  InDels3 As i3
    On i3.id = s.indelID
Group By
    geneSymbolID
Having
    count(i2.id) > 0 And
    count(i3.id) = 0

另一种方法(为非描述性别名道歉):

Select
    g.geneSymbolID
From
    SNPEffectGeneConnector As g
        Inner Join
    IndelSNPEffectConnector AS s
        On g.snpEffectID = s.snpEffectID 
        Inner Join (
        Select 
            i2.id,
            0 As c
        From    
            InDels2 i2
        Union All
        Select
            i3.id,
            1
        From
            InDels3 i3
    ) as i23
    on s.indelID = i23.id
Group By
    g.geneSymbolID
Having
    max(i23.c) = 0;

http://sqlfiddle.com/#!2/944e1/10

于 2012-11-05T22:09:44.210 回答
0
    SELECT DISTINCT geneConnect.geneSymbolID 
    FROM SNPEffectGeneConnector AS geneConnect 
      JOIN IndelSNPEffectConnector AS snpEConnect 
          ON geneConnect.snpEffectID = snpEConnect.snpEffectID 
      JOIN InDels2 AS source ON source.id = snpEConnect.indelID
      LEFT OUTER JOIN InDels3 AS target ON target.id = snpEConnect.indelID
    WHERE target.id is null

上面的查询应该与您的查询等效,并为您提供更好的性能。

于 2012-11-05T22:09:57.057 回答
0

如果我理解正确,您希望找到所有geneSymbolID' 中的SNPEffectGeneConnector条目IndelSNPEffectConnector,以便它们确实具有匹配项 (on indelID) inInDels2没有与in相同 的匹配项。indelIDInDels3

然后您可以运行查询的第一部分(“do”部分),然后进一步连接最后一部分,从而收集所有匹配的基因。ALEFT JOIN与强加匹配失败的基因符号表将产生所有不符合反向标准的基因,因此是感兴趣的。

修改后的答案

这是匹配的查询:

SELECT DISTINCT genes.geneSymbolID
FROM ( SELECT DISTINCT geneSymbolID FROM SNPEffectGeneConnector ) AS genes
JOIN SNPEffectGeneConnector AS effectSource
    ON ( genes.geneSymbolID = effectSource.geneSymbolID)
JOIN SNPEffectGeneConnector AS effectTarget
    ON ( genes.geneSymbolID = effectTarget.geneSymbolID)
JOIN IndelSNPEffectConnector AS indelSource
    ON ( indelSource.snpEffectID = effectSource.snpEffectID )
JOIN IndelSNPEffectConnector AS indelTarget
    ON ( indelTarget.snpEffectID = effectTarget.snpEffectID ) 
     JOIN InDels2 ON ( indelSource.indelId = InDels2.id )
     JOIN InDels3 ON ( indelTarget.indelId = InDels3.id )
;

现在,对于这个查询,我认为您需要这些索引:

CREATE INDEX SNPEffectGeneConnector_ndx
    ON SNPEffectGeneConnector(snpEffectID, geneSymbolID);

CREATE INDEX SNPEffectGeneConnector_ndx2
    ON SNPEffectGeneConnector(geneSymbolID);

CREATE INDEX IndelSNPEffectConnector_ndx
    ON IndelSNPEffectConnector(snpEffectID, indelID);
CREATE [UNIQUE?] INDEX InDels2_ndx ON InDels2(id); -- unless id is primary key
CREATE [UNIQUE?] INDEX InDels3_ndx ON InDels3(id); -- unless id is primary key

获取感兴趣的基因:

SELECT glob.geneSymbolID
    FROM ( SELECT DISTINCT geneSymbolID FROM SNPEffectGeneConnector ) AS glob
    LEFT JOIN (
SELECT DISTINCT genes.geneSymbolID
FROM ( SELECT DISTINCT geneSymbolID FROM SNPEffectGeneConnector ) AS genes
JOIN SNPEffectGeneConnector AS effectSource
    ON ( genes.geneSymbolID = effectSource.geneSymbolID)
JOIN SNPEffectGeneConnector AS effectTarget
    ON ( genes.geneSymbolID = effectTarget.geneSymbolID)
JOIN IndelSNPEffectConnector AS indelSource
    ON ( indelSource.snpEffectID = effectSource.snpEffectID )
JOIN IndelSNPEffectConnector AS indelTarget
    ON ( indelTarget.snpEffectID = effectTarget.snpEffectID ) 
     JOIN InDels2 ON ( indelSource.indelId = InDels2.id )
     JOIN InDels3 ON ( indelTarget.indelId = InDels3.id )
) AS fits ON (glob.geneSymbolID = fits.geneSymbolID)
WHERE fits.geneSymbolID IS NULL;

测试

CREATE TABLE InDels2 ( id integer );
INSERT INTO InDels2 VALUES ( 1 );
CREATE TABLE InDels3 ( id integer );
INSERT INTO InDels3 VALUES ( 2 );
CREATE TABLE IndelSNPEffectConnector ( indelId integer, snpEffectID integer );
INSERT INTO IndelSNPEffectConnector VALUES ( 1, 55 ), ( 2, 88 );
CREATE TABLE SNPEffectGeneConnector ( geneSymbolID integer, snpEffectID integer );
INSERT INTO SNPEffectGeneConnector VALUES ( 100, 55 ), ( 100, 88 );

因此,由于基因 100 连接到连接到 1 的 55,因此在 ID2 中注明,但它也连接到连接到 2 的 88,因此在 ID3 中,它不能出现。

出现什么?如果我理解了要求,我们需要一个基因,引起一个效果,它的插入缺失不在. inDels3因此,比如说,导致效应 77 的基因 42,与 indel 3 相关联,而 indel 3 中存在inDels3,则必须出现。

所以:

INSERT INTO SNPEffectGeneConnector VALUES ( 42, 55 );
INSERT INTO SNPEffectGeneConnector VALUES ( 42, 77 );
INSERT INTO IndelSNPEffectConnector VALUES ( 3, 77 );

产量

+--------------+
| geneSymbolID |
+--------------+
|           42 |
+--------------+

可以使用对第一个查询的修改来检查为什么 42 去,而 100 不去:

SELECT genes.geneSymbolID, effectSource.snpEffectID, effectTarget.snpEffectID, indelSource.indelId AS sourceInDel, indelTarget.indelId AS targetInDel, InDels3.id
FROM ( SELECT DISTINCT geneSymbolID FROM SNPEffectGeneConnector ) AS genes
 JOIN SNPEffectGeneConnector AS effectSource
     ON ( genes.geneSymbolID = effectSource.geneSymbolID)
 JOIN SNPEffectGeneConnector AS effectTarget
     ON ( genes.geneSymbolID = effectTarget.geneSymbolID)
 JOIN IndelSNPEffectConnector AS indelSource
     ON ( indelSource.snpEffectID = effectSource.snpEffectID )
 JOIN IndelSNPEffectConnector AS indelTarget
     ON ( indelTarget.snpEffectID = effectTarget.snpEffectID )

      JOIN InDels2 ON ( indelSource.indelId = InDels2.id )
 LEFT JOIN InDels3 ON ( indelTarget.indelId = InDels3.id );

+--------------+-------------+-------------+-------------+-------------+------+
| geneSymbolID | snpEffectID | snpEffectID | sourceInDel | targetInDel | id   |
+--------------+-------------+-------------+-------------+-------------+------+
|           42 |          55 |          55 |           1 |           1 | NULL |
|           42 |          55 |          77 |           1 |           3 | NULL |
|          100 |          55 |          55 |           1 |           1 | NULL |
|          100 |          55 |          88 |           1 |           2 |    2 |
+--------------+-------------+-------------+-------------+-------------+------+

...100 有一行,其 InDels3 的 ID 不为空,它报告目标 indel 2。

于 2012-11-05T22:38:26.487 回答
0

事实证明,问题在于,虽然所有内容都有索引,但子查询返回的基因 ID没有索引。加入/对未索引的数字集合进行 IN 搜索的性能非常差,这就是我得到的。

我的解决方案是分别进行外部和内部 JOIN,将结果转储到两个不同的索引表中,然后删除 1 中的也位于 2 中的基因 ID。

故事的寓意:永远不要加入或加入任何未编入索引的集合。

于 2012-11-07T03:37:23.543 回答