5

I'm currently working on a project involving Patents pulled the USPTO website, as part of this project I'm using the database created by people from the University of Illinois
(Paper: http://abel.lis.illinois.edu/UPDC/USPTOPatentsDatabaseConstruction.pdf)
(Slightly outdated Schema of the Tables I'm using, only missing non-index/key values: http://i.imgur.com/44LHS3L.png)

Now as the title says I'm trying to optimize a query:

SELECT 
        PN,
        AN,
        grants.GrantID,
        grants.FileDate,
        grants.IssueDate,
        grants.Kind,
        grants.ApplicationID,
        assignee_g.OrgName,
        GROUP_CONCAT(DISTINCT CONCAT_WS(', ', assignee_g.City, assignee_g.State, assignee_g.Country) separator ';') as Assignee,
        GROUP_CONCAT(DISTINCT CONCAT_WS(', ', inventor_g.FirstName, inventor_g.LastName) separator ';') as Inventor,
        GROUP_CONCAT(DISTINCT CONCAT_WS(', ', inventor_g.City, inventor_g.State, inventor_g.Country) separator ';') as Inventor_address,
        GROUP_CONCAT(DISTINCT CONCAT_WS(', ', usclass_g.Class, usclass_g.Subclass) separator ';') as USClass,
        intclass_g.Section,
        intclass_g.Class,
        intclass_g.Subclass,
        intclass_g.MainGroup,
        intclass_g.SubGroup
FROM
    (
    SELECT grants.GrantID as CitingID, CitedID as PN, grants2.ApplicationID AS AN
    FROM
        gracit_g, grants, grants as grants2
    Where
        grants.GrantID IN (*A 
                                         couple 
                                           Thousand
                                              keys*)
            and grants.GrantID = gracit_g.GrantID and grants2.GrantID = CitedID 
    LIMIT 500000) tbl1,
             grants, assignee_g, inventor_g, usclass_g, intclass_g
WHERE
    grants.GrantID = tbl1.CitingID
        and grants.GrantID = assignee_g.GrantID
        and grants.GrantID = inventor_g.GrantID
        and grants.GrantID = usclass_g.GrantID
        and grants.GrantID = intclass_g.GrantID
GROUP BY PN, GrantID
LIMIT 50000000

Pretty much for each Patent Being cited by one that came after it I want to record the information of the patent citing it. The problem I seem to be running into is that my "GROUP BY PN, GrantID" is causing "Using Temporary, Using Filesort" which is severely slowing down my endeavours.

This is what my explain gave to me (sorry if it's not perfectly formatted, I couldn't find how to make a table)

1
PRIMARY
derived2
ALL
8716
possible key: null
key: null
key_len: null
ref: null
Using temporary; Using filesort


1
PRIMARY
grants
eq_ref
PRIMARY
PRIMARY
62
tbl1.CitingID
1


1
PRIMARY
assignee_g
ref
PRIMARY,FK_PublicationID_PUBLICATION_ASSIGNEE_P
PRIMARY
62
tbl1.CitingID
1


1
PRIMARY
intclass_g
ref
PRIMARY,fk_publicationid_PUBLICATION_INTERNATIONALCLASS_P
PRIMARY
62
tbl1.CitingID
1


1
PRIMARY
inventor_g
ref
PRIMARY,fk_PublicationID_Inventor_p
PRIMARY
62
tbl1.CitingID
1


1
PRIMARY
usclass_g
ref
PRIMARY,fk_publicationid_PUBLICATION_USCLASS_P
PRIMARY
62
tbl1.CitingID
2


2
DERIVED
grants
range
PRIMARY
PRIMARY
62
ref: null
2179
Using where; Using index


2
DERIVED
gracit_g
ref
PRIMARY,FK_PublicationID_PUBLICATION_PCITATION_P,CitedID
PRIMARY
62
uspto_patents.grants.GrantID
4
Using where


2
DERIVED
grants2
eq_ref
PRIMARY
PRIMARY
62
uspto_patents.gracit_g.CitedID
1


The SHOW CREATE for gracit_g is:

CREATE TABLE `gracit_g` (
`GrantID` varchar(20) NOT NULL,
`Position` int(11) NOT NULL,
`CitedID` varchar(20) DEFAULT NULL,
`Kind` varchar(10) DEFAULT NULL COMMENT 'identify whether citedDoc is a document or foreign patent',
`Name` varchar(100) DEFAULT NULL,
`Date` date DEFAULT NULL,
`Country` varchar(100) DEFAULT NULL,
`Category` varchar(100) DEFAULT NULL,
PRIMARY KEY (`GrantID`,`Position`),
KEY `FK_PublicationID_PUBLICATION_PCITATION_P` (`GrantID`),
KEY `CitedID` (`CitedID`),
CONSTRAINT `FK_GrantID_GRANT_PCITATION_G0` FOREIGN KEY (`GrantID`) REFERENCES `grants`   (`GrantID`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8

The SHOW CREATE for grants is:

 CREATE TABLE `grants` (
 `GrantID` varchar(20) NOT NULL,
 `Title` varchar(500) DEFAULT NULL,
 `IssueDate` date DEFAULT NULL,
 `Kind` varchar(2) DEFAULT NULL,
 `USSeriesCode` varchar(2) DEFAULT NULL,
 `Abstract` text,
 `ClaimsNum` int(11) DEFAULT NULL,
 `DrawingsNum` int(11) DEFAULT NULL,
 `FiguresNum` int(11) DEFAULT NULL,
 `ApplicationID` varchar(20) NOT NULL,
 `Claims` text,
 `FileDate` date DEFAULT NULL,
 `AppType` varchar(45) DEFAULT NULL,
 `AppNoOrig` varchar(10) DEFAULT NULL,
 `SourceName` varchar(100) DEFAULT NULL,
 PRIMARY KEY (`GrantID`)
 ) ENGINE=InnoDB DEFAULT CHARSET=utf8

Thank you very much for your time, unfortunately I must retire to my bed as it is far too late (or early at this point) for me to continue working at this for now)



One suggestion was to change it into 1 query instead of having a subquery:

  SELECT 
        gracit_g.citedID,
        info_grant.GrantID,
        info_grant.FileDate,
        info_grant.IssueDate,
        info_grant.Kind,
        info_grant.ApplicationID,
        assignee_g.OrgName,
        GROUP_CONCAT(DISTINCT CONCAT_WS(', ', assignee_g.City, assignee_g.State, assignee_g.Country) separator ';') as Assignee,
        GROUP_CONCAT(DISTINCT CONCAT_WS(', ', inventor_g.FirstName, inventor_g.LastName) separator ';') as Inventor,
        GROUP_CONCAT(DISTINCT CONCAT_WS(', ', inventor_g.City, inventor_g.State, inventor_g.Country) separator ';') as Inventor_address,
        GROUP_CONCAT(DISTINCT CONCAT_WS(', ', usclass_g.Class, usclass_g.Subclass) separator ';') as USClass,
        intclass_g.Section,
        intclass_g.Class,
        intclass_g.Subclass,
        intclass_g.MainGroup,
        intclass_g.SubGroup
FROM
    gracit_g, grants as info_grant, assignee_g, inventor_g, usclass_g, intclass_g
WHERE
        gracit_g.GrantID IN (*KEYS*)
        and info_grant.GrantID = gracit_g.GrantID
        and info_grant.GrantID = assignee_g.GrantID
        and info_grant.GrantID = inventor_g.GrantID
        and info_grant.GrantID = usclass_g.GrantID
        and info_grant.GrantID = intclass_g.GrantID
GROUP BY gracit_g.citedID, info_grant.GrantID
LIMIT 50000000

This has cut it down from 21s duration/10s fetch to 13s duration/8s fetch which I would still want to improve as I have many Keys to pass through.

4

1 回答 1

2

您的查询格式为:

SELECT some_fields
FROM (
    SELECT other_fields
    FROM table1, table2
    WHERE join_condition_table1_table2 AND some_other_condition
) AS subquery, table3
WHERE join_condition_subquery_table3
GROUP BY another_field

您需要将其重写如下:

SELECT some_fields
FROM table1, table2, table3
WHERE
    join_condition_table1_table2
    AND join_condition_subquery_table3 -- actually rewrite this ans a join of either table1 and table3, or table2 and table3
    AND some_other_condition
GROUP BY another_field

正如@Ollie Jones 所指出的,选择SELECT既不是GROUP BY条件的一部分也不是聚合函数内部的字段(在子句中)是危险的。如果这些字段不唯一依赖于GROUP BY条件中的字段,则这些字段的值是未定义的。

[编辑]

还有一些建议:

  • gracit_g(citedID, GrantID)按此顺序添加索引( ALTER TABLE gracit_g ADD INDEX(citedID, GrantID);) 并将您的GROUP BY子句更改为GROUP BY gracit_g.citedID, gracit_g.GrantID. 优化器可能会喜欢使用这个索引来计算GROUP BY子句。

  • 如果您的VARCHAR主键实际上是数字,请将其类型更改为合适的整数类型。如果没有,请添加一个数字代理键并将其用作主键。整数比较要快得多,并且您在所有连接中都进行了很多比较。

  • 预先计算连接值,例如CONCAT_WS(', ', assignee_g.City, assignee_g.State, assignee_g.Country) separator ';')在额外的列或额外的表中(后者将需要每个表的额外连接)

  • 增加tmp_table_sizemax_heap_table_size服务器选项。如果临时表的增长大于这两个值中的任何一个(以字节为单位),则临时表将无法保存在内存中,并将被写入磁盘。您可能会从此处异常大的值中受益,因为您正在处理异常大的结果集。

我不知道是否还有其他事情要做。您可能需要考虑返回更小的结果集(更少的列、更多的过滤器或更小的LIMIT)。

于 2013-06-17T14:00:39.203 回答