I'm currently working on a project involving Patents pulled the USPTO website, as part of this project I'm using the database created by people from the University of Illinois
(Paper: http://abel.lis.illinois.edu/UPDC/USPTOPatentsDatabaseConstruction.pdf)
(Slightly outdated Schema of the Tables I'm using, only missing non-index/key values: http://i.imgur.com/44LHS3L.png)
Now as the title says I'm trying to optimize a query:
SELECT
PN,
AN,
grants.GrantID,
grants.FileDate,
grants.IssueDate,
grants.Kind,
grants.ApplicationID,
assignee_g.OrgName,
GROUP_CONCAT(DISTINCT CONCAT_WS(', ', assignee_g.City, assignee_g.State, assignee_g.Country) separator ';') as Assignee,
GROUP_CONCAT(DISTINCT CONCAT_WS(', ', inventor_g.FirstName, inventor_g.LastName) separator ';') as Inventor,
GROUP_CONCAT(DISTINCT CONCAT_WS(', ', inventor_g.City, inventor_g.State, inventor_g.Country) separator ';') as Inventor_address,
GROUP_CONCAT(DISTINCT CONCAT_WS(', ', usclass_g.Class, usclass_g.Subclass) separator ';') as USClass,
intclass_g.Section,
intclass_g.Class,
intclass_g.Subclass,
intclass_g.MainGroup,
intclass_g.SubGroup
FROM
(
SELECT grants.GrantID as CitingID, CitedID as PN, grants2.ApplicationID AS AN
FROM
gracit_g, grants, grants as grants2
Where
grants.GrantID IN (*A
couple
Thousand
keys*)
and grants.GrantID = gracit_g.GrantID and grants2.GrantID = CitedID
LIMIT 500000) tbl1,
grants, assignee_g, inventor_g, usclass_g, intclass_g
WHERE
grants.GrantID = tbl1.CitingID
and grants.GrantID = assignee_g.GrantID
and grants.GrantID = inventor_g.GrantID
and grants.GrantID = usclass_g.GrantID
and grants.GrantID = intclass_g.GrantID
GROUP BY PN, GrantID
LIMIT 50000000
Pretty much for each Patent Being cited by one that came after it I want to record the information of the patent citing it. The problem I seem to be running into is that my "GROUP BY PN, GrantID" is causing "Using Temporary, Using Filesort" which is severely slowing down my endeavours.
This is what my explain gave to me (sorry if it's not perfectly formatted, I couldn't find how to make a table)
1
PRIMARY
derived2
ALL
8716
possible key: null
key: null
key_len: null
ref: null
Using temporary; Using filesort
1
PRIMARY
grants
eq_ref
PRIMARY
PRIMARY
62
tbl1.CitingID
1
1
PRIMARY
assignee_g
ref
PRIMARY,FK_PublicationID_PUBLICATION_ASSIGNEE_P
PRIMARY
62
tbl1.CitingID
1
1
PRIMARY
intclass_g
ref
PRIMARY,fk_publicationid_PUBLICATION_INTERNATIONALCLASS_P
PRIMARY
62
tbl1.CitingID
1
1
PRIMARY
inventor_g
ref
PRIMARY,fk_PublicationID_Inventor_p
PRIMARY
62
tbl1.CitingID
1
1
PRIMARY
usclass_g
ref
PRIMARY,fk_publicationid_PUBLICATION_USCLASS_P
PRIMARY
62
tbl1.CitingID
2
2
DERIVED
grants
range
PRIMARY
PRIMARY
62
ref: null
2179
Using where; Using index
2
DERIVED
gracit_g
ref
PRIMARY,FK_PublicationID_PUBLICATION_PCITATION_P,CitedID
PRIMARY
62
uspto_patents.grants.GrantID
4
Using where
2
DERIVED
grants2
eq_ref
PRIMARY
PRIMARY
62
uspto_patents.gracit_g.CitedID
1
The SHOW CREATE for gracit_g is:
CREATE TABLE `gracit_g` (
`GrantID` varchar(20) NOT NULL,
`Position` int(11) NOT NULL,
`CitedID` varchar(20) DEFAULT NULL,
`Kind` varchar(10) DEFAULT NULL COMMENT 'identify whether citedDoc is a document or foreign patent',
`Name` varchar(100) DEFAULT NULL,
`Date` date DEFAULT NULL,
`Country` varchar(100) DEFAULT NULL,
`Category` varchar(100) DEFAULT NULL,
PRIMARY KEY (`GrantID`,`Position`),
KEY `FK_PublicationID_PUBLICATION_PCITATION_P` (`GrantID`),
KEY `CitedID` (`CitedID`),
CONSTRAINT `FK_GrantID_GRANT_PCITATION_G0` FOREIGN KEY (`GrantID`) REFERENCES `grants` (`GrantID`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8
The SHOW CREATE for grants is:
CREATE TABLE `grants` (
`GrantID` varchar(20) NOT NULL,
`Title` varchar(500) DEFAULT NULL,
`IssueDate` date DEFAULT NULL,
`Kind` varchar(2) DEFAULT NULL,
`USSeriesCode` varchar(2) DEFAULT NULL,
`Abstract` text,
`ClaimsNum` int(11) DEFAULT NULL,
`DrawingsNum` int(11) DEFAULT NULL,
`FiguresNum` int(11) DEFAULT NULL,
`ApplicationID` varchar(20) NOT NULL,
`Claims` text,
`FileDate` date DEFAULT NULL,
`AppType` varchar(45) DEFAULT NULL,
`AppNoOrig` varchar(10) DEFAULT NULL,
`SourceName` varchar(100) DEFAULT NULL,
PRIMARY KEY (`GrantID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
Thank you very much for your time, unfortunately I must retire to my bed as it is far too late (or early at this point) for me to continue working at this for now)
One suggestion was to change it into 1 query instead of having a subquery:
SELECT
gracit_g.citedID,
info_grant.GrantID,
info_grant.FileDate,
info_grant.IssueDate,
info_grant.Kind,
info_grant.ApplicationID,
assignee_g.OrgName,
GROUP_CONCAT(DISTINCT CONCAT_WS(', ', assignee_g.City, assignee_g.State, assignee_g.Country) separator ';') as Assignee,
GROUP_CONCAT(DISTINCT CONCAT_WS(', ', inventor_g.FirstName, inventor_g.LastName) separator ';') as Inventor,
GROUP_CONCAT(DISTINCT CONCAT_WS(', ', inventor_g.City, inventor_g.State, inventor_g.Country) separator ';') as Inventor_address,
GROUP_CONCAT(DISTINCT CONCAT_WS(', ', usclass_g.Class, usclass_g.Subclass) separator ';') as USClass,
intclass_g.Section,
intclass_g.Class,
intclass_g.Subclass,
intclass_g.MainGroup,
intclass_g.SubGroup
FROM
gracit_g, grants as info_grant, assignee_g, inventor_g, usclass_g, intclass_g
WHERE
gracit_g.GrantID IN (*KEYS*)
and info_grant.GrantID = gracit_g.GrantID
and info_grant.GrantID = assignee_g.GrantID
and info_grant.GrantID = inventor_g.GrantID
and info_grant.GrantID = usclass_g.GrantID
and info_grant.GrantID = intclass_g.GrantID
GROUP BY gracit_g.citedID, info_grant.GrantID
LIMIT 50000000
This has cut it down from 21s duration/10s fetch to 13s duration/8s fetch which I would still want to improve as I have many Keys to pass through.