I am using MySQL and want to find the duplicate rows between two tables. I used a join, but it takes too long because of the table sizes (for example, the staging table has 800k records while the main table has around 100 million records).

The query I am using is as follows:

INSERT INTO 
    tblspduplicate
SELECT 
    T2.SP,T1.FileImportedDate,T2.XYZFileName 
FROM  
    tblspmaster T1
INNER JOIN 
    tblstaging T2 
ON 
    T1.SP=T2.SP;

CREATE TABLE `tblspmaster` (
  `CSN` bigint(20) NOT NULL AUTO_INCREMENT,
  `SP` varchar(50) NOT NULL,
  `FileImportedDate` date NOT NULL,
  `XYZFileName` varchar(50) NOT NULL,
  `XYZBatch` varchar(50) NOT NULL,
  `BatchProcessedDate` date NOT NULL,
  `ExpiryDate` date NOT NULL,
  `Region` varchar(50) NOT NULL,
  `FCCity` varchar(50) NOT NULL,
  `VendorID` int(11) NOT NULL,
  `LocationID` int(11) NOT NULL,
  PRIMARY KEY (`CSN`)
) ENGINE=InnoDB AUTO_INCREMENT=7484570 DEFAULT CHARSET=latin1;


CREATE TABLE `tblstaging` (
  `CSN` bigint(20) NOT NULL AUTO_INCREMENT,
  `SP` varchar(50) NOT NULL,
  `FileImportedDate` date NOT NULL,
  `XYZFileName` varchar(50) NOT NULL,
  `XYZBatch` varchar(50) NOT NULL,
  `BatchProcessedDate` date NOT NULL,
  `ExpiryDate` date NOT NULL,
  `Region` varchar(50) NOT NULL,
  `FCCity` varchar(50) NOT NULL,
  `VendorID` int(11) NOT NULL,
  `LocationID` int(11) NOT NULL,
  PRIMARY KEY (`CSN`),
  KEY `ind_staging` (`SP`)
) ENGINE=InnoDB AUTO_INCREMENT=851956 DEFAULT CHARSET=latin1;
2 Answers

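Do you have an index on tblspmaster.SP? That is the most important thing. With such an index, your query should be fine. First, though, test it with just the select, without the insert.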
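As a minimal sketch of that suggestion: the schema in the question shows no index on tblspmaster.SP, so one could be added and the plan checked with the bare select. The index name idx_spmaster_sp is an assumption made up for illustration, and building an index on a table of this size can itself take a while.

```sql
-- Hypothetical index name; choose whatever fits your naming scheme.
ALTER TABLE tblspmaster ADD INDEX idx_spmaster_sp (SP);

-- Inspect the plan for the select alone before running the INSERT.
EXPLAIN
SELECT
    T2.SP, T1.FileImportedDate, T2.XYZFileName
FROM
    tblspmaster T1
INNER JOIN
    tblstaging T2
ON
    T1.SP = T2.SP;
```

If the plan shows the small staging table being scanned and tblspmaster being probed through the new index, the join is doing roughly 800k index lookups rather than scanning the large table.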

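Another problem you might have is duplicate matches. If the same SP value appears several times in either table, the join multiplies the rows, which can significantly increase the amount of data produced. You can test for this with: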

select sp, count(*) as cnt
from tblspmaster
group by sp
having cnt > 1
order by cnt desc;

select sp, count(*) as cnt
from tblstaging
group by sp
having cnt > 1
order by cnt desc;

EDIT:

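Based on the table structure, I repeat my suggestion of an index on tblspmaster(SP). You might also need to drop the index on tblstaging(SP). Alternatively, you can force the use of the master index rather than the staging index by using index hints (the syntax is described in the MySQL reference manual under index hints).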

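Also, I would suggest running the counts above to see how much risk there is of an unexpectedly large number of rows due to repeated SP values.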
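The index-hint option could look roughly like the following. This is a sketch, not a tuned query: it assumes an index on tblspmaster(SP) already exists under the hypothetical name idx_spmaster_sp, and whether the hint actually helps should be confirmed with EXPLAIN.

```sql
-- Assumes a (hypothetical) index idx_spmaster_sp on tblspmaster(SP).
-- FORCE INDEX tells the optimizer to use that index for T1 when it can.
SELECT
    T2.SP, T1.FileImportedDate, T2.XYZFileName
FROM
    tblstaging T2
INNER JOIN
    tblspmaster T1 FORCE INDEX (idx_spmaster_sp)
ON
    T1.SP = T2.SP;
```

Once the select behaves well, the INSERT INTO tblspduplicate line from the question can be put back in front of it.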

answered 2013-09-07T12:13:39.667

Maybe you could use SQL INTERSECT, but I don't know how long it would take.
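For reference, an INTERSECT version might look like the sketch below. Note the assumptions: MySQL only supports INTERSECT from version 8.0.31 onward, so this does not run on older servers, and it returns only the distinct SP values common to both tables, not FileImportedDate or XYZFileName.

```sql
-- Requires MySQL 8.0.31 or later.
-- Returns each SP value that appears in both tables, once.
SELECT SP FROM tblstaging
INTERSECT
SELECT SP FROM tblspmaster;
```

It still has to look up SP values in tblspmaster, so it benefits from an index on that column just as the join does.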

answered 2013-09-07T12:13:16.813