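I have a very large table with more than 10 million records. I want to find duplicates based on some fields that match and some fields that do not match.
The query I am currently using is: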
SELECT DISTINCT MainTable.[lineitemid]
FROM [dbo].[lineitem] MainTable
INNER JOIN [dbo].[lineitem] AS ChildTable
ON ChildTable.invoicedate = MainTable.invoicedate
AND LEFT(ChildTable.vendorname, 4) = LEFT(MainTable.vendorname, 4)
AND ChildTable.invoiceid <> MainTable.invoiceid AND -- Invoice ID column not matching
ChildTable.documentcurrencyamount = MainTable.documentcurrencyamount
WHERE ChildTable.lineitemid <> MainTable.lineitemid AND -- LineItemId is PK
MainTable.projectid = 1125 AND ChildTable.projectid = 1125 -- Duplicates should be identified with specific ProjectId
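This query works fine when a ProjectId has fewer than 100,000 records. When a ProjectId has more than 1 million records, tempdb grows to 100 GB while the query runs and the disk runs out of space. The query takes forever to finish.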
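The matching rule can also be stated per group instead of per pair of rows: a line item has a duplicate when its group (same invoicedate, same first four characters of vendorname, same documentcurrencyamount, within one project) contains at least two different invoiceid values. Below is a minimal, untested sketch of that idea using window functions. It assumes invoiceid has a type that MIN and MAX accept (for example int or varchar), and the aliases `li`, `t`, `min_invoiceid`, and `max_invoiceid` are only illustrative names. Whether it actually reduces tempdb usage on this table would have to be checked against the execution plan.

```sql
-- Sketch only; not tested against the real table.
-- Assumption: invoiceid is a type that MIN/MAX accept (e.g. int or varchar).
SELECT t.lineitemid
FROM (
    SELECT li.lineitemid,
           MIN(li.invoiceid) OVER (PARTITION BY li.invoicedate,
                                                LEFT(li.vendorname, 4),
                                                li.documentcurrencyamount) AS min_invoiceid,
           MAX(li.invoiceid) OVER (PARTITION BY li.invoicedate,
                                                LEFT(li.vendorname, 4),
                                                li.documentcurrencyamount) AS max_invoiceid
    FROM [dbo].[lineitem] AS li
    WHERE li.projectid = 1125
      -- The self-join compares with = and <>, which never match NULLs,
      -- so rows with NULL in any compared column are excluded here as well.
      AND li.invoiceid IS NOT NULL
      AND li.invoicedate IS NOT NULL
      AND li.vendorname IS NOT NULL
      AND li.documentcurrencyamount IS NOT NULL
) AS t
-- A group holds at least two different invoice IDs exactly when min <> max.
WHERE t.min_invoiceid <> t.max_invoiceid;
```

Each row is read once and appears at most once in the result, so no DISTINCT is needed.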
Please help me optimize the query.
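The following lines were added after I received an answer to the query above...
Thank you very much, @Gordon-Linoff. The query you suggested runs much faster. VendorName comes from a different table. Can I include an inner join as shown below?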
SELECT li1.[LineItemId]
FROM [dbo].[LineItem] li1
INNER JOIN VendorMaster vm1 ON li1.VendorNumber=vm1.VendorNumber
AND vm1.CompanyCode = li1.CompanyCode
WHERE EXISTS (SELECT 1
FROM [dbo].[LineItem] as li2
INNER JOIN VendorMaster vm2 on li2.VendorNumber = vm2.VendorNumber
AND vm2.CompanyCode = li2.CompanyCode
WHERE li2.InvoiceDate = li1.InvoiceDate and
LEFT(vm2.VendorName, 4) = LEFT(vm1.VendorName, 4) and -- VendorName comes from VendorMaster
li2.InvoiceId <> li1.InvoiceId and -- Invoice ID column not matching
li2.DocumentCurrencyAmount = li1.DocumentCurrencyAmount and
li2.LineItemId <> li1.LineItemId and
li2.ProjectId = li1.ProjectId and
li2.VendorNumber = li1.VendorNumber)
AND li1.ProjectId = 1125
Is this a valid approach?