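I have the following T-SQL stored procedure, which currently accounts for 50% of the total time needed to run all of our processes on newly imported records into our backend analysis suite. Unfortunately, this data has to be imported every time, and it is becoming a bottleneck as our database grows.
Basically, we are trying to identify all duplicates among the records and keep only one of each.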
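DECLARE @status INT
SET @status = 3  -- status assigned to rows flagged as duplicates
DECLARE @contactid INT
DECLARE @email VARCHAR(100)

-- Contacts: the cursor yields one row per email that occurs more than once.
-- @reference is not declared in this excerpt; presumably it is a parameter of the procedure.
DECLARE email_cursor CURSOR FOR
    SELECT email
    FROM contacts
    WHERE reference = @reference AND status = 1
    GROUP BY email
    HAVING COUNT(email) > 1
OPEN email_cursor
FETCH NEXT FROM email_cursor INTO @email
WHILE @@FETCH_STATUS = 0
BEGIN
    PRINT @email
    -- Flag every active row for this email as a duplicate.
    UPDATE contacts SET duplicate = 1, status = @status WHERE email = @email AND reference = @reference AND status = 1
    -- Pick one flagged row to keep (no ORDER BY, so the choice is arbitrary).
    SELECT TOP 1 @contactid = id FROM contacts WHERE reference = @reference AND email = @email AND duplicate = 1
    -- Restore the kept row to active, non-duplicate.
    UPDATE contacts SET duplicate = 0, status = 1 WHERE id = @contactid
    FETCH NEXT FROM email_cursor INTO @email
END
CLOSE email_cursor
DEALLOCATE email_cursor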
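I have added every index that the query execution plan suggests, but it may be possible to rewrite the whole SP to work differently, as I have managed to do with others.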
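For reference, one set-based direction such a rewrite might take is sketched below. It is only a sketch under stated assumptions, not a tested replacement: it assumes SQL Server 2005 or later (for window functions), that `contacts.id` is unique, and that keeping the row with the lowest `id` is acceptable (the cursor version keeps an arbitrary row). The `@reference` declaration, including its type and value, is a hypothetical stand-in for the procedure's parameter.

```sql
-- Hypothetical stand-in for the procedure's parameter; type and value are assumed.
DECLARE @reference INT
SET @reference = 1

DECLARE @status INT
SET @status = 3  -- status assigned to rows flagged as duplicates

-- Rank the active rows within each email group in a single pass.
;WITH ranked AS (
    SELECT duplicate,
           status,
           ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS rn,
           -- COUNT(email) ignores NULL emails, matching HAVING COUNT(email) > 1
           COUNT(email) OVER (PARTITION BY email) AS cnt
    FROM contacts
    WHERE reference = @reference
      AND status = 1
)
-- Updating through the CTE writes to the underlying contacts rows.
UPDATE ranked
SET duplicate = CASE WHEN rn = 1 THEN 0 ELSE 1 END,
    status    = CASE WHEN rn = 1 THEN 1 ELSE @status END
WHERE cnt > 1
```

The idea is to replace the per-email loop (three statements per duplicated email) with one statement over the whole batch. Whether it is actually faster here would need to be confirmed against the real data and execution plan.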