Table of People (name, dob, ssn etc)
Table of NewRecords (name, dob, ssn)
I want to write a query that determines which NewRecords do not in any way match any of the People (it's an update query that sets a flag in the NewRecords table).
Specifically, I want to find the NewRecords for which the Levenshtein distance between the first name, last name and ssn is greater than 2 for all records in People. (i.e. the person has a first, last and ssn that are all different from those in People and so is likely not a match).
I added a user-defined Levensthein function Levenshtein distance in T-SQL and have already added an optimization that adds an extra parameter for maximal allowed distance. (If the calculated levenstein climbs above the max allowed, the function exits early). But the query is still taking an unacceptably long time because the tables are large.
What can I do to speed things up? How do I start thinking about optimization and performance? At what point do I need to start digging into the internals of sql server?
update NewRecords
set notmatchflag=true
from
newRecords elr
inner join People pro
on
dbo.Levenshtein_withThreshold(elr.last,pro.last,2)>2 and
dbo.Levenshtein_withThreshold(elr.first,pro.first,2)>2 and
elr.ssn is null and
elr.dob<>pro.dob