I am working with a CSV file of addresses that get uploaded to an FTP site on a daily basis. The address records consist of Address_Line1, Address_Line2, City, State, Zip_Code, and Country. There are multiple duplicate addresses in the CSV. My task is to compare all of the addresses in the CSV file to an existing address dimension, then only load new addresses. To do this, I am first loading from the CSV into a staging table, then running the following query to generate a hash in order to do the comparison:
UPDATE STG_ADDRESS
SET ADDRESS_HASH = HASHBYTES(
'SHA1'
,ISNULL(ADDRESS_LINE1, 'N/A') +
ISNULL(ADDRESS_LINE2, 'N/A') +
ISNULL(CITY, 'N/A') +
ISNULL(STATE, 'N/A') +
ISNULL(ZIP_CODE, 'N/A') +
ISNULL(COUNTRY, 'N/A'));
This is working fine, with one exception. The HASHBYTES function is generating multiple hashes for the same exact address. For today's upload, I ran the following query and got 37 distinct addresses:
SELECT DISTINCT
ISNULL(ADDRESS_LINE1, 'N/A')
+ ISNULL(ADDRESS_LINE2, 'N/A')
+ ISNULL(CITY, 'N/A')
+ ISNULL(STATE, 'N/A')
+ ISNULL(ZIP_CODE, 'N/A')
+ ISNULL(COUNTRY, 'N/A')
FROM STG_ADDRESS
After updating with the Hash, I ran the following query and got 43 records:
SELECT DISTINCT
ISNULL(ADDRESS_LINE1, 'N/A')
+ ISNULL(ADDRESS_LINE2, 'N/A')
+ ISNULL(CITY, 'N/A')
+ ISNULL(STATE, 'N/A')
+ ISNULL(ZIP_CODE, 'N/A')
+ ISNULL(COUNTRY, 'N/A')
,ADDRESS_HASH
FROM STG_ADDRESS
I double checked this with the following query:
SELECT DISTINCT
ISNULL(ADDRESS_LINE1, 'N/A')
+ ISNULL(ADDRESS_LINE2, 'N/A')
+ ISNULL(CITY, 'N/A')
+ ISNULL(STATE, 'N/A')
+ ISNULL(ZIP_CODE, 'N/A')
+ ISNULL(COUNTRY, 'N/A')
,COUNT(ADDRESS_HASH)
FROM STG_ADDRESS
GROUP BY
ISNULL(ADDRESS_LINE1, 'N/A')
+ ISNULL(ADDRESS_LINE2, 'N/A')
+ ISNULL(CITY, 'N/A')
+ ISNULL(STATE, 'N/A')
+ ISNULL(ZIP_CODE, 'N/A')
+ ISNULL(COUNTRY, 'N/A')
HAVING COUNT(ADDRESS_HASH) > 1
And saw that there are six addresses that SQL server considers the same when running a SELECT DISTINCT, but considers different somehow when creating the hash.
Is there any scenario where the same string could result in a different hash being created? If so, what can be done to correct the issue?