1

I am working with a CSV file of addresses that get uploaded to an FTP site on a daily basis. The address records consist of Address_Line1, Address_Line2, City, State, Zip_Code, and Country. There are multiple duplicate addresses in the CSV. My task is to compare all of the addresses in the CSV file to an existing address dimension, then only load new addresses. To do this, I am first loading from the CSV into a staging table, then running the following query to generate a hash in order to do the comparison:

UPDATE STG_ADDRESS
SET ADDRESS_HASH = HASHBYTES(
    'SHA1'
    ,ISNULL(ADDRESS_LINE1, 'N/A') + 
    ISNULL(ADDRESS_LINE2, 'N/A') +
    ISNULL(CITY, 'N/A') +
    ISNULL(STATE, 'N/A') +
    ISNULL(ZIP_CODE, 'N/A') + 
    ISNULL(COUNTRY, 'N/A'));

This is working fine, with one exception. The HASHBYTES function is generating multiple hashes for the same exact address. For today's upload, I ran the following query and got 37 distinct addresses:

SELECT DISTINCT 
    ISNULL(ADDRESS_LINE1, 'N/A') 
    + ISNULL(ADDRESS_LINE2, 'N/A')
    + ISNULL(CITY, 'N/A')
    + ISNULL(STATE, 'N/A')
    + ISNULL(ZIP_CODE, 'N/A') 
    + ISNULL(COUNTRY, 'N/A')

FROM STG_ADDRESS

After updating with the Hash, I ran the following query and got 43 records:

SELECT DISTINCT 
    ISNULL(ADDRESS_LINE1, 'N/A') 
    + ISNULL(ADDRESS_LINE2, 'N/A')
    + ISNULL(CITY, 'N/A')
    + ISNULL(STATE, 'N/A')
    + ISNULL(ZIP_CODE, 'N/A') 
    + ISNULL(COUNTRY, 'N/A')
    ,ADDRESS_HASH

FROM STG_ADDRESS

I double checked this with the following query:

SELECT DISTINCT 
    ISNULL(ADDRESS_LINE1, 'N/A') 
    + ISNULL(ADDRESS_LINE2, 'N/A')
    + ISNULL(CITY, 'N/A')
    + ISNULL(STATE, 'N/A')
    + ISNULL(ZIP_CODE, 'N/A') 
    + ISNULL(COUNTRY, 'N/A')
    ,COUNT(ADDRESS_HASH)

FROM STG_ADDRESS

GROUP BY
    ISNULL(ADDRESS_LINE1, 'N/A') 
    + ISNULL(ADDRESS_LINE2, 'N/A')
    + ISNULL(CITY, 'N/A')
    + ISNULL(STATE, 'N/A')
    + ISNULL(ZIP_CODE, 'N/A') 
    + ISNULL(COUNTRY, 'N/A')

HAVING COUNT(ADDRESS_HASH) > 1

And saw that there are six addresses that SQL server considers the same when running a SELECT DISTINCT, but considers different somehow when creating the hash.

Is there any scenario where the same string could result in a different hash being created? If so, what can be done to correct the issue?

4

1 回答 1

3

在调用 HASHBYTES 之前应该对空格和大小写进行规范化,因为它始终区分大小写。默认情况下,正常 SQL 操作期间的字符串比较不区分大小写(您可以使用 COLLATION 服务器设置进行修改)。

LTRIM(RTRIM(TOLOWER(@value)))
于 2015-03-20T16:05:57.740 回答