
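I have a web application where users can upload Excel files containing destinations. After a file is uploaded, I read the rows and insert them into a SQL Server database.
On SQL Server, I have to match each uploaded destination against a list of destinations in a table. Because the list in the database is the reference, the matching must be accurate.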

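Here is an example of a destination from the database and one uploaded by a user (the two must match):

  • From the database: United Kingdom - Mobile - O2
  • Uploaded by the user: United Kingdom - O2 Mobile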

What is the best way to make the matching more accurate?


3 Answers


I don't think this problem can be solved with T-SQL alone. Unfortunately, T-SQL has no good built-in algorithms for fuzzy matching. SOUNDEX is not very relevant to this problem, and neither is full-text search.

I would recommend a very good library written in C#: http://anastasiosyal.com/post/2009/01/11/Beyond-SoundEx-Functions-for-Fuzzy-Searching-in-MS-SQL-Server. It implements many string metric algorithms, and they can be imported as CLR functions in SQL Server. It can have performance issues with large amounts of data.
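To make "string metric" concrete, here is a minimal sketch of one common metric, the Levenshtein edit distance, written in Python rather than C# or T-SQL. It is only an illustration: the function names `levenshtein` and `similarity` are my own, and I am not claiming the linked library exposes this metric under these names.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the rows as short as possible
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale the distance to the range 0..1, where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```

A character-level metric like this penalises reordered words heavily, so for destinations such as "O2 Mobile" versus "Mobile - O2" it would probably need to be combined with some word-level normalisation.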

Especially because you are importing data, I can also recommend creating an SSIS package. In a package you can use the Fuzzy Lookup transformation to identify similarities: http://msdn.microsoft.com/en-us/magazine/cc163731.aspx. I use it to identify duplicates, based on similarity, in a table with more than 1 million records. In both cases you will have to run some tests to determine the similarity percentage that gives accurate matching for your business case.
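The idea of a similarity threshold can be tried outside SSIS before committing to a value. The sketch below is a stand-in written in Python with the standard `difflib` module, not the Fuzzy Lookup transformation itself. The helper names `normalise` and `best_match`, the token-sorting step, and the default threshold of 0.8 are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def normalise(text: str) -> str:
    """Lower-case, treat hyphens as spaces, and sort the words alphabetically."""
    return " ".join(sorted(text.lower().replace("-", " ").split()))


def best_match(candidate: str, references: list, threshold: float = 0.8):
    """Return (reference, score) for the closest reference, or None below threshold."""
    key = normalise(candidate)
    scored = [(SequenceMatcher(None, key, normalise(ref)).ratio(), ref)
              for ref in references]
    if not scored:
        return None
    score, ref = max(scored)  # highest similarity ratio wins
    return (ref, score) if score >= threshold else None


references = ["United Kingdom - Mobile - O2", "France - Mobile - Orange"]
print(best_match("United Kingdom - O2 Mobile", references))
```

Running this over a sample of real uploads with a few different thresholds is one way to see where false matches start to appear.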

answered 2013-06-05T12:13:35.210

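I have solved many problems like this. Split the database data into the relevant columns (country, device, brand) in a temporary table. Before importing into the database, split the user's input data (Excel) into the same columns (country, device, brand). Then import the Excel data into a temporary table. You can then tune the matching as you like.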
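As a rough sketch of the splitting step, here is how it might look in Python before the rows reach the database. It assumes the country is always the part before the first " - " and that device types come from a small known vocabulary. `DEVICE_TYPES` and `split_destination` are hypothetical names, and the vocabulary is made up for the example.

```python
# Hypothetical vocabulary of device types; a real list would come from the reference data.
DEVICE_TYPES = {"mobile", "fixed"}


def split_destination(text: str) -> tuple:
    """Split a destination string into (country, device, brand), all lower-cased."""
    text = " ".join(text.split())            # collapse irregular whitespace
    country, _, rest = text.partition(" - ")  # assumption: country comes first
    words = rest.replace("-", " ").lower().split()
    device = " ".join(w for w in words if w in DEVICE_TYPES)
    brand = " ".join(w for w in words if w not in DEVICE_TYPES)
    return country.lower(), device, brand


print(split_destination("United Kingdom - Mobile - O2"))
print(split_destination("United Kingdom - O2 Mobile"))
```

Both example strings from the question end up with the same three parts, so they can then be compared column by column.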

answered 2013-06-10T12:13:44.080

You need to define a matching algorithm. If it works by counting matching words, regardless of the order in which they appear, here it is:

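-- Reference destinations (stand-in for the real lookup table)
declare @t table(field varchar(200))
insert into @t values('United Kingdom - Mobile - O2')
-- Uploaded value, with irregular spacing and one extra word
declare @upload varchar(200) = ' United   Kingdom  -  O2    Mobile noise'

-- Let's find matching words, no matter in what order they are!
-- Characters to treat as separators: CR, LF, TAB, hyphen, period, comma
declare @IgnoreChars varchar(50) = char(13)+char(10)+char(9)+'-.,'
select t.field,
    MatchedWords = SUM(CASE WHEN m.WordFoundAt=0 THEN 0 ELSE 1 END),
    TotalWords = COUNT(*)
from @t t
    -- replace every ignored character with a space, then split on spaces
    CROSS APPLY dbo.str_split(dbo.str_translate(@upload, @IgnoreChars, REPLICATE(' ', LEN(@IgnoreChars))), ' ') w
    -- CHARINDEX returns 0 when the word does not occur in the reference value
    OUTER APPLY (SELECT WordFoundAt = CHARINDEX(w.id, t.field)) m
where w.id <> ''  -- skip empty tokens produced by repeated spaces
group by t.field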

Result:

field                          MatchedWords  TotalWords
United Kingdom - Mobile - O2   4             5

The functions str_translate and str_split are not built in, but I don't know how to post them here because attachments are not allowed.
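Since the two helper functions are not shown, here is a sketch of the same word-counting idea in Python, where the standard `str.translate` and `str.split` can play their roles. This assumes str_translate maps each ignored character to a space and str_split breaks the string on spaces, which is how the query appears to use them. `count_matched_words` is a name invented for this example.

```python
IGNORE_CHARS = "\r\n\t-.,"  # same characters as @IgnoreChars in the query


def count_matched_words(field: str, upload: str) -> tuple:
    """Return (matched_words, total_words) for the uploaded string against one field."""
    # Map every ignored character to a space (assumed behaviour of str_translate).
    table = str.maketrans(IGNORE_CHARS, " " * len(IGNORE_CHARS))
    # split() with no argument splits on whitespace runs and drops empty strings.
    words = upload.translate(table).split()
    # Substring test, like CHARINDEX(word, field) <> 0, but case-sensitive.
    matched = sum(1 for word in words if word in field)
    return matched, len(words)


print(count_matched_words("United Kingdom - Mobile - O2",
                          " United   Kingdom  -  O2    Mobile noise"))
# prints (4, 5)
```

One difference to keep in mind: Python's `in` is case-sensitive, whereas CHARINDEX follows the collation of its input, which is often case-insensitive on SQL Server.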

answered 2013-06-10T11:39:28.720