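Is it possible to search for similar words in SQL Server 2008?

If the user types: Ayrton Sena

with a single 'n', the search should still return the row for Ayrton Senna, spelled with two ('nn').

I think the same approach would apply to spell-checking words.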

4 Answers


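Since "Senna" is not an inflection of "Sena", it is hard to solve this task with a full-text index alone.

I would suggest using a combination of full-text search and string similarity to decide whether two strings count as "equal".

So if you search for several words and allow one of them to be misspelled, use something like this: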

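-- The full-text lookup narrows the candidates; the similarity check then filters them.
-- dbo.MyExternalStringSimilarity is a placeholder for a function you supply yourself.
select * 
  from myTable t 
       join FREETEXTTABLE(myTable, TextField, 'Ayrton Senna') f 
         on f.[KEY] = t.PK
 where dbo.MyExternalStringSimilarity('Ayrton Senna', t.TextField) > 0.9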

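Now all you need is a string similarity function. You can use the "Similarity" function from Microsoft Data Quality Services or write your own.

Look up Jaro-Winkler, Levenshtein, the Dice coefficient, and so on. These are good algorithms for comparing strings by similarity.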
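
To make the "write your own" option concrete, here is a minimal sketch of a Levenshtein-based similarity score in Python. It is only an illustration of the algorithm, not the function the query above calls; the names `levenshtein` and `similarity` are made up for this example, and normalizing by the longer string's length is one common convention among several.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale the distance to a score between 0.0 and 1.0 (1.0 means identical)."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


print(similarity("Ayrton Sena", "Ayrton Senna"))
```

The two spellings differ by one inserted letter across 12 characters, so the score is 11/12 (about 0.917), which just clears a 0.9 threshold like the one used in the query.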

Of course, you could also scan the whole table using

select *
 from myTable t
 where dbo.MyExternalStringSimilarity('Ayrton Senna', t.TextField)>0.9

but this may take a long time to execute.

EDIT: We are currently using the first approach to find all similar spellings of names. It works quite well.
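
The reason the first approach is faster can be sketched outside the database: a cheap filter cuts down the rows before the expensive comparison runs. This is a rough Python illustration under assumptions of my own. `ROWS`, `candidates`, and `fuzzy_search` are invented names, a shared-word test stands in for the full-text lookup, and `difflib.SequenceMatcher` from the standard library stands in for the similarity function.

```python
from difflib import SequenceMatcher

# Hypothetical table contents, for illustration only.
ROWS = ["Ayrton Senna", "Ayrton Senna da Silva", "Alain Prost", "Bruno Senna"]


def candidates(query, rows):
    """Cheap first pass: keep rows that share at least one whole word with the query."""
    words = set(query.lower().split())
    return [row for row in rows if words & set(row.lower().split())]


def fuzzy_search(query, rows, threshold=0.9):
    """Second pass: run the costly similarity check on the surviving candidates only."""
    return [row for row in candidates(query, rows)
            if SequenceMatcher(None, query.lower(), row.lower()).ratio() > threshold]


print(fuzzy_search("Ayrton Sena", ROWS))
```

Only rows containing "Ayrton" reach the second pass here, which mirrors how the join against the full-text result keeps the similarity function from being evaluated on every row.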

answered 2013-01-26T14:13:03.947

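I have been working on a similar problem and stumbled upon the "Metaphone" algorithm, created in 1990. It is essentially a more accurate version of Soundex and can be used to identify words that sound alike. It ships as a built-in function in some programming languages. Here is a SQL Server equivalent by "Phil Factor" that we have been using with some success. It was reverse-engineered from PHP's built-in metaphone function.

I have pasted a reformatted version below, which I find easier to read.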

IF  OBJECT_ID('Utils.Metaphone','FN') IS NOT NULL --drop any existing metaphone function
   DROP FUNCTION Utils.Metaphone
GO
CREATE FUNCTION Utils.Metaphone
(
    @String VARCHAR(30)
)
RETURNS VARCHAR(10)
AS
BEGIN
DECLARE @New BIT
        ,@ii INT
        ,@Metaphone VARCHAR(28)
        ,@Len INT
        ,@Where INT;
DECLARE @This CHAR
        ,@Next CHAR
        ,@Following CHAR
        ,@Previous CHAR
        ,@Silent BIT;

SELECT  @String = UPPER(LTRIM(COALESCE(@String, ''))); --trim and upper case
SELECT  @Where = PATINDEX ('%[^A-Z]%', @String COLLATE Latin1_General_CI_AI ) 
WHILE   @Where > 0 --strip out all non-alphabetic characters!
BEGIN
    SELECT @String = STUFF(@string, @Where, 1, '')
    SELECT @Where = PATINDEX ('%[^A-Z]%',@String COLLATE Latin1_General_CI_AI ) 
END
IF  (LEN(@String) < 2) RETURN  @String

--do the start of string stuff first.
--If the word begins with 'KN', 'GN', 'PN', 'AE', 'WR', drop the first letter.
-- "Aebersold", "Gnagy", "Knuth", "Pniewski", "Wright"
IF  SUBSTRING(@String, 1, 2) IN ( 'KN', 'GN', 'PN', 'AE', 'WR' )
    SELECT @String = STUFF(@String, 1, 1, '');
-- Beginning of word: "x" change to "s" as in "Deng Xiaopeng"
IF  SUBSTRING(@String, 1, 1) = 'X'
    SELECT @String = STUFF(@String, 1, 1, 'S');
-- Beginning of word: "wh-" change to "w" as in "Whatsoever"
IF  @String LIKE 'WH%'
    SELECT @String = STUFF(@String, 1, 2, 'W'); -- replace both characters of 'WH' with 'W'
-- Set up for While loop 
SELECT  @Len = LEN(@String), @Metaphone = '' -- Initialize the main variable 
        ,@New = 1 -- this variable only used next 10 lines!!!
        ,@ii = 1; --Position counter
--
WHILE((LEN(@Metaphone) <= 8) AND (@ii <= @Len))
BEGIN --SET up the 'pointers' for this loop-around }
    SELECT  @Previous = CASE WHEN @ii > 1 THEN SUBSTRING(@String, @ii - 1, 1) ELSE '' END
            -- originally a nul terminated string }
            ,@This = SUBSTRING(@String, @ii, 1)
            ,@Next = CASE WHEN @ii < @Len THEN SUBSTRING(@String, @ii + 1, 1) ELSE '' END
            ,@Following = CASE WHEN((@ii + 1) < @Len) THEN SUBSTRING(@String, @ii + 2, 1) ELSE '' END

    -- 'CC' inside word
    /* Drop duplicate adjacent letters, except for C.*/
    IF  @This=@Previous AND @This<> 'C' 
    BEGIN
        --we do nothing 
        SELECT  @New=0
    END
    /*Drop all vowels unless it is the beginning.*/
    ELSE IF @This IN ( 'A', 'E', 'I', 'O', 'U' )
    BEGIN
        IF  @ii = 1 --vowel at the beginning
            SELECT @Metaphone = @This;
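    END;
    /* B -> B unless at the end of word after "m", as in "dumb", "Comb" */
    ELSE IF @This = 'B'
    BEGIN
        -- Test the -mb case inside the branch so a silent B does not fall through to the error branch
        IF  NOT ((@ii = @Len) AND (@Previous = 'M'))
            SELECT @Metaphone = @Metaphone + 'B';
        -- else silent: -mb at the end of a word
    END;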
    /*'C' transforms to 'X' if followed by 'IA' or 'H' (unless in latter case, it is part of '-SCH-',
    in which case it transforms to 'K'). 'C' transforms to 'S' if followed by 'I', 'E', or 'Y'. 
    Otherwise, 'C' transforms to 'K'.*/
    ELSE IF @This = 'C'
    BEGIN -- -sce, i, y = silent 
        IF NOT (@Previous= 'S') AND (@Next IN ( 'H', 'E', 'I', 'Y' )) --front vowel set 
        BEGIN
            IF  (@Next = 'I') AND (@Following = 'A')
                SELECT @Metaphone = @Metaphone + 'X'; -- -cia- 
            ELSE IF(@Next IN ( 'E', 'I', 'Y' ))
                SELECT @Metaphone = @Metaphone + 'S'; -- -ce, i, y = 'S' }
            ELSE IF(@Next = 'H') AND (@Previous = 'S')
                SELECT @Metaphone = @Metaphone + 'K'; -- -sch- = 'K' }
            ELSE IF(@Next = 'H')
            BEGIN
                IF(@ii = 1) AND ((@ii + 2) <= @Len) AND NOT(@Following IN ( 'A', 'E', 'I', 'O', 'U' ))
                    SELECT @Metaphone = @Metaphone + 'K';
                ELSE
                    SELECT @Metaphone = @Metaphone + 'X';
            END
        END
        ELSE 
            SELECT @Metaphone = @Metaphone +CASE WHEN @Previous= 'S' THEN '' ELSE 'K' END;
        -- Else silent 
    END; -- Case C }
    /*'D' transforms to 'J' if followed by 'GE', 'GY', or 'GI'. Otherwise, 'D' 
    transforms to 'T'.*/
    ELSE IF @This = 'D'
    BEGIN
        SELECT @Metaphone = @Metaphone
                + CASE WHEN(@Next = 'G') AND (@Following IN ( 'E', 'I', 'Y' )) --front vowel set 
                        THEN 'J'
                        ELSE 'T'
                  END;
    END;
    ELSE IF @This = 'G'
    /*Drop 'G' if followed by 'H' and 'H' is not at the end or before a vowel. Drop 'G' 
    if followed by 'N' or 'NED' and is at the end.
    'G' transforms to 'J' if before 'I', 'E', or 'Y', and it is not in 'GG'. 
    Otherwise, 'G' transforms to 'K'.*/
    BEGIN
        SELECT @Silent = CASE WHEN (@Next = 'H')
                                    AND (@Following IN ('A','E','I','O','U'))
                                    AND (@ii > 1)
                                    AND (
                                                ((@ii+1) = @Len)
                                            OR 
                                                (
                                                    (@Next = 'N')
                                                    AND (@Following = 'E')
                                                    AND SUBSTRING(@String,@ii+3,1) = 'D'
                                                )
                                            AND ((@ii+3) = @Len)
                                        )
                                    -- Terminal -gned 
                                    AND (@Previous = 'I')
                                    AND (@Next = 'N')
                                THEN 1 
                                -- if not start and near -end or -gned.) 
                              WHEN (@ii > 1)
                                    AND (@Previous = 'D')-- gnuw
                                    AND (@Next IN ('E','I','Y')) --front vowel set 
                                THEN 1 -- -dge, i, y 
                              ELSE 0
                          END
        IF  NOT(@Silent=1)
            SELECT @Metaphone = @Metaphone
                    + CASE WHEN (@Next IN ('E','I','Y')) --front vowel set 
                            THEN 'J'
                            ELSE 'K'
                      END
    END
    /*Drop 'H' if after vowel and not before a vowel.
    or the second char of  "-ch-", "-sh-", "-ph-", "-th-", "-gh-"*/

    ELSE IF @This = 'H'
    BEGIN
        IF  NOT( (@ii= @Len) OR (@Previous IN  ( 'C', 'S', 'T', 'G' ))) 
            AND (@Next IN ( 'A', 'E', 'I', 'O', 'U' ) )
            SELECT @Metaphone = @Metaphone + 'H';
            -- else silent (vowel follows) }
    END;
    ELSE IF @This IN --some get no substitution
                ( 'F', 'J', 'L', 'M', 'N', 'R' )
    BEGIN
        SELECT  @Metaphone = @Metaphone + @This;
    END;
    /*'CK' transforms to 'K'.*/
    ELSE IF @This = 'K'
    BEGIN
        IF  (@Previous <> 'C')
            SELECT @Metaphone = @Metaphone + 'K';
    END;
    /*'PH' transforms to 'F'.*/
    ELSE IF @This = 'P'
    BEGIN
        IF(@Next = 'H')
            SELECT @Metaphone = @Metaphone + 'F', @ii = @ii + 1;
        -- Skip the 'H' 
        ELSE
            SELECT @Metaphone = @Metaphone + 'P';
    END;
    /*'Q' transforms to 'K'.*/
    ELSE IF @This = 'Q'
    BEGIN
        SELECT @Metaphone = @Metaphone + 'K';
    END;
    /*'S' transforms to 'X' if followed by 'H', 'IO', or 'IA'.*/
    ELSE IF @This = 'S'
    BEGIN
        SELECT @Metaphone = @Metaphone
                + CASE WHEN (@Next = 'H')
                            OR
                                (
                                    (@ii> 1)
                                    AND (@Next = 'I') 
                                    AND (@Following IN ( 'O', 'A' ) )
                                )
                        THEN 'X'
                        ELSE 'S'
                  END;
    END;
    /*'T' transforms to 'X' if followed by 'IA' or 'IO'. 'TH' transforms 
    to '0'. Drop 'T' if followed by 'CH'.*/
    ELSE IF @This = 'T'
    BEGIN
        SELECT @Metaphone = @Metaphone
                + CASE WHEN (@ii = 1)
                            AND (@Next = 'H')
                            AND (@Following = 'O') 
                        THEN 'T' -- Initial Tho- }
                       WHEN (@ii > 1)
                            AND (@Next = 'I') 
                            AND (@Following IN ( 'O', 'A' )) 
                        THEN 'X'
                       WHEN (@Next = 'H')
                        THEN '0'
                       WHEN NOT((@Next = 'C')
                                AND (@Following = 'H')) 
                        THEN 'T'
                       ELSE ''
                  END;
                -- -tch = silent }
    END;
    /*'V' transforms to 'F'.*/
    ELSE IF @This = 'V'
    BEGIN
        SELECT @Metaphone = @Metaphone + 'F';
    END;
    /*'WH' transforms to 'W' if at the beginning. Drop 'W' if not followed by a vowel.*/
    /*Drop 'Y' if not followed by a vowel.*/
    ELSE IF @This IN ( 'W', 'Y' )
    BEGIN
        IF @Next IN ( 'A', 'E', 'I', 'O', 'U' )
            SELECT @Metaphone = @Metaphone + @This;
        --else silent 
    END;
    /*'X' transforms to 'S' if at the beginning (handled before the loop). Otherwise, 'X' transforms to 'KS'.*/
    ELSE IF @This = 'X'
    BEGIN
        SELECT @Metaphone = @Metaphone + 'KS';
    END;
    /*'Z' transforms to 'S'.*/
    ELSE IF @This = 'Z'
    BEGIN
        SELECT @Metaphone = @Metaphone + 'S';
    END;
    ELSE
        RETURN 'error with '''+ @This+ '''';
    -- end
    SELECT @ii = @ii + 1;
END; -- While 
RETURN @Metaphone 
END

The following tests all produce the same result.

SELECT  Utils.Metaphone('Aryton Sena')
        ,Utils.Metaphone('Aryton Senna')
        ,Utils.Metaphone('Ayrton Senna')
        ,Utils.Metaphone('Ayrton Sena')
        ,Utils.Metaphone('Aryten Sena')
        ,Utils.Metaphone('Aryten Senna')
        ,Utils.Metaphone('Ayrten Senna')
        ,Utils.Metaphone('Ayrten Sena');

Result:

ARTNSN
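
A phonetic key like this is usually stored next to each name so that lookups become exact matches on the key. The sketch below shows that idea in Python, with two caveats: it uses Soundex (the simpler predecessor mentioned above) rather than Metaphone to stay short, and it keys each word separately, whereas the function above strips spaces and encodes the whole string. The names `soundex`, `phonetic_key`, and `index` are made up for this example.

```python
from collections import defaultdict

# Soundex digit for each consonant; vowels, H, W and Y carry no digit.
CODES = {c: d for d, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                                 "4": "L", "5": "MN", "6": "R"}.items() for c in letters}


def soundex(word: str) -> str:
    """American Soundex: first letter plus up to three digits, padded with zeros."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    last_code = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters that share a code
        code = CODES.get(c, "")
        if code and code != last_code:
            result += code
        last_code = code  # a vowel resets this, so a code may repeat across it
    return (result + "000")[:4]


def phonetic_key(name: str) -> str:
    """Key a full name word by word."""
    return " ".join(soundex(w) for w in name.split())


# Build the key -> names index once; each lookup is then a dictionary hit.
index = defaultdict(list)
for name in ["Ayrton Senna", "Alain Prost", "Nigel Mansell"]:
    index[phonetic_key(name)].append(name)

print(index[phonetic_key("Aryton Sena")])
```

In SQL Server the same pattern would mean storing the key in an indexed column and comparing keys with a plain equality test, so the function runs once per row at write time instead of once per row per search.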
answered 2020-06-26T09:00:30.017

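Spell checkers typically work by looking words up in a dictionary. If your word exactly matches a dictionary entry, it is spelled correctly. If not, the closest match is found and suggested as a replacement. Some spell checkers also hold alternative spellings or common misspellings, but that does not fundamentally change how they work.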
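
That lookup-then-suggest flow can be sketched in a few lines of Python using `difflib.get_close_matches` from the standard library. The `DICTIONARY` list and the `check` function are hypothetical, and difflib's ratio is just one possible measure of "closest".

```python
from difflib import get_close_matches

# Hypothetical dictionary of known-good entries.
DICTIONARY = ["Ayrton Senna", "Alain Prost", "Nigel Mansell", "Nelson Piquet"]


def check(word, dictionary=DICTIONARY):
    """Return (is_correct, suggestions) for a single entry."""
    if word in dictionary:
        return True, []  # exact match: spelled correctly
    # Otherwise propose up to three of the closest entries above the cutoff.
    return False, get_close_matches(word, dictionary, n=3, cutoff=0.6)


print(check("Ayrton Senna"))
print(check("Ayrton Sena"))
```

The second call finds no exact match and falls back to the nearest entry, which is the behaviour the question asks for.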

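Jaro-Winkler is a distance measure: it scores how far apart two words are, based on the characters they share and how many of those are transposed, with extra weight for a common prefix. Jaro is often used for matching people's names because that is what it is good at. It can be used for more general matching too, but watch out for abbreviations and the like, since they confuse it.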
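
For reference, here is a compact Python sketch of the textbook Jaro and Jaro-Winkler definitions. It is not the .NET implementation mentioned below; the prefix scale of 0.1 and the four-character prefix cap are the conventional defaults, assumed here rather than taken from this answer.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: 1.0 for identical strings, 0.0 for nothing in common."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)  # how far apart two matches may sit
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Walk both strings' matched characters in order; mismatches are half-transpositions.
    k = 0
    half_transpositions = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                half_transpositions += 1
            k += 1
    t = half_transpositions / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Boost the Jaro score for strings that share a prefix of up to four characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


print(jaro_winkler("Ayrton Sena", "Ayrton Senna"))
```

Because the two spellings share all of the shorter string's characters in the same order and start with the same four letters, the score lands well above 0.9.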

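Performance should not be a problem. I usually implement the Jaro-Winkler algorithm in a .NET application, because it is tricky to write as a SQL UDF. I suppose you could also use an external CLR stored procedure? This performs well when matching tens of thousands of records. If you might be matching millions of names, performance could become more of a concern.

Here is an example of how you might approach this: http://isolvable.blogspot.co.uk/2011/05/jaro-winkler-fast-fuzzy-linkage.html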

answered 2013-06-21T12:54:16.810

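Take a look at Full-Text Search. It allows a variety of searches, including searches on different word forms. You can configure the word forms yourself or use the out-of-the-box dictionaries.

Quote (emphasis mine):

Full-text queries perform linguistic searches against text data in full-text indexes by operating on words and phrases based on rules of a particular language such as English or Japanese. Full-text queries can include simple words and phrases or multiple forms of a word or phrase.

See this answer about the thesaurus.

An alternative to Full-Text Search is Lucene.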

answered 2013-01-26T13:41:43.387