
"Returns a number that indicates how similar the first string to the most similar word of the second string. The function searches in the second string a most similar word not a most similar substring. The range of the result is zero (indicating that the two strings are completely dissimilar) to one (indicating that the first string is identical to one of the words of the second string)."

That's the definition of word_similarity(a,b). As I understand it, the function looks for the word a inside the text b: it splits b into words and returns the score of the best-matching word.

However, I'm seeing inconsistencies where the matching is not really done word by word. It looks as if the trigrams of all the words are pooled and then compared.

Example:

select word_similarity('sage', 'message sag')

This returns 1. Clearly neither 'message' nor 'sag' should be a full match for 'sage'. If we pool the trigrams of every word in 'message sag', though, all the trigrams of 'sage' are found among them. That is not what should happen, since the function description talks about matching word by word. Is it because the two words are next to each other?

The following returns a score of 0.6:

select word_similarity('sage', 'message test sag') 

Edit: a fiddle to play around with: http://sqlfiddle.com/#!17/b4bab/1


1 Answer


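The function does not match its description

See the related thread on the pgsql-bugs mailing list.

The substring similarity algorithm described by the author compares the trigram arrays of the query string and of the text. The problem is that the trigram array is optimized (duplicate trigrams are eliminated), so it loses the information about the individual words of the text.

This query illustrates the problem: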

with data(t) as (
values
    ('message'),
    ('message s'),
    ('message sag'),
    ('message sag sag'),
    ('message sag sage')
)

select 
    t as "text", 
    show_trgm(t) as "text trigrams", 
    show_trgm('sage') as "string trigrams", 
    cardinality(array_intersect(show_trgm(t), show_trgm('sage'))) as "common trgms"
from data;

       text       |                       text trigrams                       |       string trigrams       | common trgms 
------------------+-----------------------------------------------------------+-----------------------------+--------------
 message          | {"  m"," me",age,ess,"ge ",mes,sag,ssa}                   | {"  s"," sa",age,"ge ",sag} |            3
 message s        | {"  m","  s"," me"," s ",age,ess,"ge ",mes,sag,ssa}       | {"  s"," sa",age,"ge ",sag} |            4
 message sag      | {"  m","  s"," me"," sa","ag ",age,ess,"ge ",mes,sag,ssa} | {"  s"," sa",age,"ge ",sag} |            5
 message sag sag  | {"  m","  s"," me"," sa","ag ",age,ess,"ge ",mes,sag,ssa} | {"  s"," sa",age,"ge ",sag} |            5
 message sag sage | {"  m","  s"," me"," sa","ag ",age,ess,"ge ",mes,sag,ssa} | {"  s"," sa",age,"ge ",sag} |            5
(5 rows)    

The trigram arrays in the last three rows are identical, and they contain every trigram of the query string.
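To check that no single word of the text accounts for all of the query string's trigrams, it can help to look at the trigrams word by word. The following is a minimal sketch, assuming the pg_trgm extension is installed; the split pattern and the column aliases are illustrative choices, not part of pg_trgm:

```sql
-- Assumes: create extension if not exists pg_trgm;
-- Split the text on runs of non-alphanumeric characters and
-- show each word's own trigram array next to its similarity to 'sage'.
select
    word,
    show_trgm(word) as "word trigrams",
    similarity('sage', word) as "similarity"
from regexp_split_to_table('message sag', '[^[:alnum:]]+') as word
where word <> '';  -- skip empty pieces from leading or trailing separators
```

Each row covers one word only, so a trigram such as age (which comes from 'message') and a trigram such as " sa" (which comes from 'sag') appear in different rows rather than in one merged array.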

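Obviously, the implementation is inconsistent with the function description (the description was changed in later versions of the documentation):

"Returns a number that indicates how similar the first string to the most similar word of the second string. The function searches in the second string a most similar word not a most similar substring."


The function I used in the query above: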

create or replace function public.array_intersect(anyarray, anyarray)
returns anyarray language sql immutable
as $$
    select case 
        when $1 is null then $2
        else
            array(
                select unnest($1)
                intersect
                select unnest($2)
            )
        end;
$$;

Workaround

You can easily write your own function that gives results closer to what you would expect:

create or replace function my_word_similarity(text, text)
returns real language sql immutable as $$
    select max(similarity($1, word))
    from regexp_split_to_table($2, '[^[:alnum:]]') word
$$;

Compare:

with data(t) as (
values
    ('message'),
    ('message s'),
    ('message sag'),
    ('message sag sag'),
    ('message sag sage')
)

select t, word_similarity('sage', t), my_word_similarity('sage', t)
from data;

        t         | word_similarity | my_word_similarity
------------------+-----------------+--------------------
 message          |             0.6 |                0.3
 message s        |             0.8 |                0.3
 message sag      |               1 |                0.5
 message sag sag  |               1 |                0.5
 message sag sage |               1 |                  1
(5 rows)

New function in Postgres 11+

Postgres 11+ has a new function, strict_word_similarity(), which gives the results the author of the question expected:

with data(t) as (
values
    ('message'),
    ('message s'),
    ('message sag'),
    ('message sag sag'),
    ('message sag sage')
)

select t, word_similarity('sage', t), strict_word_similarity('sage', t)
from data;

        t         | word_similarity | strict_word_similarity
------------------+-----------------+------------------------
 message          |             0.6 |                    0.3
 message s        |             0.8 |             0.36363637
 message sag      |               1 |                    0.5
 message sag sag  |               1 |                    0.5
 message sag sage |               1 |                      1
(5 rows)
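
If the goal is to filter rows rather than to compute scores, pg_trgm in Postgres 11+ also provides the <<% operator, which compares the strict word similarity against the pg_trgm.strict_word_similarity_threshold setting. A minimal sketch, assuming the extension is installed; the threshold of 0.6 is an arbitrary value chosen for illustration:

```sql
-- Arbitrary illustrative threshold; the default is 0.5.
set pg_trgm.strict_word_similarity_threshold = 0.6;

with data(t) as (
values
    ('message'),
    ('message s'),
    ('message sag'),
    ('message sag sag'),
    ('message sag sage')
)

-- 'sage' <<% t is true when strict_word_similarity('sage', t)
-- passes the threshold set above.
select t, strict_word_similarity('sage', t)
from data
where 'sage' <<% t;
```

Going by the scores in the table above, only 'message sag sage' should pass a threshold of 0.6.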
answered 2017-10-27T12:45:27.543