功能与描述不符
pgsql-bugs 邮件列表上的相关主题。
作者描述的子串相似度算法比较了查询字符串和文本的三元数组。问题是三元组被优化(重复的三元组被消除)并且丢失了关于文本中单个单词的信息。
该查询说明了这个问题:
with data(t) as (
values
('message'),
('message s'),
('message sag'),
('message sag sag'),
('message sag sage')
)
select
t as "text",
show_trgm(t) as "text trigrams",
show_trgm('sage') as "string trigrams",
cardinality(array_intersect(show_trgm(t), show_trgm('sage'))) as "common trgms"
from data;
text | text trigrams | string trigrams | common trgms
------------------+-----------------------------------------------------------+-----------------------------+--------------
message | {" m"," me",age,ess,"ge ",mes,sag,ssa} | {" s"," sa",age,"ge ",sag} | 3
message s | {" m"," s"," me"," s ",age,ess,"ge ",mes,sag,ssa} | {" s"," sa",age,"ge ",sag} | 4
message sag | {" m"," s"," me"," sa","ag ",age,ess,"ge ",mes,sag,ssa} | {" s"," sa",age,"ge ",sag} | 5
message sag sag | {" m"," s"," me"," sa","ag ",age,ess,"ge ",mes,sag,ssa} | {" s"," sa",age,"ge ",sag} | 5
message sag sage | {" m"," s"," me"," sa","ag ",age,ess,"ge ",mes,sag,ssa} | {" s"," sa",age,"ge ",sag} | 5
(5 rows)
最后三行中的 trigram 数组是相同的,并且包含查询字符串的所有 trigram。
显然,实现与函数描述不一致(描述在文档的后期版本中有所更改):
返回一个数字,指示第一个字符串与第二个字符串中最相似的单词的相似程度。该函数在第二个字符串中搜索最相似的单词而不是最相似的子字符串。
我在上述查询中使用的函数:
create or replace function public.array_intersect(anyarray, anyarray)
returns anyarray language sql immutable
as $$
select case
when $1 is null then $2
else
array(
select unnest($1)
intersect
select unnest($2)
)
end;
$$;
解决方法
您可以轻松编写自己的函数以获得更多预期结果:
create or replace function my_word_similarity(text, text)
returns real language sql immutable as $$
select max(similarity($1, word))
from regexp_split_to_table($2, '[^[:alnum:]]') word
$$;
相比:
with data(t) as (
values
('message'),
('message s'),
('message sag'),
('message sag sag'),
('message sag sage')
)
select t, word_similarity('sage', t), my_word_similarity('sage', t)
from data;
t | word_similarity | my_word_similarity
------------------+-----------------+--------------------
message | 0.6 | 0.3
message s | 0.8 | 0.3
message sag | 1 | 0.5
message sag sag | 1 | 0.5
message sag sage | 1 | 1
(5 rows)
Postgres 11+ 中的新功能
Postgres 11+ 中有一个新函数,strict_word_similarity()
它给出了问题作者所期望的结果:
with data(t) as (
values
('message'),
('message s'),
('message sag'),
('message sag sag'),
('message sag sage')
)
select t, word_similarity('sage', t), strict_word_similarity('sage', t)
from data;
t | word_similarity | strict_word_similarity
------------------+-----------------+------------------------
message | 0.6 | 0.3
message s | 0.8 | 0.36363637
message sag | 1 | 0.5
message sag sag | 1 | 0.5
message sag sage | 1 | 1
(5 rows)