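Trigram similarity relies on splitting a sentence into "trigrams" (sequences of three consecutive characters) and treating the result as a SET (i.e., order doesn't matter and there are no repeated values). Before the trigrams are extracted, the text is lowercased and split into words, and each word is padded with two spaces at the beginning and one at the end. This is nearly the same as padding the whole sentence and replacing every single space with two, except that no trigram ever spans two words.
Trigrams are a special case of N-grams.
The trigrams for "Chateau blanc" are found by sliding a three-character window over each padded word:
  chateau 
---          => '  c'
 ---         => ' ch'
  ---        => 'cha'
   ---       => 'hat'
    ---      => 'ate'
     ---     => 'tea'
      ---    => 'eau'
       ---   => 'au '

  blanc 
---          => '  b'
 ---         => ' bl'
  ---        => 'bla'
   ---       => 'lan'
    ---      => 'anc'
     ---     => 'nc '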
Sorting them and removing duplicates gives:
'  b'
'  c'
' bl'
' ch'
'anc'
'ate'
'au '
'bla'
'cha'
'eau'
'hat'
'lan'
'nc '
'tea'
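PostgreSQL can compute this set with the show_trgm function:
SELECT show_trgm('Chateau blanc') AS A;
A = ['  b', '  c', ' bl', ' ch', 'anc', 'ate', 'au ', 'bla', 'cha', 'eau', 'hat', 'lan', 'nc ', 'tea']
... which has 14 trigrams (see the pg_trgm documentation).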
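To see where these trigrams come from, the extraction can be reproduced by hand in plain SQL, without calling show_trgm. This is only a sketch, not pg_trgm's actual implementation: it assumes ASCII input, treats a "word" as a run of letters and digits, and pads each word with two leading spaces and one trailing space as the pg_trgm documentation describes.

```sql
-- Sketch: rebuild the trigram set of a sentence without show_trgm().
-- Assumption: ASCII text; a word is a run of [a-z0-9] after lowercasing.
WITH words AS (
    -- Lowercase, split on non-alphanumerics, then pad each word
    -- with two spaces in front and one space behind.
    SELECT '  ' || w || ' ' AS padded
    FROM regexp_split_to_table(lower('Chateau blanc'), '[^a-z0-9]+') AS w
    WHERE w <> ''
)
SELECT DISTINCT substr(padded, i, 3) AS trigram   -- three-character window
FROM words
CROSS JOIN LATERAL generate_series(1, length(padded) - 2) AS i
ORDER BY trigram ;
```

For 'Chateau blanc' this should return 14 rows, one per trigram listed above. The row order depends on the database collation, so it may differ from the order show_trgm uses.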
The trigram set for "Chateau Cheval Blanc" is:
SELECT show_trgm('Chateau Cheval Blanc') AS B;
B = ['  b', '  c', ' bl', ' ch', 'al ', 'anc', 'ate', 'au ', 'bla', 'cha', 'che', 'eau', 'eva', 'hat', 'hev', 'lan', 'nc ', 'tea', 'val']
... which has 19 trigrams.
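Both counts can be cross-checked directly. The query below is a small sketch that assumes the pg_trgm extension is installed and that cardinality() is available (PostgreSQL 9.4 or later); the column names size_a and size_b are just illustrative.

```sql
-- Requires: CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- Count the distinct trigrams in each sentence.
SELECT
    cardinality(show_trgm('Chateau blanc'))        AS size_a,
    cardinality(show_trgm('Chateau Cheval Blanc')) AS size_b ;
```

This should report 14 and 19, matching the two sets listed here.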
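Counting the trigrams that appear in both sets gives:
A intersect B =
['  b', '  c', ' bl', ' ch', 'anc', 'ate', 'au ', 'bla', 'cha', 'eau', 'hat', 'lan', 'nc ', 'tea']
And the trigrams that appear in either set are:
A union B =
['  b', '  c', ' bl', ' ch', 'al ', 'anc', 'ate', 'au ', 'bla', 'cha', 'che', 'eau', 'eva', 'hat', 'hev', 'lan', 'nc ', 'tea', 'val']
That is, the two sentences have 14 trigrams in common, out of 19 in total.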
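The similarity is computed as the size of the intersection divided by the size of the union:
similarity = 14 / 19
You can check it with:
SELECT
    cast(14.0/19.0 as real) AS computed_result,
    similarity('Chateau blanc', 'chateau cheval blanc') AS function_in_pg ;
Both columns return 0.736842,
... which shows how the similarity is computed, and why you get the value you get.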
Note: you can compute the intersection and the union with:
SELECT
    array_agg(t) AS in_common
FROM
    (
        SELECT unnest(show_trgm('Chateau blanc')) AS t
        INTERSECT
        SELECT unnest(show_trgm('chateau cheval blanc')) AS t
        ORDER BY t
    ) AS trigrams_in_common ;
SELECT
    array_agg(t) AS in_total
FROM
    (
        SELECT unnest(show_trgm('Chateau blanc')) AS t
        UNION
        SELECT unnest(show_trgm('chateau cheval blanc')) AS t
    ) AS trigrams_in_total ;
And this is a way to explore the similarity of different pairs of sentences:
WITH p AS
(
    SELECT
        'This is just a sentence I''ve invented'::text AS f1,
        'This is just a sentence I''ve also invented'::text AS f2
),
t1 AS
(
    SELECT unnest(show_trgm(f1)) FROM p
),
t2 AS
(
    SELECT unnest(show_trgm(f2)) FROM p
),
x AS
(
    SELECT
        (SELECT count(*) FROM
            (SELECT * FROM t1 INTERSECT SELECT * FROM t2) AS s0)::integer AS same,
        (SELECT count(*) FROM
            (SELECT * FROM t1 UNION SELECT * FROM t2) AS s0)::integer AS total,
        similarity(f1, f2) AS sim_2
    FROM
        p
)
SELECT
    same, total, same::real/total::real AS sim_1, sim_2
FROM
    x ;
You can check it at Rextester.