0

我想结合 PostgreSQL Levenshtein 和 trigram 相似函数。三元相似度函数的主要优点是它可以利用 GIN 或 GIST 索引,因此可以快速返回模糊匹配结果。但是,如果在另一个函数中调用它,它不会使用索引。为了说明这个问题,这里有一个调用原始trigram的“相似性”函数的plpgsql函数“trigram_similarity”。

CREATE OR REPLACE FUNCTION public.trigram_similarity(
    left_string text,
    right_string text)
  RETURNS real AS
$BODY$
BEGIN
  RETURN similarity(left_string, right_string);
END;$BODY$
  LANGUAGE plpgsql IMMUTABLE STRICT
  COST 100;
ALTER FUNCTION public.trigram_similarity(text, text)
  OWNER TO postgres;

尽管该函数实际上只是调用了 trigram 的相似度函数,但在 GIN 索引的使用方面表现完全不同。虽然查询的 WHERE 子句中原始 trigram 的相似性函数确实使用了 GIN 索引,因此查询可以快速返回结果并且没有太多 RAM 消耗,但在使用 trigram_similarity 时却没有。对于大型数据集模糊匹配分析,RAM 被完全使用并且应用程序冻结......为了说明,这里是一个示例查询:

SELECT DISTINCT  
trigram_similarity(l.l_composite_18, r.r_composite_18)
::numeric(5,4) AS trigram_similarity_composite_score 
, (trigram_similarity(l."Name", r."Name")*(0.166666666666667) 
+ trigram_similarity(l."LastName", r."Surname")*(0.0833333333333333) 
+ trigram_similarity(l."County", r."District")*(0.0416666666666667) 
+ trigram_similarity(l."Town", r."Location")*(0.0416666666666667) 
+ trigram_similarity(l."PostalCode", r."ZipCode")*(0.0416666666666667) 
+ trigram_similarity(l."PostOffice", r."PostOffice")*(0.0416666666666667) 
+ trigram_similarity(l."Street", r."Road")*(0.0416666666666667) 
+ trigram_similarity(l."Number", r."HomeNumber")*(0.0416666666666667) 
+ trigram_similarity(l."Telephone1", r."Phone1")*(0.0416666666666667) 
+ trigram_similarity(l."Telephone2", r."Phone2")*(0.0416666666666667) 
+ trigram_similarity(l."EMail", r."EMail")*(0.0416666666666667) 
+ trigram_similarity(l."BirthDate", r."DateOfBirth")*(0.166666666666667) 
+ trigram_similarity(l."Gender", r."Sex")*(0.208333333333333) 
)
::numeric(5,4) AS trigram_similarity_weighted_score  
, l."ClanID" AS "l_ClanID_1"
, l."Name" AS "l_Name_2"
, l."LastName" AS "l_LastName_3"
, l."County" AS "l_County_4"
, l."Town" AS "l_Town_5"
, l."PostalCode" AS "l_PostalCode_6"
, l."PostOffice" AS "l_PostOffice_7"
, l."Street" AS "l_Street_8"
, l."Number" AS "l_Number_9"
, l."Telephone1" AS "l_Telephone1_10"
, l."Telephone2" AS "l_Telephone2_11"
, l."EMail" AS "l_EMail_12"
, l."BirthDate" AS "l_BirthDate_13"
, l."Gender" AS "l_Gender_14"
, l."Aktivan" AS "l_Aktivan_15"
, l."ProgramCode" AS "l_ProgramCode_16"
, l."Card" AS "l_Card_17"
, l."DateOfCreation" AS "l_DateOfCreation_18"
, l."Assigned" AS "l_Assigned_19"
, l."Reserved" AS "l_Reserved_20"
, l."Sent" AS "l_Sent_21"
, l."MemberOfBothPrograms" AS "l_MemberOfBothPrograms_22"
, r."ClanID" AS "r_ClanID_23"
, r."Name" AS "r_Name_24"
, r."Surname" AS "r_Surname_25"
, r."District" AS "r_District_26"
, r."Location" AS "r_Location_27"
, r."ZipCode" AS "r_ZipCode_28"
, r."PostOffice" AS "r_PostOffice_29"
, r."Road" AS "r_Road_30"
, r."HomeNumber" AS "r_HomeNumber_31"
, r."Phone1" AS "r_Phone1_32"
, r."Phone2" AS "r_Phone2_33"
, r."EMail" AS "r_EMail_34"
, r."DateOfBirth" AS "r_DateOfBirth_35"
, r."Sex" AS "r_Sex_36"
, r."Active" AS "r_Active_37"
, r."ProgramCode" AS "r_ProgramCode_38"
, r."CardNo" AS "r_CardNo_39"
, r."CreationDate" AS "r_CreationDate_40"
, r."Assigned" AS "r_Assigned_41"
, r."Reserved" AS "r_Reserved_42"
, r."Sent" AS "r_Sent_43"
, r."BothPrograms" AS "r_BothPrograms_44"
 FROM  "l_leftdatasetexample3214274191" AS l, "r_rightdatasetexample3214274191" AS r
 WHERE  
((l."Gender"=r."Sex") AND (l."Card"<>r."CardNo") AND (l."Name"=r."Name"))
 AND 
((l."ProgramCode"= '1') AND (r."ProgramCode"= '1'))
 AND 
(((l.l_composite_18 % r.r_composite_18)
)
 OR (((l."Name" % r."Name")
OR (l."LastName" % r."Surname")
OR (l."County" % r."District")
OR (l."Town" % r."Location")
OR (l."PostalCode" % r."ZipCode")
OR (l."PostOffice" % r."PostOffice")
OR (l."Street" % r."Road")
OR (l."Number" % r."HomeNumber")
OR (l."Telephone1" % r."Phone1")
OR (l."Telephone2" % r."Phone2")
OR (l."EMail" % r."EMail")
OR (l."BirthDate" % r."DateOfBirth")
OR (l."Gender" % r."Sex"))
)
) AND ((trigram_similarity(l.l_composite_18, r.r_composite_18)
 >= 0.7) 
 OR ((trigram_similarity(l."Name", r."Name")*(0.166666666666667) 
+ trigram_similarity(l."LastName", r."Surname")*(0.0833333333333333) 
+ trigram_similarity(l."County", r."District")*(0.0416666666666667) 
+ trigram_similarity(l."Town", r."Location")*(0.0416666666666667) 
+ trigram_similarity(l."PostalCode", r."ZipCode")*(0.0416666666666667) 
+ trigram_similarity(l."PostOffice", r."PostOffice")*(0.0416666666666667) 
+ trigram_similarity(l."Street", r."Road")*(0.0416666666666667) 
+ trigram_similarity(l."Number", r."HomeNumber")*(0.0416666666666667) 
+ trigram_similarity(l."Telephone1", r."Phone1")*(0.0416666666666667) 
+ trigram_similarity(l."Telephone2", r."Phone2")*(0.0416666666666667) 
+ trigram_similarity(l."EMail", r."EMail")*(0.0416666666666667) 
+ trigram_similarity(l."BirthDate", r."DateOfBirth")*(0.166666666666667) 
+ trigram_similarity(l."Gender", r."Sex")*(0.208333333333333) 
)
 >= 0.7)
 ) ORDER  BY trigram_similarity_composite_score DESC;

此查询会导致 RAM 堵塞和应用程序冻结。将“trigram_similarity”替换为“similarity”时,查询执行速度快且不会过度消耗 RAM。为什么“trigram_similarity”和“similarity”表现不同?有没有一种方法可以强制使用此“trigram_similarity”函数或内部调用 trigram 相似度函数的任何其他函数的 GIN 或 GIST 索引?

解释分析何时使用“相似度”:

"Unique  (cost=170717.94..177633.17 rows=58853 width=383) (actual time=260362.193..260362.279 rows=99 loops=1)"
"  Output: ((similarity((l.l_composite_18)::text, (r.r_composite_18)::text))::numeric(5,4)), (((((((((((((((similarity((l."Name")::text, (r."Name")::text) * '0.166666666666667'::double precision) + (similarity((l."LastName")::text, (r."Surname")::text) * '0 (...)"
"  Buffers: shared hit=2513871 read=4158"
"  ->  Sort  (cost=170717.94..170865.07 rows=58853 width=383) (actual time=260362.192..260362.198 rows=99 loops=1)"
"        Output: ((similarity((l.l_composite_18)::text, (r.r_composite_18)::text))::numeric(5,4)), (((((((((((((((similarity((l."Name")::text, (r."Name")::text) * '0.166666666666667'::double precision) + (similarity((l."LastName")::text, (r."Surname")::text (...)"
"        Sort Key: ((similarity((l.l_composite_18)::text, (r.r_composite_18)::text))::numeric(5,4)) DESC, (((((((((((((((similarity((l."Name")::text, (r."Name")::text) * '0.166666666666667'::double precision) + (similarity((l."LastName")::text, (r."Surname" (...)"
"        Sort Method: quicksort  Memory: 76kB"
"        Buffers: shared hit=2513871 read=4158"
"        ->  Nested Loop  (cost=0.29..155793.36 rows=58853 width=383) (actual time=1851.503..260361.609 rows=99 loops=1)"
"              Output: (similarity((l.l_composite_18)::text, (r.r_composite_18)::text))::numeric(5,4), ((((((((((((((similarity((l."Name")::text, (r."Name")::text) * '0.166666666666667'::double precision) + (similarity((l."LastName")::text, (r."Surname")::t (...)"
"              Buffers: shared hit=2513871 read=4158"
"              ->  Seq Scan on public.r_rightdatasetexample3214274191 r  (cost=0.00..11228.86 rows=101669 width=188) (actual time=9.149..67.134 rows=50837 loops=1)"
"                    Output: r."ClanID", r."Name", r."Surname", r."District", r."Location", r."ZipCode", r."PostOffice", r."Road", r."HomeNumber", r."Phone1", r."Phone2", r."EMail", r."DateOfBirth", r."Sex", r."Active", r."ProgramCode", r."CardNo", r."Creat (...)"
"                    Filter: ((r."ProgramCode")::text = '1'::text)"
"                    Buffers: shared hit=5800 read=4158"
"              ->  Index Scan using "idxbNameA8D72F00099E4B70885B2E0BB1DFB684l_leftdatasetexample321" on public.l_leftdatasetexample3214274191 l  (cost=0.29..1.35 rows=1 width=195) (actual time=5.111..5.119 rows=0 loops=50837)"
"                    Output: l."ClanID", l."Name", l."LastName", l."County", l."Town", l."PostalCode", l."PostOffice", l."Street", l."Number", l."Telephone1", l."Telephone2", l."EMail", l."BirthDate", l."Gender", l."Aktivan", l."ProgramCode", l."Card", l."D (...)"
"                    Index Cond: ((l."Name")::text = (r."Name")::text)"
"                    Filter: (((l."ProgramCode")::text = '1'::text) AND ((l."Card")::text <> (r."CardNo")::text) AND ((r."Sex")::text = (l."Gender")::text) AND (((l.l_composite_18)::text % (r.r_composite_18)::text) OR ((l."Name")::text % (r."Name")::text) O (...)"
"                    Rows Removed by Filter: 50"
"                    Buffers: shared hit=2508071"
"Planning time: 13.885 ms"
"Execution time: 260362.730 ms"
4

2 回答 2

1

索引是在表列上创建的。您需要修改 plpgsql 函数来查询 GIN 或 GIST 索引的表列,而不是比较两个字符串文字。如果您比较两个字符串文字,插件没有要命中的索引,并且必须在比较它们之前将两个字符串分解为它们的三元组,这是您的问题。

http://www.postgresql.org/docs/9.1/static/pgtrgm.html

于 2016-05-10T19:51:20.540 回答
0

可以在复合(连接)表达式上创建 GIN trgm_ops 表达式索引。该指标可以方便%相似度算子,但不能方便相似度函数。

于 2016-11-06T13:19:19.757 回答