sql - Oracle Text CONTAINS search that ignores word order

Question

Suppose my Oracle Text index contains 2 rows, for example:

Row 1 'John Smith Bristol South West'
Row 2 'John James Smith London South East'

What is the best and most efficient way to perform the following searches:

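If I supply the search term John Smith or Smith John, both rows should be returned, but row 1 should score higher because the search words sit closer together.
If I supply the search term Joh Smit or Smit Jon, both rows should again be returned, but row 1 should score higher because the search words sit closer together.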

At the moment my SQL looks something like this:

SELECT display_value
     , score(1)
  FROM  my_indx_table
 WHERE contains ( search_tokens, '%' || replace(replace( :SEARCH_STRING, '_', '\_' ), '-', '\-') || '%', 1 ) > 0
ORDER BY score( 1 ) desc;

But it doesn't work the way I want it to.

Thanks in advance for your help.

score 1 · Accepted Answer

Welcome to the dark, scary world of Oracle Text searching. (You may want to read the documentation.) Let's start with some setup so I can reproduce your problem.

create table my_indx_table (display_value number, search_tokens varchar2(100));
create index my_indx on my_indx_table (search_tokens) indextype is ctxsys.context;
insert into my_indx_table values (1, 'John Smith Bristol South West');
insert into my_indx_table values (2, 'John James Smith London South East');
commit;
exec ctx_ddl.sync_index(idx_name => 'MY_INDX');

OK, here is your query. It returns only row 1, because that is the only row that has "John Smith" in exactly that order.

SELECT display_value, score(1)
  FROM  my_indx_table
 WHERE contains ( search_tokens, '%' || replace(replace( 'John Smith', '_', '\_' ), '-', '\-') || '%', 1 ) > 0
ORDER BY score( 1 ) desc;

DISPLAY_VALUE   SCORE(1)
------------- ----------
            1          3
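
A quick way to confirm the order dependence is to run the same pattern with the words reversed. This is only a sketch against the sample table above; that the pattern gets parsed as a two-word phrase is my assumption, so check the result on your own index rather than taking my word for it.

```sql
-- Same pattern as the original query, with the words reversed.
-- Assumption: '%Smith John%' is treated as the phrase "%Smith" followed by "John%",
-- so a row only matches if the words appear in that order.
SELECT display_value, score(1)
  FROM my_indx_table
 WHERE contains ( search_tokens, '%' || 'Smith John' || '%', 1 ) > 0
ORDER BY score( 1 ) desc;
```

If that reading is right, neither sample row qualifies, because in both of them "Smith" comes after "John".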

If you want to run several kinds of search at once with a single CONTAINS call, you probably want to use Query Templates.

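The next example uses query rewriting and query relaxation. It first tries the exact phrase "John Smith", then searches for the two words near each other.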

SELECT display_value, score(1)
  FROM  my_indx_table
 WHERE contains ( search_tokens, 
'<query>
<textquery lang="ENGLISH" grammar="CONTEXT">' || 'John Smith' || '
 <progression>
   <seq><rewrite>transform((TOKENS, "{", "}", " "))</rewrite></seq>
   <seq><rewrite>transform((TOKENS, "{", "}", " NEAR "))</rewrite></seq>
 </progression>
</textquery>
<score datatype="FLOAT" algorithm="COUNT"/>
</query>', 
    1 ) > 0
ORDER BY score( 1 ) desc;

DISPLAY_VALUE   SCORE(1)
------------- ----------
            1       50.5
            2     6.8908

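Row 1 scores higher than row 2 mainly because it contains the exact phrase. If you remove the first <seq></seq> line (or try "Smith John"), you'll notice that the NEAR scores of the two rows are very similar, even though the distances differ. The default score datatype is INTEGER, so rows 1 and 2 would both round to the same score, 14. Probably not what you want. (I think the reason is that Oracle Text is mainly meant for indexing large chunks of text such as documents or books. Its scoring gets a bit odd for short phrases like these.)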
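
If the NEAR scores are too close together, one thing to try is writing the NEAR operator out by hand, where you can also set max_span and order. A minimal sketch against the sample table; the span of 5 is an arbitrary value I picked for illustration, not something tuned.

```sql
-- NEAR((term1, term2), max_span, order)
-- max_span: largest allowed distance between the terms (default 100)
-- order:    FALSE lets the terms appear in either order (the default)
SELECT display_value, score(1)
  FROM my_indx_table
 WHERE contains ( search_tokens, 'NEAR((Smith, John), 5, FALSE)', 1 ) > 0
ORDER BY score( 1 ) desc;
```

With order set to FALSE, "Smith John" and "John Smith" are the same query. Whether a tighter span spreads the scores enough for strings this short is something you'd have to test.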

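Now let's look at fuzzy searching, to handle misspellings. The default similarity score for the fuzzy operator is 60, but I lowered it to 50 so that it picks up Smit=Smith.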

SELECT display_value, score(1)
  FROM  my_indx_table
 WHERE contains ( search_tokens, 
'<query>
<textquery lang="ENGLISH" grammar="CONTEXT">' || 'Joh Smit' || '
 <progression>
   <seq><rewrite>transform((TOKENS, "{", "}", " "))</rewrite></seq>
   <seq><rewrite>transform((TOKENS, "{", "}", " NEAR "))</rewrite></seq>
   <seq><rewrite>transform((TOKENS, "fuzzy(", ", 50)", " "))</rewrite></seq>
   <seq><rewrite>transform((TOKENS, "fuzzy(", ", 50)", " NEAR "))</rewrite></seq>
 </progression>
</textquery>
<score datatype="FLOAT" algorithm="COUNT"/>
</query>', 
    1 ) > 0
ORDER BY score( 1 ) desc;

DISPLAY_VALUE   SCORE(1)
------------- ----------
            1      25.25
            2     3.4454

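Simple enough, I think. The main confusing thing here is probably the query rewrite syntax. But there is a lot you can adjust on the fuzzy operator to make it work for the particular searches you are dealing with.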
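
For reference, the fuzzy operator takes up to four arguments: fuzzy(term, score, numresults, weight). Here is a sketch that spells them out without the query template; the numresults value of 20 is just an illustrative number, and I haven't tuned any of this against real data.

```sql
-- fuzzy(term, score, numresults, weight)
-- score:      similarity threshold, 1 to 80 (default 60); lower matches more loosely
-- numresults: maximum number of expansions per term (default 100)
-- weight:     weight the results by how similar each expansion is to the term
SELECT display_value, score(1)
  FROM my_indx_table
 WHERE contains ( search_tokens,
        'fuzzy(Joh, 50, 20, weight) AND fuzzy(Smit, 50, 20, weight)', 1 ) > 0
ORDER BY score( 1 ) desc;
```

Lowering score widens the net and numresults caps how many expansions each term gets, so those are the two knobs I'd start with.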
