We have an application that uses a SQL Server 2008 database with full-text search. I'm trying to understand why the following two searches behave differently:

First, a phrase containing a hyphenated word, like this:

contains(column_name, '"one two-three-four five"')

And second, an identical phrase, where the hyphens are replaced by spaces:

contains(column_name, '"one two three four five"')

The full-text index uses the ENGLISH (1033) locale, and the default system stoplist.

From my observations of other full-text searches containing hyphenated words, the first one should match either 'one two three four five' or 'one twothreefour five'. Instead, it only matches 'one twothreefour five' (and not 'one two-three-four five').


Test Case

Setup:

create table ftTest 
(
    Id int identity(1,1) not null, 
    Value nvarchar(100) not null, 
    constraint PK_ftTest primary key (Id)
);

insert ftTest (Value) values ('one two-three-four five');
insert ftTest (Value) values ('one twothreefour five');

create fulltext catalog ftTest_catalog;
create fulltext index on ftTest (Value language 1033)
    key index PK_ftTest on ftTest_catalog;
GO
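
One caveat when reproducing this: full-text population runs asynchronously after the index is created, so queries issued right away can come back empty. A minimal sketch of a wait step, assuming the catalog name from the setup above and that a PopulateStatus of 0 means the catalog is idle (the one-second polling interval is an arbitrary choice):

```sql
-- Sketch: poll until the catalog reports idle (PopulateStatus = 0)
-- before running the test queries. Assumes the ftTest_catalog created above.
while fulltextcatalogproperty('ftTest_catalog', 'PopulateStatus') <> 0
begin
    waitfor delay '00:00:01';  -- arbitrary one-second polling interval
end;
GO
```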

Queries:

--returns one match (only 'one twothreefour five')
select * from ftTest where contains(Value, '"one two-three-four five"');

--each returns two matches
select * from ftTest where contains(Value, '"one two three four five"');
select * from ftTest where contains(Value, 'one and "two-three-four five"');
select * from ftTest where contains(Value, '"one two-three-four" and five');
GO

Cleanup:

drop fulltext index on ftTest;
drop fulltext catalog ftTest_catalog;
drop table ftTest;

3 Answers

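In cases like this, where you can't predict the behavior of the word breaker, it's a good idea to run sys.dm_fts_parser on the string to see how the words will be split and stored in the internal index.

For example, running sys.dm_fts_parser on '"one two-three-four five"' gives the following: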

select * from sys.dm_fts_parser('"one two-three-four five"', 1033, NULL, 0)
--edited--
1   0   1   Exact Match one
1   0   2   Exact Match two-three-four
1   0   2   Exact Match two
1   0   3   Exact Match three
1   0   4   Exact Match four
1   0   5   Exact Match five

As the returned rows show, the word breaker parses the string and outputs six forms, which may explain the results you see when running the CONTAINS query.
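
To compare the two search phrases from the question side by side, the same function can be run on both. This is only a sketch: it passes 0 as the stoplist_id to select the system stoplist (which the question says the index uses) rather than NULL (no stoplist), and it names three of the function's output columns explicitly. Running sys.dm_fts_parser requires sysadmin membership.

```sql
-- Sketch: tokenize both phrases with the English (1033) word breaker.
-- Third argument: 0 = system stoplist, NULL = no stoplist.
-- Fourth argument: 0 = accent-insensitive.
select occurrence, special_term, display_term
from sys.dm_fts_parser(N'"one two-three-four five"', 1033, 0, 0);

select occurrence, special_term, display_term
from sys.dm_fts_parser(N'"one two three four five"', 1033, 0, 0);
```

Rows that share an occurrence value are alternative forms the word breaker emitted for the same position in the phrase.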

answered 2012-09-19T22:45:29.273

http://support.microsoft.com/default.aspx?scid=kb;en-us;200043

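"Where non-alphanumeric characters must be used in the search criteria (primarily the dash '-' character), use the Transact-SQL LIKE clause instead of the FULLTEXT or CONTAINS predicates."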
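
Applied to the question's ftTest table, that advice might look like the sketch below. It assumes the table from the test case still exists and that a substring match is what is wanted.

```sql
-- Sketch: exact substring match, hyphens included.
-- LIKE does not use the full-text index; the leading % forces a scan.
select Id, Value
from ftTest
where Value like N'%one two-three-four five%';
```

Against the two rows in the test case, this matches only the hyphenated value, since a hyphen outside square brackets is a literal character in a LIKE pattern.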

answered 2012-07-25T07:38:57.230

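Full-text search considers a word to be a string of characters without spaces or punctuation. A non-alphanumeric character can "break" a word during a search. Because SQL Server full-text search is a word-based engine, punctuation is generally not considered and is ignored when searching the index. Therefore, a CONTAINS clause like 'CONTAINS(testing, "computer-failure")' would match a row with the value "The failure to find my computer would be expensive."

Follow the link to see why: https://support.microsoft.com/en-us/kb/200043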
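
Since punctuation is ignored at the index level, one possible workaround is to let CONTAINS narrow the candidates and then apply LIKE to enforce the exact punctuation. The sketch below assumes the question's ftTest table and relies on the question's own observation that the space-separated phrase returns both rows.

```sql
-- Sketch: the full-text predicate narrows the candidates,
-- then LIKE keeps only rows that contain the literal hyphenated text.
select Id, Value
from ftTest
where contains(Value, '"one two three four five"')
  and Value like N'%one two-three-four five%';
```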

answered 2015-09-17T05:48:54.403