3

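In a fairly typical scenario, my web application has a "search" text box that passes the user's input straight to a stored procedure. The procedure then uses full-text indexing to search two fields in two tables, which are joined on the appropriate keys.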

I'm using the CONTAINS predicate to search the fields. Before passing in the search string, I do the following:

SET @ftQuery = '"' + REPLACE(@query,' ', '*" OR "') + '*"'

For example, this turns the castle into "the*" OR "castle*". It is necessary because I want people to be able to search for cas and get results for castle.

WHERE CONTAINS(Building.Name, @ftQuery) OR CONTAINS(Road.Name, @ftQuery)
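For readers who would rather build the search condition in application code, the REPLACE expression above can be mirrored roughly as follows. This is only a sketch in Python; the function name `build_prefix_query` is made up for illustration and is not part of the original stored procedure.

```python
def build_prefix_query(user_input: str) -> str:
    """Turn 'the castle' into '"the*" OR "castle*"' for use with CONTAINS."""
    # split() with no argument collapses runs of whitespace, so repeated
    # spaces do not produce empty "*" terms the way a plain REPLACE would.
    words = user_input.split()
    return " OR ".join(f'"{word}*"' for word in words)


print(build_prefix_query("the castle"))
```

Note that an empty input yields an empty string, so the caller would need to check for that before handing the value to CONTAINS.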

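The problem is that, now that I've appended a wildcard to the end of every word, the noise words (e.g. the) also carry a wildcard and so no longer seem to be discarded. This means that a search for the castle returns items containing words such as theatre.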
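The effect can be reproduced with a deliberately simplified model of prefix matching. The sketch below is not how SQL Server's word breaker or index actually works; the sample names and the helper `prefix_term_matches` are invented purely to show why a term like "the*" also hits "Theatre".

```python
def prefix_term_matches(term: str, indexed_word: str) -> bool:
    """Rough stand-in for a prefix term such as "the*" matching one word."""
    return indexed_word.lower().startswith(term.lower())


# Made-up sample data, standing in for Building.Name / Road.Name values.
names = ["Castle Road", "Theatre Royal", "Old Mill"]
terms = ["the", "castle"]  # what "the castle" becomes after the REPLACE

hits = [
    name
    for name in names
    if any(prefix_term_matches(t, w) for t in terms for w in name.split())
]
print(hits)
```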

Changing the OR to AND was my first thought, but that simply seems to return no matches at all whenever the query contains a noise word.

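All I'm trying to achieve is to let the user type several space-separated words, in any order, each of which is either a whole word or a prefix of a word they are looking for, and to have noise words such as the stripped from their input. Otherwise, when they search for the castle, they get a long list of items with the result they need buried somewhere in the middle.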
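The behaviour described above could be approximated on the application side along these lines. This is a sketch under stated assumptions: `NOISE_WORDS` is a tiny hypothetical list (the real one ships with SQL Server), and `build_filtered_query` is an illustrative name, not an existing API.

```python
# Hypothetical, deliberately tiny noise list used only for this example.
NOISE_WORDS = {"the", "a", "an", "of", "and", "in"}


def build_filtered_query(user_input: str, operator: str = "AND") -> str:
    """Drop noise words, then join the remaining words as prefix terms."""
    words = [w for w in user_input.split() if w.lower() not in NOISE_WORDS]
    return f" {operator} ".join(f'"{w}*"' for w in words)


print(build_filtered_query("the castle"))
print(build_filtered_query("the old cas"))
```

Because the noise words are gone before the terms are joined, AND can be used without the query coming back empty. If every word is a noise word the result is an empty string, which the caller would have to handle.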

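I could go ahead and implement my own noise word removal routine, but this seems like something full-text indexing should be able to handle.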

Thanks for your help!

Jamie

4

5 Answers

5

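Noise words are stripped before the index is stored, so it is not possible to write a query that searches for a stop word. If you really want to enable this behaviour, you need to edit the stop word list (http://msdn.microsoft.com/en-us/library/ms142551.aspx) and then rebuild the index.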

answered 2009-01-28T17:46:42.713
2

I had the same problem and, after a thorough search, came to the conclusion that there is no good solution.

As a compromise, I'm implementing a brute-force solution:

1) Open C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\FTData\noiseENU.txt and copy all of its text.

2) Paste it into a code file in your application, replacing the line breaks with "," to get a list initializer like this:

public static List<string> _noiseWords = new List<string>{ "about", "1", "after", "2", "all", "also", "3", "an", "4", "and", "5", "another", "6", "any", "7", "are", "8", "as", "9", "at", "0", "be", "$", "because", "been", "before", "being", "between", "both", "but", "by", "came", "can", "come", "could", "did", "do", "does", "each", "else", "for", "from", "get", "got", "has", "had", "he", "have", "her", "here", "him", "himself", "his", "how", "if", "in", "into", "is", "it", "its", "just", "like", "make", "many", "me", "might", "more", "most", "much", "must", "my", "never", "no", "now", "of", "on", "only", "or", "other", "our", "out", "over", "re", "said", "same", "see", "should", "since", "so", "some", "still", "such", "take", "than", "that", "the", "their", "them", "then", "there", "these", "they", "this", "those", "through", "to", "too", "under", "up", "use", "very", "want", "was", "way", "we", "well", "were", "what", "when", "where", "which", "while", "who", "will", "with", "would", "you", "your", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z" };

3) Before submitting the search string, split it into words and remove any word that appears in the noise word list, like this:

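List<string> goodWords = new List<string>();
// Split on spaces, dropping the empty tokens that repeated spaces would produce.
string[] words = searchString.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
foreach (string word in words)
{
   // The noise list is all lowercase, so lowercase the word before the lookup.
   if (!_noiseWords.Contains(word.ToLowerInvariant()))
      goodWords.Add(word);
}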

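Not an ideal solution, but it should work as long as the noise word file doesn't change. Multi-language support would use a dictionary of lists keyed by language.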
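The dictionary-per-language idea might look something like the following Python sketch. The language keys and the very short word sets are placeholders invented for illustration; in practice each set would be loaded from the corresponding noise file.

```python
# Placeholder data: each set would really come from a per-language noise file.
NOISE_WORDS_BY_LANGUAGE = {
    "en": {"about", "after", "all", "the"},
    "fr": {"le", "la", "les", "des"},
}


def remove_noise_words(search_string: str, language: str = "en") -> list:
    """Return the words of search_string that are not noise in `language`."""
    # An unknown language falls back to an empty set, i.e. nothing is removed.
    noise = NOISE_WORDS_BY_LANGUAGE.get(language, set())
    return [w for w in search_string.split() if w.lower() not in noise]


print(remove_noise_words("The castle"))
print(remove_noise_words("le chateau", "fr"))
```

Using sets rather than lists keeps each lookup cheap, which matters little for a handful of search words but costs nothing either.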

answered 2009-01-30T00:12:39.790
1

Here is a working function. The file noiseENU.txt is copied as-is from \Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\FTData.

    Public Function StripNoiseWords(ByVal s As String) As String
        Dim NoiseWords As String = ReadFile("/Standard/Core/Config/noiseENU.txt").Trim
        Dim NoiseWordsRegex As String = Regex.Replace(NoiseWords, "\s+", "|") ' about|after|all|also etc.
        NoiseWordsRegex = String.Format("\s?\b(?:{0})\b\s?", NoiseWordsRegex)
        Dim Result As String = Regex.Replace(s, NoiseWordsRegex, " ", RegexOptions.IgnoreCase) ' replace each noise word with a space
        Result = Regex.Replace(Result, "\s+", " ") ' eliminate any multiple spaces
        Return Result
    End Function
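
The same regex idea can be sketched in Python for comparison. This version is not a translation of the function above: it takes the noise words as a parameter instead of reading a file, and it escapes each word, on the assumption that entries such as "$" should be treated literally rather than as regex syntax. The name `strip_noise_words` is illustrative.

```python
import re


def strip_noise_words(text: str, noise_words: list) -> str:
    """Replace whole-word occurrences of noise words, then tidy the spacing."""
    if not noise_words:
        return text
    # re.escape keeps entries such as "$" from acting as regex metacharacters.
    alternation = "|".join(re.escape(word) for word in noise_words)
    pattern = r"\b(?:" + alternation + r")\b"
    without_noise = re.sub(pattern, " ", text, flags=re.IGNORECASE)
    # Collapse the runs of spaces left behind and trim both ends.
    return re.sub(r"\s+", " ", without_noise).strip()


print(strip_noise_words("The castle of the king", ["the", "of"]))
```

The `\b` word boundaries are what stop "the" from eating the start of "theatre".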
answered 2010-01-22T22:10:53.073
1

You can also remove the noise words before running the query. List of language IDs: http://msdn.microsoft.com/en-us/library/ms190303.aspx

Dim queryTextWithoutNoise As String = removeNoiseWords(queryText, ConnectionString, 1033)

Public Function removeNoiseWords(ByVal inputText As String, ByVal cnStr As String, ByVal languageID As Integer) As String

    Dim r As New System.Text.StringBuilder
    Try
        If inputText.Contains(CChar("""")) Then
            r.Append(inputText)
        Else
            Using cn As New SqlConnection(cnStr)

                Const q As String = "SELECT display_term,special_term FROM sys.dm_fts_parser(@q,@l,0,0)"
                cn.Open()
                Dim cmd As New SqlCommand(q, cn)
                With cmd.Parameters
                    .Add(New SqlParameter("@q", """" & inputText & """"))
                    .Add(New SqlParameter("@l", languageID))
                End With
                Dim dr As SqlDataReader = cmd.ExecuteReader
                While dr.Read
                    If Not (dr.Item("special_term").ToString.Contains("Noise")) Then
                        r.Append(dr.Item("display_term").ToString)
                        r.Append(" ")
                    End If
                End While
            End Using
        End If
    Catch ex As Exception
        ' ...        
    End Try
    Return r.ToString

End Function
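
The filtering step in the middle of that function can be looked at in isolation. The Python sketch below assumes the rows have already been fetched from sys.dm_fts_parser as (display_term, special_term) pairs; the sample rows are hand-written stand-ins rather than real query output, and `keep_non_noise_terms` is an illustrative name.

```python
def keep_non_noise_terms(parser_rows) -> str:
    """Join the display terms whose special_term does not mark them as noise."""
    kept = [display for display, special in parser_rows if "Noise" not in special]
    return " ".join(kept)


# Hand-written stand-ins for rows returned by sys.dm_fts_parser.
rows = [("the", "Noise Word"), ("castle", "Exact Match")]
print(keep_non_noise_terms(rows))
```

Unlike the function above, this version does not leave a trailing space after the last term.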
answered 2010-10-25T10:26:20.187
0

That's similar to my approach.

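Although I was hoping to use full-text indexing for stemming, speed, multi-word searches and so on, I'm actually only indexing a couple of nvarchar(100) fields across two tables. Each table will easily stay under 50,000 rows.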

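My solution was to remove all the noise words from the text file and let the indexer build an index that contains every word. It still only holds a few thousand entries.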

I then do the space replacement on the search string as described in my original post, so that CONTAINS can handle multiple words and stem each word individually.

It seems to work well, but I'll be keeping a close eye on performance.

answered 2009-02-04T14:43:09.157