Today's challenge is to create a search engine for my store's product database.
A lot of the products are entered by hand, and by many different hands!
So it's quite likely to find "i-phone 3gs", "iPhone4", and "i phone 5",
and what I want is to search for "iPhone" and get the three example products above as results.
That made me think of "fuzzy search". I tried to use it out of the box, but without success.
How do I have to index and search these kinds of examples (special characters or whitespace in the document body) in order to retrieve "synonym" results?
For example:
iPhone => "i Phone"
"Special 40" => "Special-40"
Using Lucene, there are a couple of options I would recommend.
One would be to index product ids with a KeywordAnalyzer, and then query as you suggested, with a fuzzy query.
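For the first option, the query side might look something like the sketch below (the field name "productId" and the query term are assumptions, not from your schema). Keep in mind that FuzzyQuery only matches within a limited edit distance (2 by default in Lucene 4.x), so it won't bridge large differences on its own:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.Query;

// Matches terms within a small edit distance of "iphone",
// e.g. "iphone4" or "i-phone", depending on how the ids were indexed.
Query query = new FuzzyQuery(new Term("productId", "iphone"));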
Or, you could create a custom Analyzer in which you add a WordDelimiterFilter, which will create tokens at changes in case, as well as at dashes and spaces (if any remain in your tokens after they have passed through the tokenizer). An important note: if you are using a StandardAnalyzer, a SimpleAnalyzer, or something similar, you will want to make sure the WordDelimiterFilter is applied BEFORE the LowerCaseFilter. Running it through the LowerCaseFilter first would, of course, prevent it from splitting terms on camel casing. Another caution: you'll probably want to customize your StopFilter, since "i" is a common English stopword.
In a custom analyzer, you mainly just need to override createComponents(). For example, if you wanted to add WordDelimiterFilter functionality into the StandardAnalyzer's set of filters:
@Override
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer tokenizer = new StandardTokenizer(Version.LUCENE_40, reader);
    TokenStream filter = new StandardFilter(Version.LUCENE_40, tokenizer);
    // Take a look at the WordDelimiterFilterFactory API for other options on this filter's behavior
    filter = new WordDelimiterFilter(filter, WordDelimiterFilter.GENERATE_WORD_PARTS, null);
    filter = new LowerCaseFilter(Version.LUCENE_40, filter);
    // As mentioned, create a CharArraySet of your stopwords, since the default will likely cause problems for you
    filter = new StopFilter(Version.LUCENE_40, filter, myStopWords);
    return new TokenStreamComponents(tokenizer, filter);
}
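As for myStopWords, here is a minimal sketch of building a stopword set that omits "i" (the word list itself is purely illustrative; build yours from your own data):

import java.util.Arrays;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.util.Version;

// Unlike StandardAnalyzer's default English set, this set does not
// contain "i", so a query like "i phone 5" keeps all of its terms.
CharArraySet myStopWords = new CharArraySet(
        Version.LUCENE_40,
        Arrays.asList("a", "an", "and", "or", "the"),
        true); // ignoreCase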
Using Solr, make sure you walk through the example tutorial and the accompanying schema.xml. You'll see there are two type definitions in there (text_en_splitting and text_en_splitting_tight, I believe) that demonstrate a very similar use case.
Specifically, you're looking at a WordDelimiterFilter augmented by a LowerCaseFilter and possibly a SynonymFilter. You do have to be a little careful with SynonymFilters, though, especially when you map to/from multi-word equivalencies.
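As a rough sketch, loosely modeled on those example types rather than copied from them, such a field type could look like this (the type name and synonyms.txt are placeholders):

<fieldType name="text_products" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Split "iPhone4" / "i-phone" style tokens on case changes,
         hyphens, and letter/digit boundaries -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Multi-word synonyms in synonyms.txt need extra care -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>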