I am trying to implement an index-based text search with Lucene 4.3.1; the code is below. I built the index with an NGramTokenizer because I want to find matches that are too far off for a FuzzyQuery. My solution has two problems. First, I don't understand why some strings are found and others are not. For example, searching for "Buter", "utter", or "Bute" finds "Butter", but searching for "Btter" returns no results (see the token-printing sketch at the end for what the analyzer emits). Is there a mistake in my implementation, and what should I do differently? Second, I would like every query to return (for example) 10 results. Is that possible with my code as it stands, or what do I need to change to get those 10 results?
Here is the code:
public LuceneIndex() throws IOException {
    File dir = new File(indexDirectoryPath);
    index = FSDirectory.open(dir);
    analyzer = new NGramAnalyzer();
    config = new IndexWriterConfig(luceneVersion, analyzer);
    indexWriter = new IndexWriter(index, config);
    reader = DirectoryReader.open(FSDirectory.open(dir));
    searcher = new IndexSearcher(reader);
    queryParser = new QueryParser(luceneVersion, "label", new NGramAnalyzer());
}
/**
 * building the index
 * @param graph
 * @throws IOException
 */
public void makeIndex(MyGraph graph) throws IOException {
    FieldType fieldType = new FieldType();
    fieldType.setTokenized(true);
    // read the items that should be indexed
    ArrayList<String> DbList = Helper.readListFromFileDb(indexFilePath);
    for (String word : DbList) {
        Document doc = new Document();
        doc.add(new TextField("label", word, Field.Store.YES));
        indexWriter.addDocument(doc);
    }
    indexWriter.close();
}
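In case it helps with diagnosing the first problem, here is a small sketch of how the terms that actually end up in the "label" field could be inspected. It reuses the reader field from the constructor and needs org.apache.lucene.index.MultiFields/Terms/TermsEnum and org.apache.lucene.util.BytesRef; this is my debugging idea, not part of the original class:

    // Sketch: dump every term stored in the "label" field,
    // to check which bigrams made it into the index.
    public void dumpIndexedTerms() throws IOException {
        Terms terms = MultiFields.getTerms(reader, "label");
        if (terms == null) {
            return; // no terms indexed for this field (yet)
        }
        TermsEnum termsEnum = terms.iterator(null);
        BytesRef term;
        while ((term = termsEnum.next()) != null) {
            System.out.println(term.utf8ToString());
        }
    }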
public void searchIndexWithQueryParser(String searchString, int numberOfResults)
        throws IOException, ParseException {
    System.out.println("Searching for '" + searchString + "' using QueryParser");
    Query query = queryParser.parse(searchString);
    System.out.println(query.toString());
    TopDocs results = searcher.search(query, numberOfResults);
    ScoreDoc[] hits = results.scoreDocs;
    // just to see some output...
    int i = 0;
    Document doc = searcher.doc(hits[i].doc);
    String label = doc.get("label");
    System.out.println(label);
}
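For context, this is roughly how I drive the two methods (a hypothetical main just for illustration; the MyGraph argument is not used inside makeIndex at the moment):

    public static void main(String[] args) throws IOException, ParseException {
        LuceneIndex luceneIndex = new LuceneIndex();
        luceneIndex.makeIndex(null);                          // builds the index from indexFilePath
        luceneIndex.searchIndexWithQueryParser("Buter", 10);  // finds "Butter"
        luceneIndex.searchIndexWithQueryParser("Btter", 10);  // finds nothing
    }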
Edit: here is the code for NGramAnalyzer:
public class NGramAnalyzer extends Analyzer {
    int minGram = 2;
    int maxGram = 2;

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new NGramTokenizer(reader, minGram, maxGram);
        CharArraySet charArraySet = StopFilter.makeStopSet(Version.LUCENE_43,
                FoodProductBlackList.blackList, true);
        TokenStream filter = new StopFilter(Version.LUCENE_43, source, charArraySet);
        return new TokenStreamComponents(source, filter);
    }
}
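And this is the token-printing sketch mentioned above: a standalone snippet, written against what I understand to be the standard Lucene 4.3 TokenStream consumption pattern, that shows which bigrams the analyzer emits for a given string.

    import java.io.IOException;
    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class TokenDebug {
        // Prints every token the analyzer emits for the given text.
        public static void printTokens(String text) throws IOException {
            Analyzer analyzer = new NGramAnalyzer();
            TokenStream stream = analyzer.tokenStream("label", new StringReader(text));
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.print("[" + term.toString() + "] ");
            }
            stream.end();
            stream.close();
            System.out.println();
        }

        public static void main(String[] args) throws IOException {
            printTokens("Butter"); // expected: [Bu] [ut] [tt] [te] [er]
            printTokens("Btter");  // expected: [Bt] [tt] [te] [er]
        }
    }

If I understand the bigrams correctly, three of the four bigrams of "Btter" ([tt], [te], [er]) also occur in "Butter", which is why the missing result surprises me.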