
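I have a large document made up of different sections. Each section has a list of keywords/phrases of interest. I have a master list of the keywords/phrases stored as a string array. How can I use Solr or Lucene to search each section document for all of the keywords and essentially tell me which keywords were found? I can't think of any straightforward way to achieve this...

Thanks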


2 Answers


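Start with the basics

Run the program and you will see how Lucene indexes; this will help you index and search documents that contain fields.

Decide, based on your data, how each field needs to be stored and indexed. For example, date fields should be indexed with Field.Index.NOT_ANALYZED rather than Field.Index.ANALYZED.

The next step should then be: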
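To see why this choice matters, the sketch below imitates the two modes in plain Java, without Lucene: an analyzed value is split into several terms, while a non-analyzed value is kept as one exact term. The class `IndexModeSketch` and its simple split-on-non-alphanumerics tokenizer are assumptions made for illustration; Lucene's real analyzers tokenize differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; this is not Lucene code.
final class IndexModeSketch {

    // Imitates an analyzed field: lowercase, then split on anything
    // that is not a letter or a digit.
    static List<String> analyzed(String value) {
        List<String> terms = new ArrayList<String>();
        for (String part : value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!part.isEmpty()) {
                terms.add(part);
            }
        }
        return terms;
    }

    // Imitates a non-analyzed field: the whole value is one term, unchanged.
    static List<String> notAnalyzed(String value) {
        return Collections.singletonList(value);
    }

    public static void main(String[] args) {
        String date = "2009-09-02";
        System.out.println("analyzed:     " + analyzed(date));
        System.out.println("not analyzed: " + notAnalyzed(date));
    }
}
```

With this simplified tokenizer the date is broken into separate year, month and day terms, so an exact lookup of the full date string would only succeed against the non-analyzed form.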

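import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.FieldSelector;
import org.apache.lucene.document.Fieldable;
import org.apache.lucene.document.MapFieldSelector;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

// indexmap      ==> HashMap holding the search configuration
// keywordfields ==> your master list of keywords/phrases (each entry is used below as the field name of a Term)
// selectFields  ==> your document fields (contained in the Lucene index)
// params, SEARCH_QUERYSTRING, TOTAL_RESULTS, RS and indexSearcher are assumed to be defined elsewhere
String[] keywordfields = indexmap.get("keywordfields").toString().split(",");
String[] selectFields = indexmap.get("indexfields").toString().split(",");
Map<String, Object> resultMap = new HashMap<String, Object>();
List<Map<String, String>> resultList = new ArrayList<Map<String, String>>();
try {
    // create a BooleanQuery and add one optional (SHOULD) clause per entry
    BooleanQuery bq = new BooleanQuery();
    for (int i = 0; i < keywordfields.length; i++) {
        bq.add(new BooleanClause(
                new TermQuery(new Term(keywordfields[i], (String) params.get(SEARCH_QUERYSTRING))),
                BooleanClause.Occur.SHOULD));
    }
    // pass the boolean query object to the IndexSearcher
    TopDocs topDocs = indexSearcher.search(bq, 1000);
    // get a reference to the ScoreDoc array
    ScoreDoc[] hits = topDocs.scoreDocs;
    // load only the selected fields of each hit
    FieldSelector fieldSelector = new MapFieldSelector(selectFields);
    for (ScoreDoc scoreDoc : hits) {
        Document doc = indexSearcher.doc(scoreDoc.doc, fieldSelector);
        Map<String, String> searchMap = new HashMap<String, String>();
        // copy every loaded field of this document into the result
        List<Fieldable> fields = doc.getFields();
        for (Fieldable field : fields) {
            searchMap.put(field.name(), field.stringValue());
            System.out.println("Field Name:" + field.name());
            System.out.println("Field value:" + field.stringValue());
        }
        resultList.add(searchMap);
    }
    resultMap.put(TOTAL_RESULTS, hits.length);
    resultMap.put(RS, resultList);
} catch (Exception e) {
    e.printStackTrace();
}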

This should be one way to implement it using Lucene =]
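A query made of SHOULD clauses returns the documents that match at least one clause, but the hit list alone does not say which clause matched. One way around this, sketched here in plain Java rather than Lucene, is to look each keyword up on its own, which roughly corresponds to running one TermQuery per keyword. The class `SectionKeywordIndex`, its methods, the sample sections and the simple tokenizer are hypothetical stand-ins for a real index, and only single-word keywords are handled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for an index: maps each term to the sections containing it.
final class SectionKeywordIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<String, Set<Integer>>();

    // Adds one section; terms are lowercased and split on non-alphanumerics
    // (a simplification of what a real analyzer does).
    void addSection(int sectionId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (term.isEmpty()) {
                continue;
            }
            Set<Integer> sections = postings.get(term);
            if (sections == null) {
                sections = new TreeSet<Integer>();
                postings.put(term, sections);
            }
            sections.add(sectionId);
        }
    }

    // Looks up each keyword separately and returns those found in the given section,
    // in the order they appear in the master list.
    List<String> keywordsFoundIn(int sectionId, String[] keywords) {
        List<String> found = new ArrayList<String>();
        for (String keyword : keywords) {
            Set<Integer> sections = postings.get(keyword.toLowerCase(Locale.ROOT));
            if (sections != null && sections.contains(sectionId)) {
                found.add(keyword);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        SectionKeywordIndex index = new SectionKeywordIndex();
        index.addSection(1, "Lucene builds an inverted index.");
        index.addSection(2, "Solr is a search server built on Lucene.");
        String[] masterList = {"lucene", "solr", "hadoop"};
        System.out.println("Section 1: " + index.keywordsFoundIn(1, masterList));
        System.out.println("Section 2: " + index.keywordsFoundIn(2, masterList));
    }
}
```

Because every keyword is checked individually, the result is a per-section list of matched keywords rather than just a list of matching sections.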

Answered 2009-09-02T08:20:22.227

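It sounds like what you need is Lucene's analysis functionality. At the core of this functionality is the Analyzer class. From the documentation:

An Analyzer builds TokenStreams, which analyze text. It thus represents a policy for extracting index terms from text.

There are many Analyzer classes to choose from, but StandardAnalyzer usually does a good job: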

// For each chapter...

Reader reader = ...; // You are responsible for opening a reader for each chapter
Analyzer analyzer = new StandardAnalyzer();
TokenStream tokenStream = analyzer.tokenStream("", reader);

Token token = new Token();
while ((token = tokenStream.next(token)) != null) {
    String keyword = token.term();
    // You can now do whatever you wish with this keyword
}

You may find that other analyzers do a better job for your purposes.
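Once a section has been turned into tokens, the tokens can be compared against the master list, including multi-word phrases. The following plain-Java sketch shows one way to do that; it does not use Lucene, and the class `PhraseMatcher`, its `tokenize` helper and the sample text are assumptions for illustration. A phrase counts as found when its tokens appear consecutively in the section.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical helper; a real setup would take its tokens from an Analyzer.
final class PhraseMatcher {

    // Simplified tokenizer: lowercase, split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String part : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Returns the master-list entries whose tokens occur contiguously in the section.
    static List<String> findKeywords(String sectionText, String[] masterList) {
        List<String> sectionTokens = tokenize(sectionText);
        List<String> found = new ArrayList<String>();
        for (String entry : masterList) {
            List<String> entryTokens = tokenize(entry);
            // indexOfSubList returns -1 when the entry's tokens are not found in sequence;
            // empty entries are skipped because an empty list matches everywhere.
            if (!entryTokens.isEmpty()
                    && Collections.indexOfSubList(sectionTokens, entryTokens) >= 0) {
                found.add(entry);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String section = "Full text search uses an inverted index to find terms quickly.";
        String[] masterList = {"inverted index", "Full Text", "index terms", "search"};
        System.out.println(findKeywords(section, masterList));
    }
}
```

Tokenizing the master list with the same rules as the section text keeps the comparison consistent, which is the same reason the analyzer used at search time usually matches the one used at index time.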

Answered 2009-09-03T09:02:21.293