
I have a fairly large Lucene index, and queries that can hit about 5000 documents or so. I store my application metadata in a Lucene field (separate from the text contents), and I need to quickly get to this small metadata field for all 5000 hits. Currently, my code looks something like this:

MapFieldSelector selector = new MapFieldSelector("metaData");
ScoreDoc[] hits = searcher.search(query, null, 10000).scoreDocs;
for (int i = 0; i < hits.length; i++) {
    int docId = hits[i].doc;
    Document hitDoc = searcher.doc(docId, selector); // expensive, especially with a disk-based index
    String metadata = hitDoc.getFieldable("metaData").stringValue();
}

However, this is terribly slow because each call to searcher.doc() is pretty expensive. Is there a way to do a "batch" fetch of the field for all the hits that may be more responsive? Or any other way to make this work faster? (the only thing inside the ScoreDoc appears to be the Lucene doc id, which I understand should not be relied upon. Otherwise I would have maintained a Lucene doc id -> metadata map on my own.) Thanks!

Update: I am now using a FieldCache, like this:

String[] metadatas = org.apache.lucene.search.FieldCache.DEFAULT.getStrings(searcher.getIndexReader(), "metaData");

when I open the index, and upon a query:

int ldocId = hits[i].doc;
String metadata = metadatas[ldocId]; 

This is working well for me.
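
For reference, here is a minimal sketch of how those two snippets fit together, assuming Lucene 3.x and a single IndexSearcher; the reopen handling at the end is an assumption about how the index gets refreshed. FieldCache reads the field's indexed terms, so "metaData" needs to be indexed (e.g. NOT_ANALYZED) and single-valued, and the cached array (like the doc ids) is only valid for one IndexReader instance.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;

public class MetadataLookup {
    private IndexSearcher searcher;
    private String[] metadatas; // indexed by Lucene doc id, one slot per document in the reader

    public MetadataLookup(IndexSearcher searcher) throws IOException {
        this.searcher = searcher;
        // Loads and caches the whole column once per IndexReader instance.
        this.metadatas = FieldCache.DEFAULT.getStrings(searcher.getIndexReader(), "metaData");
    }

    public List<String> metadataForHits(Query query) throws IOException {
        ScoreDoc[] hits = searcher.search(query, null, 10000).scoreDocs;
        List<String> result = new ArrayList<String>(hits.length);
        for (ScoreDoc hit : hits) {
            result.add(metadatas[hit.doc]); // plain array lookup instead of searcher.doc()
        }
        return result;
    }

    // If the index is reopened, doc ids change and the old array no longer applies,
    // so rebuild the lookup from the new reader (assumption about your refresh strategy):
    public void indexReopened(IndexSearcher newSearcher) throws IOException {
        this.searcher = newSearcher;
        this.metadatas = FieldCache.DEFAULT.getStrings(newSearcher.getIndexReader(), "metaData");
    }
}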


1 Answer


The best way to improve performance is to reduce the amount of stored data as much as possible. If you are storing a large contents field in the index, making it indexed-only rather than stored will improve your performance. Storing the contents outside of Lucene, to be retrieved once you have found a hit in the index, is usually a better idea.
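
A minimal sketch of that indexing setup, assuming the Lucene 3.x Field API; apart from "metaData", the field names, the helper, and its parameters are hypothetical:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class DocBuilder {
    // fullText, metaData and externalId are hypothetical application values.
    static Document build(String fullText, String metaData, String externalId) {
        Document doc = new Document();
        // Searchable but not stored -- keeps the index small and doc() calls cheap:
        doc.add(new Field("contents", fullText, Field.Store.NO, Field.Index.ANALYZED));
        // Small stored (and untokenized) fields that you actually need back from a hit:
        doc.add(new Field("metaData", metaData, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("externalId", externalId, Field.Store.YES, Field.Index.NOT_ANALYZED));
        // After a search, use "externalId" to fetch the full content from outside Lucene.
        return doc;
    }
}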

There may also be a better way to get the end result you are looking for. I am guessing those 5000 pieces of metadata are not the final result here. The data indexed in Lucene may be able to drive your analysis directly, rather than you first pulling it all out of the index. I don't know whether that is feasible in your case from what you have described, but it is certainly worth a look.
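
As one hedged illustration of that idea (assuming the Lucene 3.x Collector API, and assuming the analysis is something like counting metadata values): aggregate over the FieldCache column while the hits are collected, so the 5000 strings never need to be materialized at all.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.Scorer;

public class MetaDataCountCollector extends Collector {
    private final Map<String, Integer> counts = new HashMap<String, Integer>();
    private String[] metadatas; // FieldCache column for the current segment

    @Override
    public void setScorer(Scorer scorer) {
        // scores are not needed for counting
    }

    @Override
    public void setNextReader(IndexReader reader, int docBase) throws IOException {
        // collect() receives segment-relative doc ids, so fetch the column per segment
        metadatas = FieldCache.DEFAULT.getStrings(reader, "metaData");
    }

    @Override
    public void collect(int doc) {
        String value = metadatas[doc];
        Integer current = counts.get(value);
        counts.put(value, current == null ? 1 : current + 1);
    }

    @Override
    public boolean acceptsDocsOutOfOrder() {
        return true;
    }

    public Map<String, Integer> getCounts() {
        return counts;
    }
}

Usage would be searcher.search(query, new MetaDataCountCollector()) followed by reading getCounts() from the collector, with no searcher.doc() calls at all.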

answered 2013-05-21T22:08:56.760