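The Lucene-Core 4.0 release notes mention one notable change:

• A new "Block" PostingsFormat offers improved search performance and index compression. This will likely become the default format in a future release.

According to this post, BlockPostingsFormat produces a smaller index and is faster than the previous format (for most queries).

However, I cannot find any mention of how to select this format in 4.0. Where can I specify that the new BlockPostingsFormat should be used instead of the old default?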

1 Answer

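A few steps:

  1. Pick a codec, then "modify" it to use BlockPostingsFormat as its PostingsFormat class. You can either extend the codec's class or use FilterCodec, which lets you override selected settings of an existing codec.
  2. Create a file at META-INF/services/org.apache.lucene.codecs.Codec. It should list the fully qualified class name of the codec class you created in the previous step. This is needed because of the way Lucene 4 loads codecs.
  3. Call IndexWriterConfig.setCodec(Codec) to specify the codec you just created.
  4. Use the IndexWriterConfig object as usual.

According to the Javadoc, BlockPostingsFormat creates .doc and .pos files in the index directory, whereas Lucene40PostingsFormat creates .frq and .prx files. So that is one way to tell whether Lucene is really using the block postings format.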
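
The file-extension check can be automated. The following is a minimal sketch that uses only the JDK, not a Lucene API: the class name PostingsFileCheck and its return labels are made up for illustration, and it assumes the segment files are visible as separate files in the directory. If Lucene packs a segment into a compound file, neither set of extensions will show up and the sketch reports "unknown".

```java
import java.io.File;

// Hypothetical helper (not part of Lucene): guesses which postings format
// wrote an index by looking at file extensions in the index directory.
final class PostingsFileCheck {

    // Returns "block" if .doc/.pos files are present, "lucene40" if
    // .frq/.prx files are present, "mixed" if both kinds are present,
    // and "unknown" if neither is found (for example, compound files).
    static String detect(File indexDir) {
        String[] names = indexDir.list();
        if (names == null) {
            throw new IllegalArgumentException("Not a readable directory: " + indexDir);
        }
        boolean block = false;
        boolean legacy = false;
        for (String name : names) {
            if (name.endsWith(".doc") || name.endsWith(".pos")) {
                block = true;
            }
            if (name.endsWith(".frq") || name.endsWith(".prx")) {
                legacy = true;
            }
        }
        if (block && legacy) {
            return "mixed";
        }
        if (block) {
            return "block";
        }
        if (legacy) {
            return "lucene40";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        // Defaults to the directory used by the example in this answer.
        File dir = new File(args.length > 0 ? args[0] : "/index_dir");
        System.out.println(detect(dir));
    }
}
```

Run it against the index directory after indexing, once with the setCodec call and once without, and compare the two results.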

I modified the example from the Lucene core Javadoc to test the block postings format. Here is the code (I hope it helps):


org.apache.lucene.codecs.Codec

# See http://www.romseysoftware.co.uk/2012/07/04/writing-a-new-lucene-codec/
# This file should be in /somewhere_in_your_classpath/META-INF/services/org.apache.lucene.codecs.Codec
# 
# List of codecs
lucene4examples.Lucene40WithBlockCodec
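
If Lucene complains that it cannot find the codec, it may help to confirm that this services file is actually visible on the classpath. The sketch below is a JDK-only diagnostic, not Lucene's own loading code: CodecServiceFileCheck is a hypothetical name, and the parsing assumes the usual provider-configuration conventions (one class name per line, '#' starts a comment).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;

// Hypothetical diagnostic (not part of Lucene): prints every codec services
// file found on the classpath together with the class names it lists.
final class CodecServiceFileCheck {

    static final String RESOURCE = "META-INF/services/org.apache.lucene.codecs.Codec";

    // Extracts class names from services-file content, dropping
    // '#' comments and blank lines.
    static List<String> parseProviders(Reader in) throws IOException {
        List<String> names = new ArrayList<String>();
        BufferedReader reader = new BufferedReader(in);
        String line;
        while ((line = reader.readLine()) != null) {
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash);
            }
            line = line.trim();
            if (!line.isEmpty()) {
                names.add(line);
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        ClassLoader loader = CodecServiceFileCheck.class.getClassLoader();
        Enumeration<URL> urls = loader.getResources(RESOURCE);
        while (urls.hasMoreElements()) {
            URL url = urls.nextElement();
            try (Reader in = new InputStreamReader(url.openStream(), StandardCharsets.UTF_8)) {
                System.out.println(url + " -> " + parseProviders(in));
            }
        }
    }
}
```

Run it with the same classpath as the indexing program. If no printed entry lists lucene4examples.Lucene40WithBlockCodec, the file is probably in the wrong location.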

Lucene40WithBlockCodec.java

package lucene4examples;

import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.codecs.block.BlockPostingsFormat;
import org.apache.lucene.codecs.lucene40.Lucene40Codec;

// Lucene 4.0 codec with block posting format

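public class Lucene40WithBlockCodec extends FilterCodec {

    public Lucene40WithBlockCodec() {
        // The first argument is the codec's name; everything that is not
        // overridden below is delegated to the stock Lucene 4.0 codec.
        super("Lucene40WithBlock", new Lucene40Codec());
    }

    // Replace only the postings format; all other formats come from the delegate.
    @Override
    public PostingsFormat postingsFormat() {
        return new BlockPostingsFormat();
    }

}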

BlockPostingsFormatExample.java

package lucene4examples;

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// This example is based on the one that comes with Lucene 4.0.0 core API Javadoc
// (http://lucene.apache.org/core/4_0_0/core/overview-summary.html)

public class BlockPostingsFormatExample {

    public static void main(String[] args) throws IOException, ParseException {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);

        // Store the index on disk:
        Directory directory = FSDirectory.open(new File("/index_dir"));
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);

        // If the following line of code is commented out, the original Lucene
        // 4.0 codec will be used.
        // Else, the Lucene 4.0 codec with block posting format
        // (http://blog.mikemccandless.com/2012/08/lucenes-new-blockpostingsformat-thanks.html)
        // will be used.
        config.setCodec(new Lucene40WithBlockCodec());

        IndexWriter iwriter = new IndexWriter(directory, config);
        Document doc = new Document();
        String text = "This is the text to be indexed.";
        doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
        iwriter.addDocument(doc);
        iwriter.close();

        // Now search the index:
        DirectoryReader ireader = DirectoryReader.open(directory);
        IndexSearcher isearcher = new IndexSearcher(ireader);
        // Parse a simple query that searches for "text":
        QueryParser parser = new QueryParser(Version.LUCENE_40, "fieldname", analyzer);
        Query query = parser.parse("text");
        ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
        System.out.println("hits.length = " + hits.length);
        // Iterate through the results:
        for (int i = 0; i < hits.length; i++) {
            Document hitDoc = isearcher.doc(hits[i].doc);
            System.out.println("text: " + hitDoc.get("fieldname"));
        }
        ireader.close();
        directory.close();
    }

}
answered 2012-10-22T21:41:36.697