java - Incremental indexing with Lucene

Question

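I am building an application in Java with Lucene 3.6 and want to make the indexing incremental. I have already created the index, and I read that what you have to do is open the existing index, compare the modification date stored for each indexed document with the modification date of the file, and, if they differ, delete the indexed document and add it again. My problem is that I don't know how to do this with Lucene in Java.

Thanks

My code is: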

public static void main(String[] args) 
    throws CorruptIndexException, LockObtainFailedException,
           IOException {

    File docDir = new File("D:\\PRUEBASLUCENE");
    File indexDir = new File("C:\\PRUEBA");

    Directory fsDir = FSDirectory.open(indexDir);
    Analyzer an = new StandardAnalyzer(Version.LUCENE_36);
    IndexWriter indexWriter
        = new IndexWriter(fsDir,an,MaxFieldLength.UNLIMITED);


    long numChars = 0L;
    for (File f : docDir.listFiles()) {
        String fileName = f.getName();
        Document d = new Document();
        d.add(new Field("Name",fileName,
                        Store.YES,Index.NOT_ANALYZED));
        d.add(new Field("Path",f.getPath(),Store.YES,Index.ANALYZED));
        long tamano = f.length();
        d.add(new Field("Size",""+tamano,Store.YES,Index.ANALYZED));
        long fechalong = f.lastModified();
        d.add(new Field("Modification_Date",""+fechalong,Store.YES,Index.ANALYZED));
        indexWriter.addDocument(d);
    }

    indexWriter.optimize();
    // Read numDocs() before close(); the writer should not be used once it is closed
    int numDocs = indexWriter.numDocs();
    indexWriter.close();

    System.out.println("Index Directory=" + indexDir.getCanonicalPath());
    System.out.println("Doc Directory=" + docDir.getCanonicalPath());
    System.out.println("num docs=" + numDocs);
    System.out.println("num chars=" + numChars);

}

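Thanks Edmondo1984, you helped me a lot.

In the end I wrote the code shown below. It stores a hash of the file and then checks the modification date.

Indexing 9300 files takes 15 seconds, and re-indexing (with nothing changed in the index, because no file was modified) also takes 15 seconds. Am I doing something wrong, or can I optimize the code so that it takes less time?

Thanks to jtahlborn, I managed to bring the IndexReader times for creating and updating the index level with each other. But shouldn't updating an existing index be faster than recreating it? Can the code be optimized further?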

if(IndexReader.indexExists(dir))
            {
                //reader is an IndexReader and is passed as a parameter to the function
                //searcher is an IndexSearcher and is passed as a parameter to the function
                term = new Term("Hash",String.valueOf(file.hashCode()));
                Query termQuery = new TermQuery(term);
                TopDocs topDocs = searcher.search(termQuery,1);
                if(topDocs.totalHits==1)
                {
                    Document doc;
                    int docId,comparedate;
                    docId=topDocs.scoreDocs[0].doc;
                    doc=reader.document(docId);
                    String dateIndString=doc.get("Modification_date");
                    long dateIndLong=Long.parseLong(dateIndString);
                    Date date_ind=new Date(dateIndLong);
                    String dateFichString=DateTools.timeToString(file.lastModified(), DateTools.Resolution.MINUTE);
                    long dateFichLong=Long.parseLong(dateFichString);
                    Date date_fich=new Date(dateFichLong);
                    //Compare the two dates
                    comparedate=date_fich.compareTo(date_ind);
                    if(comparedate>=0)
                    {
                        if(comparedate==0)
                        {
                            //If the comparison is 0, do nothing
                            flag=2;
                        }
                        else
                        {
                            //If the comparison is >0, call updateDocument
                            flag=1;
                        }
                    }

score 4 · Accepted Answer

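According to the Lucene data model, you store documents in the index. Each document holds the fields you want to index, the so-called "analyzed" fields, as well as fields that are not analyzed, where you can store a timestamp and other information you may need later.

I have the feeling that you are confusing files with documents: in your first post you talked about documents, and now you are trying to call IndexFileNames.isDocStoreFile(file.getName()), which only tells you whether a file is one of the files of a Lucene index.

If you understand the Lucene object model, writing the code you need takes about three minutes:

You have to check whether the document already exists in the index by simply querying Lucene (for example, by storing a non-analyzed field that contains a unique identifier).
If your query returns 0 documents, you add the new document to the index.
If your query returns 1 document, you get its "timestamp" field and compare it with the timestamp of the new document you are trying to store. If necessary, you can then use the document's docId to delete it from the index and add the new one.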
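
The three steps above can be sketched without Lucene on the classpath. In this minimal sketch a plain Map stands in for the index, and the class name IncrementalIndexSketch, the Action enum and the process method are all hypothetical names chosen for illustration. The comments note which Lucene 3.6 calls would presumably take the place of each map operation.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Decision logic for incremental indexing. A plain Map stands in for the
 * Lucene index (hypothetical stand-in, not Lucene API).
 */
class IncrementalIndexSketch {

    /** What the indexer should do with one file. */
    enum Action { ADD, UPDATE, SKIP }

    // Stand-in for the index: unique id (here the file path) -> stored timestamp.
    private final Map<String, Long> index = new HashMap<>();

    /**
     * Decides what to do with one file and applies it to the stand-in index.
     * With Lucene 3.6 the lookup would be a TermQuery on a NOT_ANALYZED id
     * field, ADD would map to IndexWriter.addDocument and UPDATE to
     * IndexWriter.updateDocument (delete by term, then add).
     */
    Action process(String id, long lastModified) {
        Long stored = index.get(id);
        if (stored == null) {
            // The query returned 0 documents: add the new document.
            index.put(id, lastModified);
            return Action.ADD;
        }
        if (stored.longValue() != lastModified) {
            // The timestamps differ: replace the old document.
            index.put(id, lastModified);
            return Action.UPDATE;
        }
        // Same timestamp: nothing to write.
        return Action.SKIP;
    }

    public static void main(String[] args) {
        IncrementalIndexSketch sketch = new IncrementalIndexSketch();
        System.out.println(sketch.process("a.txt", 100L)); // ADD
        System.out.println(sketch.process("a.txt", 100L)); // SKIP
        System.out.println(sketch.process("a.txt", 250L)); // UPDATE
    }
}
```

In this sketch the SKIP branch performs no write at all, which is the case an incremental run is meant to make cheap.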

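If, on the other hand, you are sure that you always want to overwrite the previous value, you can refer to the following snippet from Lucene in Action: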

public void testUpdate() throws IOException {
    assertEquals(1, getHitCount("city", "Amsterdam"));
    IndexWriter writer = getWriter();
    Document doc = new Document();
    doc.add(new Field("id", "1",
                      Field.Store.YES,
                      Field.Index.NOT_ANALYZED));
    doc.add(new Field("country", "Netherlands",
                      Field.Store.YES,
                      Field.Index.NO));
    doc.add(new Field("contents",
                      "Den Haag has a lot of museums",
                      Field.Store.NO,
                      Field.Index.ANALYZED));
    doc.add(new Field("city", "Den Haag",
                      Field.Store.YES,
                      Field.Index.ANALYZED));
    writer.updateDocument(new Term("id", "1"),
                          doc);
    writer.close();
    assertEquals(0, getHitCount("city", "Amsterdam"));
    assertEquals(1, getHitCount("city", "Den Haag"));
}

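As you can see, the snippet uses a non-analyzed ID, as I suggested, to keep a simple attribute that can be queried, and it uses the updateDocument method to first delete and then re-add the document.

You might want to check the javadoc directly: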

http://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/index/IndexWriter.html#updateDocument(org.apache.lucene.index.Term,org.apache.lucene.document.Document)
