python - PyLucene indexer and retriever example

Question

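I am new to Lucene. I want to write sample code for PyLucene 6.5 in Python 3, so I adapted the existing sample code to that version. However, I could find very little documentation, and I am not sure whether my changes are correct.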

# indexer.py
import sys
import lucene

from java.io import File
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.document import Document, Field, StringField, FieldType
from org.apache.lucene.index import IndexWriter, IndexWriterConfig
from org.apache.lucene.store import SimpleFSDirectory, FSDirectory
from org.apache.lucene.util import Version

if __name__ == "__main__":
    lucene.initVM()
    indexPath = File("index/").toPath()
    indexDir = FSDirectory.open(indexPath)
    writerConfig = IndexWriterConfig(StandardAnalyzer())
    writer = IndexWriter(indexDir, writerConfig)

    print("%d docs in index" % writer.numDocs())
    print("Reading lines from sys.stdin...")

    tft = FieldType()
    tft.setStored(True)
    tft.setTokenized(True)
    for n, l in enumerate(sys.stdin):
        doc = Document()
        doc.add(Field("text", l, tft))
        writer.addDocument(doc)
    print("Indexed %d lines from stdin (%d docs in index)" % (n, writer.numDocs()))
    print("Closing index of %d docs..." % writer.numDocs())
    writer.close()

This code reads lines from standard input and stores them in the index directory.

# retriever.py
import sys
import lucene

from java.io import File
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.document import Document, Field
from org.apache.lucene.search import IndexSearcher
from org.apache.lucene.index import IndexReader, DirectoryReader
from org.apache.lucene.queryparser.classic import QueryParser
from org.apache.lucene.store import SimpleFSDirectory, FSDirectory
from org.apache.lucene.util import Version

if __name__ == "__main__":
    lucene.initVM()
    analyzer = StandardAnalyzer()
    indexPath = File("index/").toPath()
    indexDir = FSDirectory.open(indexPath)
    reader = DirectoryReader.open(indexDir)
    searcher = IndexSearcher(reader)

    query = QueryParser("text", analyzer).parse("hello")
    MAX = 1000
    hits = searcher.search(query, MAX)

    print("Found %d document(s) that matched query '%s':" % (hits.totalHits, query))
    for hit in hits.scoreDocs:
        print(hit.score, hit.doc, hit.toString())
        doc = searcher.doc(hit.doc)
        print(doc.get("text").encode("utf-8"))

We should then be able to search the index with retriever.py, but it returns nothing. What is wrong with it?

score 2 · Accepted Answer

I think the best way for you to get started is to download the PyLucene tarball (for the version of your choice):

https://www.apache.org/dist/lucene/pylucene/

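Inside you will find a test3/ folder containing Python tests (for Python 3; otherwise test2/ for Python 2). These cover common operations such as indexing, reading, searching and more. Given how badly PyLucene's documentation is lacking, I found them extremely helpful.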

Check out test_Pylucene.py in particular.

Note

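If the changelog is not intuitive enough for you, reading the tests is also a good way to quickly grasp what changed and to adapt your code across versions.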

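(Why I am not providing code in this answer: the problem with code snippets in PyLucene answers on SO is that they become outdated as soon as a new version is released, as we can see with most of the existing answers.)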

score 1 · Accepted Answer

In []: tft.indexOptions()
Out[]: <IndexOptions: NONE>

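Although the documentation says that DOCS_AND_FREQS_AND_POSITIONS is the default, that is no longer the case. It is the default for a TextField; on a plain FieldType you must call setIndexOptions explicitly. With the index options left at NONE, the field is stored but not indexed, which is why the search finds nothing.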
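As a rough sketch of that fix, the field type from indexer.py could be built as below. The helper names make_indexed_field_type and make_text_document are made up for illustration and are not part of PyLucene; the sketch also assumes that lucene.initVM() has already been called by the caller, and that IndexOptions is imported from org.apache.lucene.index as in Lucene 6.x.

```python
def make_indexed_field_type():
    """Return a FieldType that is stored, tokenized, and actually indexed.

    Hypothetical helper; assumes lucene.initVM() was already called.
    """
    # Imported inside the function so nothing touches the JVM at import time.
    from org.apache.lucene.document import FieldType
    from org.apache.lucene.index import IndexOptions

    tft = FieldType()
    tft.setStored(True)      # keep the original text so doc.get("text") works
    tft.setTokenized(True)   # run the analyzer over the value
    # The missing step: without it indexOptions() stays NONE
    # and the field never reaches the inverted index.
    tft.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS)
    return tft


def make_text_document(line, use_text_field=False):
    """Wrap one input line in a Document with a searchable "text" field.

    Hypothetical helper; assumes lucene.initVM() was already called.
    """
    from org.apache.lucene.document import Document, Field, TextField

    doc = Document()
    if use_text_field:
        # TextField is already tokenized and indexed with positions.
        doc.add(TextField("text", line, Field.Store.YES))
    else:
        doc.add(Field("text", line, make_indexed_field_type()))
    return doc
```

In indexer.py this would replace the three tft lines and the doc construction inside the loop, for example writer.addDocument(make_text_document(l)). Documents that were already written with index options NONE stay unsearchable, so the index directory needs to be rebuilt after the change.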
