lucene - Which term vector option should I use in Lucene?

Question

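I am indexing with Lucene and am only interested in getting the IDs of the relevant documents back from Lucene (that is, not field values or any highlighting information). Given these requirements, which term vector option should I use without affecting search performance (speed) or quality (results)? I will also be using MoreLikeThis, so I do not want ...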

TermVector.YES—Records the unique terms that occurred, and their counts, in each document, but doesn’t store any positions or offsets information

TermVector.WITH_POSITIONS—Records the unique terms and their counts, and also the positions of each occurrence of every term, but no offsets

TermVector.WITH_OFFSETS—Records the unique terms and their counts, with the offsets (start and end character position) of each occurrence of every term, but no positions

TermVector.WITH_POSITIONS_OFFSETS—Stores unique terms and their counts, along with positions and offsets
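
To make the difference between the four options concrete, here is a minimal plain-Java sketch of the data each one describes. It is not Lucene code: the class `TermVectorSketch`, the `TermEntry` type, and the whitespace tokenizer are hypothetical stand-ins chosen for illustration, and a real analyzer would tokenize differently.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java illustration only; this is NOT Lucene's implementation or API. */
final class TermVectorSketch {

    /** Per-term data for one document. */
    static final class TermEntry {
        // Unique term + count: what every option (including YES) records.
        public int count = 0;
        // Token index of each occurrence: added by WITH_POSITIONS.
        public final List<Integer> positions = new ArrayList<>();
        // {startChar, endChar} of each occurrence: added by WITH_OFFSETS.
        public final List<int[]> offsets = new ArrayList<>();
        // WITH_POSITIONS_OFFSETS corresponds to keeping all three fields.
    }

    /** Splits on whitespace (assumed stand-in for an analyzer) and collects the data. */
    static Map<String, TermEntry> build(String text) {
        Map<String, TermEntry> vector = new LinkedHashMap<>();
        int position = 0;
        int i = 0;
        while (i < text.length()) {
            // Skip whitespace between tokens.
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i >= text.length()) {
                break;
            }
            int start = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            String term = text.substring(start, i);
            TermEntry entry = vector.computeIfAbsent(term, k -> new TermEntry());
            entry.count++;
            entry.positions.add(position++);
            entry.offsets.add(new int[] {start, i});
        }
        return vector;
    }

    public static void main(String[] args) {
        Map<String, TermEntry> v = build("blah blah blah date blah ID blah name blah");
        for (Map.Entry<String, TermEntry> e : v.entrySet()) {
            System.out.println(e.getKey() + " count=" + e.getValue().count
                    + " positions=" + e.getValue().positions);
        }
    }
}
```

Read this way, the options differ only in how much per-occurrence detail is kept alongside the term counts, which is why the richer options cost more index space.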

Thanks.

score 0 · Accepted Answer

It depends on the type of queries you run. If you have any data that is related to your ID, then you will need positions and/or offsets.

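If you have a document like this: "blah blah blah date blah ID blah name blah"

and you only want to find that specific ID, then TermVector.YES is fine. However, if you want to find an ID based on how close it is to the date or the name (using advanced queries), you will need the additional term positions.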
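
The distinction above can be sketched in plain Java. This is only an illustration of why a proximity match needs per-occurrence positions while an exact lookup does not; it is not Lucene code and does not show where Lucene stores that data. The class `ProximitySketch`, its method names, and the whitespace tokenizer are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java illustration only; not Lucene code. */
final class ProximitySketch {

    /** Whitespace tokenizer, a hypothetical stand-in for a real analyzer. */
    static String[] tokens(String text) {
        return text.trim().split("\\s+");
    }

    /** Exact lookup: only needs to know whether the term occurs at all. */
    static boolean containsTerm(String text, String term) {
        for (String t : tokens(text)) {
            if (t.equals(term)) {
                return true;
            }
        }
        return false;
    }

    /** Proximity check: needs the position of every occurrence of both terms. */
    static boolean within(String text, String a, String b, int maxDistance) {
        String[] toks = tokens(text);
        List<Integer> posA = new ArrayList<>();
        List<Integer> posB = new ArrayList<>();
        for (int i = 0; i < toks.length; i++) {
            if (toks[i].equals(a)) {
                posA.add(i);
            }
            if (toks[i].equals(b)) {
                posB.add(i);
            }
        }
        // True if any occurrence of a is at most maxDistance tokens from any occurrence of b.
        for (int pa : posA) {
            for (int pb : posB) {
                if (Math.abs(pa - pb) <= maxDistance) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String doc = "blah blah blah date blah ID blah name blah";
        System.out.println(containsTerm(doc, "ID"));
        System.out.println(within(doc, "ID", "date", 2));
        System.out.println(within(doc, "ID", "date", 1));
    }
}
```

In the sample document, "date" is token 3 and "ID" is token 5, so the check succeeds with a distance of 2 and fails with a distance of 1.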

You can always just try it. It is a simple change, assuming you do not have to unit test against a billion-record index or something :)

By the way, take a look at "Lucene in Action"; the book covers all of this information.
