lucene - Lucene complex structure search

Question

Basically, I have a pretty simple database that I would like to index with Lucene. The domains are:

// Person domain
class Person {
  Set<Pair> keys;
}

// Pair domain
class Pair {
  KeyItem keyItem;
  String value;
}

// KeyItem domain, name is unique field within the DB (!!)
class KeyItem{
  String name;
}

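I have tens of millions of profiles and hundreds of millions of Pairs, but since most KeyItem "name" values are duplicated, there are only a few dozen KeyItem instances. I came up with this structure to save on KeyItem instances.

Basically, any profile with any set of fields can be saved into this structure. Let's say we have a profile with these attributes: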

- name: Andrew Morton
- education: University of New South Wales
- country: Australia
- occupation: Linux programmer

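To store it, we will have one Profile instance, 4 KeyItem instances (name, education, country and occupation), and 4 Pair instances with the values "Andrew Morton", "University of New South Wales", "Australia" and "Linux programmer".

All other profiles will reference all or some of the same KeyItem instances: name, education, country and occupation.

My question is how to index all of this so that I can search Profiles for particular values of KeyItem::name and Pair::value. Ideally, I would like this kind of query to work: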

name:Andrew* occupation:Linux*

Should I create a custom indexer and searcher? Or can I use the standard ones and somehow map KeyItem and Pair to Lucene components?

score 3 · Accepted Answer

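I believe you can do this with standard Lucene methods. I would:

- Translate every Profile to a Lucene Document.
- Translate every Pair to a Field in this Document. All Fields need to be indexed, but not necessarily stored.
- Add a stored Field with the profile ID to the Document.
- Search using name:value pairs, similar to your example.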
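
To make the mapping in the list above concrete, here is a minimal plain-Java sketch. It is not Lucene code: it only models "one profile = one document, one Pair = one field" and emulates a query like name:Andrew* occupation:Linux*. The class name ProfileFlattenSketch, the profileId field name and the helper methods are made up for illustration. The sketch assumes every clause must match and compares against the start of the whole value, case-insensitively; in real Lucene the matching depends on the analyzer and on the query parser's default operator.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Plain-Java model of "one profile = one document, one Pair = one field".
// No Lucene classes are used; all names here are hypothetical.
class ProfileFlattenSketch {

    // Simplified versions of the question's domain classes.
    static class KeyItem {
        final String name;
        KeyItem(String name) { this.name = name; }
    }

    static class Pair {
        final KeyItem keyItem;
        final String value;
        Pair(KeyItem keyItem, String value) {
            this.keyItem = keyItem;
            this.value = value;
        }
    }

    // Assumed name for the field that carries the profile ID.
    static final String ID_FIELD = "profileId";

    // Flatten one profile: the KeyItem name becomes the field name,
    // the Pair value becomes a field value (a field may hold several values).
    static Map<String, List<String>> toDocument(String profileId, List<Pair> pairs) {
        Map<String, List<String>> doc = new LinkedHashMap<>();
        doc.computeIfAbsent(ID_FIELD, k -> new ArrayList<>()).add(profileId);
        for (Pair p : pairs) {
            doc.computeIfAbsent(p.keyItem.name, k -> new ArrayList<>()).add(p.value);
        }
        return doc;
    }

    // True if some value of the field starts with the prefix (case-insensitive).
    static boolean fieldHasPrefix(Map<String, List<String>> doc, String field, String prefix) {
        String wanted = prefix.toLowerCase(Locale.ROOT);
        for (String value : doc.getOrDefault(field, List.of())) {
            if (value.toLowerCase(Locale.ROOT).startsWith(wanted)) {
                return true;
            }
        }
        return false;
    }

    // Return the IDs of all documents in which every field:prefix clause matches.
    static List<String> search(List<Map<String, List<String>>> docs, Map<String, String> clauses) {
        List<String> ids = new ArrayList<>();
        for (Map<String, List<String>> doc : docs) {
            boolean allMatch = true;
            for (Map.Entry<String, String> clause : clauses.entrySet()) {
                if (!fieldHasPrefix(doc, clause.getKey(), clause.getValue())) {
                    allMatch = false;
                    break;
                }
            }
            if (allMatch) {
                ids.add(doc.get(ID_FIELD).get(0));
            }
        }
        return ids;
    }
}
```

Note that the KeyItem instances disappear as objects in this layout: only their names survive, as field names, which is why a few dozen KeyItems translate into a few dozen distinct fields.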

If you choose bare Lucene, you will need a custom Indexer and Searcher, but they are not hard to build. It may be easier for you to use Solr, where you need less programming. However, I do not know if Solr allows an open-ended schema like the one I described - I believe you have to predefine all field names, so this may prevent you from using Solr.

score 1 · Accepted Answer

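Lucene basically returns a list of hit documents based on the occurrence of keywords, regardless of the type of query. The basic segment reader checks for the presence of the keyword across the whole index database rather than in a specific field of interest.

I suggest introducing a custom searcher that does the following:

1. Read the shortlisted documents using the document ID. (I guess you could override the collect() method to pass on the document IDs from the search() of the IndexSearcher class.)
2. Fetch the field value and check for the presence of the expected keyword.
3. Score the document only if it meets your custom criteria.

Note: you can modify the default standard searcher instead of writing a custom searcher from scratch.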
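
The three steps above can be sketched roughly as follows. This is a plain-Java stand-in, not Lucene's Collector API: the class PostFilterSketch, the filterHits method and the in-memory storedFields map are all hypothetical, and the "check" is assumed to be a simple case-insensitive substring test.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Plain-Java stand-in for the post-filtering idea; it does not call Lucene.
class PostFilterSketch {

    // Walk the shortlisted doc IDs (step 1), load the field value of each
    // document (step 2), and keep only documents whose field contains the
    // expected keyword (the acceptance test that step 3 relies on).
    static List<Integer> filterHits(List<Integer> shortlistedDocIds,
                                    Map<Integer, Map<String, String>> storedFields,
                                    String field,
                                    String keyword) {
        String needle = keyword.toLowerCase(Locale.ROOT);
        List<Integer> accepted = new ArrayList<>();
        for (int docId : shortlistedDocIds) {
            Map<String, String> fields = storedFields.get(docId);
            String value = (fields == null) ? null : fields.get(field);
            if (value != null && value.toLowerCase(Locale.ROOT).contains(needle)) {
                accepted.add(docId);
            }
        }
        return accepted;
    }
}
```

Scoring is left out here; in this model only the documents returned by filterHits would go on to be scored.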
