java - Full-text search solution for Java?

Question

There is a large set of entities of different kinds:

interface Entity {
}

interface Entity1 extends Entity {
  String field1();
  String field2();
}

interface Entity2 extends Entity {
  String field1();
  String field2();
  String field3();
}

interface Entity3 extends Entity {
  String field12();
  String field23();
  String field34();
}

Set<Entity> entities = ...

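The task is to implement full-text search over this set. By full-text search I mean that I only need to get the entities that contain the substring I am looking for (I don't need to know the exact property, the exact offset of the substring, etc.). In the current implementation, the Entity interface has a method matches(String):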

interface Entity {
  boolean matches(String text);
}

Each entity class implements it according to its own internals:

class Entity1Impl implements Entity1 {
  public String field1() {...}
  public String field2() {...}

  public boolean matches(String text) {
    return field1().toLowerCase().contains(text.toLowerCase()) ||
           field2().toLowerCase().contains(text.toLowerCase());
  }
}
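As a side note on the matches(String) approach above: each call lower-cases the query once per field and copies every field, which adds up when fields reach 10 KB. A minimal sketch of a shared helper that avoids those copies, assuming String.regionMatches with ignoreCase is an acceptable notion of case-insensitivity (it compares character by character, so it may differ from toLowerCase() for a few non-ASCII characters). TextMatch and its method names are hypothetical, not part of the original code:

```java
// Hypothetical helper (not from the original code): case-insensitive
// substring checks without allocating lower-cased copies of the fields.
final class TextMatch {
    private TextMatch() {}

    // True if haystack contains needle, ignoring case.
    static boolean containsIgnoreCase(String haystack, String needle) {
        int max = haystack.length() - needle.length();
        for (int i = 0; i <= max; i++) {
            // regionMatches(true, ...) compares the two regions ignoring case
            if (haystack.regionMatches(true, i, needle, 0, needle.length())) {
                return true;
            }
        }
        return false;
    }

    // True if any of the given fields contains needle, ignoring case.
    static boolean anyContains(String needle, String... fields) {
        for (String field : fields) {
            if (containsIgnoreCase(field, needle)) {
                return true;
            }
        }
        return false;
    }
}
```

With such a helper, Entity1Impl.matches(text) could reduce to something like TextMatch.anyContains(text, field1(), field2()). This is still a linear scan over every field; it only trims the per-call overhead.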

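I believe this approach is really bad (even though it works). I am considering using Lucene to build an index every time I get a new set. By index I mean a content -> id mapping. The content is just a trivial "sum" of all the fields I am considering, so the content of Entity1 would be the concatenation of field1() and field2(). I have some doubts about performance: building an index is usually a fairly expensive operation, so I am not sure it would help.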
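To make the content -> id idea concrete, here is a rough sketch in plain Java. It is not Lucene and makes no claim about Lucene's API or performance; it only illustrates the mapping, with the concatenated content lower-cased once per entity instead of once per query. ContentIndex, its methods, and the string ids are assumptions made up for illustration:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for an index: a flat "id -> content" table.
final class ContentIndex {
    // id -> lower-cased concatenation of the entity's searchable fields
    private final Map<String, String> contentById = new LinkedHashMap<>();

    // Fields are joined with a newline so that a query without a newline
    // cannot match across the boundary between two fields.
    void add(String id, String... fields) {
        contentById.put(id, String.join("\n", fields).toLowerCase(Locale.ROOT));
    }

    // Returns the ids whose content contains the query, ignoring case.
    List<String> search(String query) {
        String needle = query.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> entry : contentById.entrySet()) {
            if (entry.getValue().contains(needle)) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }
}
```

Building this table is a single pass over the set, so it only pays off if the same set is queried more than once. Each search is still a linear scan over all content.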

Do you have any other suggestions?

Clarifying details:

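Set<Entity> entities = ... is ~10000 items.
Set<Entity> entities = ... is not read from a database, so I can't just add a where ... condition. The data source is far from trivial, so I can't solve the problem on that side.
Entities should be thought of as short articles, so some fields may be up to 10 KB while others may be around 10 bytes.
I need to run this search quite often, but both the query string and the original set are different each time, so it looks like I can't build the index just once (because the set of entities differs every time).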

score 2 · Accepted Answer

I would strongly consider using Lucene with SOLR. http://lucene.apache.org/java/docs/index.html

answered 2011-09-25T10:46:29.810

score 1 · Accepted Answer

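For such a complex object domain, you could use a Lucene wrapper tool such as Compass, which lets you quickly map your object graph to a Lucene index using the same approach as an ORM (like Hibernate).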
