I have an index that contains all the documentation for our products. The fields of each document are:
- group
- name
- version
- file
- ...
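Because most of our documentation has more than one site, I create one document in the index for each site. So when I search for a product by group, name and version, I get several results. But sometimes I want only one result for this combination (group, name and version), no matter how many documents exist for the product.
Therefore I used the DuplicateFilter:
Because this filter works on a single field only (not on a combination of fields), I created another field (productkey). In this field I store an id for the product (the MD5 hash value of the combination of the group, name and version fields). Then I told the DuplicateFilter to use this field to filter out duplicates.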
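To make the productkey step concrete, here is a minimal sketch of how such a key could be computed. The class name ProductKey, the method name of, and the "|" separator are hypothetical and only for illustration; the exact way the three fields are joined is an assumption, so the values it produces will not necessarily match the hashes in the listing.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public final class ProductKey {
    private ProductKey() {}

    // Hypothetical helper: joins group, name and version with '|' (an assumed
    // separator) and returns the MD5 digest as 32 lowercase hex characters.
    public static String of(String group, String name, String version) {
        String combined = group + "|" + name + "|" + version;
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(combined.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to an unsigned value so negative bytes format as two hex digits.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

The same three field values always give the same key, so every document of one product ends up with an identical productkey value.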
But now I do not get all of the expected search results. For example:
Documents:
group | name | version | productkey | description
a | one | 1.0 | 808d8f96138b7dec7cc69c2769176424 | ...
a | two | 1.0 | 0225635fc76ed8b88c65c7eb9f2ec1f9 | ...
a | two | 1.0 | 0225635fc76ed8b88c65c7eb9f2ec1f9 | ...
a | three| 1.0 | 621e2597b189ee8d9448f6bfb26c5a8f | ...
a | three| 1.0 | 621e2597b189ee8d9448f6bfb26c5a8f | ...
a | three| 1.0 | 621e2597b189ee8d9448f6bfb26c5a8f | ...
a | three| 1.0 | 621e2597b189ee8d9448f6bfb26c5a8f | ...
a | three| 1.0 | 621e2597b189ee8d9448f6bfb26c5a8f | ...
a | four | 1.0 | 3d03056a0d0f29f63477ee1f130b7ae8 | ...
a | four | 1.0 | 3d03056a0d0f29f63477ee1f130b7ae8 | ...
a | four | 1.0 | 3d03056a0d0f29f63477ee1f130b7ae8 | ...
a | four | 1.0 | 3d03056a0d0f29f63477ee1f130b7ae8 | ...
a | four | 1.0 | 3d03056a0d0f29f63477ee1f130b7ae8 | ...
a | four | 1.0 | 3d03056a0d0f29f63477ee1f130b7ae8 | ...
a | five | 1.0 | b2d49bc320325007e1466a38e41ce69a | ...
a | five | 1.0 | b2d49bc320325007e1466a38e41ce69a | ...
a | five | 1.0 | b2d49bc320325007e1466a38e41ce69a | ...
a | five | 1.0 | b2d49bc320325007e1466a38e41ce69a | ...
a | five | 1.0 | b2d49bc320325007e1466a38e41ce69a | ...
zz | one | 1.0 | b610a470c9a7d2cc928725e1fb1a577a | ...
zz | one | 1.0 | b610a470c9a7d2cc928725e1fb1a577a | ...
zz | one | 1.0 | b610a470c9a7d2cc928725e1fb1a577a | ...
zz | two | 1.0 | f5bb84453af30dd5f229d04cdb787dec | ...
zz | three| 1.0 | 4b86d91feded953e57fb3d1ccbf0fc6e | ...
zz | three| 1.0 | 4b86d91feded953e57fb3d1ccbf0fc6e | ...
zz | three| 1.0 | 4b86d91feded953e57fb3d1ccbf0fc6e | ...
Result:
group | name | version | productkey
a | two | 1.0 | 0225635fc76ed8b88c65c7eb9f2ec1f9
a | three| 1.0 | 621e2597b189ee8d9448f6bfb26c5a8f
zz | two | 1.0 | f5bb84453af30dd5f229d04cdb787dec
So I am missing these products:
group | name | version | productkey
a | one | 1.0 | 808d8f96138b7dec7cc69c2769176424
a | four | 1.0 | 3d03056a0d0f29f63477ee1f130b7ae8
a | five | 1.0 | b2d49bc320325007e1466a38e41ce69a
zz | one | 1.0 | b610a470c9a7d2cc928725e1fb1a577a
zz | three| 1.0 | 4b86d91feded953e57fb3d1ccbf0fc6e
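For reference, the behaviour I expect can be sketched in plain Java without Lucene: keep the first row seen for each productkey. This is only an illustration of the expected outcome, not how DuplicateFilter is implemented; the class name ExpectedDedup and the method keepFirstPerKey are hypothetical, and each row is assumed to be a String array with the productkey at index 3.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public final class ExpectedDedup {
    private ExpectedDedup() {}

    // Keeps the first row seen for each productkey (assumed to be at index 3),
    // preserving the order in which the keys first appear.
    public static List<String[]> keepFirstPerKey(List<String[]> rows) {
        Map<String, String[]> firstByKey = new LinkedHashMap<String, String[]>();
        for (String[] row : rows) {
            String key = row[3];
            if (!firstByKey.containsKey(key)) {
                firstByKey.put(key, row);
            }
        }
        return new ArrayList<String[]>(firstByKey.values());
    }
}
```

Applied to the 26 documents listed above, this yields one row for each of the 8 distinct products, which is the result I expected from the filter.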
This is the code I use to instantiate the filter:
DuplicateFilter filter = new DuplicateFilter("productkey");
filter.setKeepMode(DuplicateFilter.KM_USE_FIRST_OCCURRENCE);
filter.setProcessingMode(DuplicateFilter.PM_FULL_VALIDATION);
Did I make a mistake, or is this a bug in the DuplicateFilter (maybe caused by long field values, etc.)?
I am using Lucene 3.6.