elasticsearch - Select distinct with Elasticsearch

Question

I have a set of documents belonging to a few authors:

[
  { id: 1, author_id: 'mark', content: [...] },
  { id: 2, author_id: 'pierre', content: [...] },
  { id: 3, author_id: 'pierre', content: [...] },
  { id: 4, author_id: 'mark', content: [...] },
  { id: 5, author_id: 'william', content: [...] },
  ...
]

I'd like to retrieve and paginate a distinct selection of the best-matching documents, one per author id:

[
  { id: 1, author_id: 'mark', content: [...], _score: 100 },
  { id: 3, author_id: 'pierre', content: [...], _score: 90 },
  { id: 5, author_id: 'william', content: [...], _score: 80 },
  ...
]

Here's what I'm currently doing (pseudo-code):

unique_docs = res.results.to_a.uniq{ |doc| doc.author_id }

The problem is pagination: how do I select 20 "distinct" documents?

Some people point to terms facets, but I'm not actually building a tag cloud.

Thanks,
Adit

score 4 · Accepted Answer

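Since Elasticsearch does not currently provide a group_by equivalent, here is my attempt to do it manually.
While the ES community is working on a direct solution to this problem (probably a plugin), this is a basic attempt that works for my needs.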

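Assumptions

I'm looking for relevant content.
I assume the first 300 docs are relevant, so I restrict my search to this selection, regardless of whether many or some of them come from the same few authors.
For my needs I didn't "really" need full pagination; a "show more" button updated through ajax was enough.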

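Drawbacks

Results are not precise: since we fetch 300 docs at a time, we don't know how many unique docs will come out (it could be 300 docs from the same author!). You should check whether this fits your average number of docs per author, and probably consider a limit.
You need to run 2 queries (paying the cost of a second remote call):
- the first query asks for the 300 relevant docs with just these fields: id and author_id
- the second query retrieves the full docs for the paginated ids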

Here's some ruby pseudo-code: https://gist.github.com/saxxi/6495116
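
As a rough, self-contained sketch of the two-query approach described above (this is not the content of the gist): the search backend is replaced by in-memory stand-ins, so `Hit`, `SAMPLE_HITS`, `SAMPLE_STORE`, `fetch_candidates`, `distinct_page_ids` and `fetch_full_docs` are all hypothetical names chosen for illustration. In a real setup the two `fetch_*` methods would each issue one Elasticsearch request.

```ruby
# Hypothetical stand-in for a lightweight search hit (id + author_id + score).
Hit = Struct.new(:id, :author_id, :score)

# Illustrative data only; it mirrors the example in the question.
SAMPLE_HITS = [
  Hit.new(1, 'mark', 100),
  Hit.new(2, 'pierre', 85),
  Hit.new(3, 'pierre', 90),
  Hit.new(4, 'mark', 70),
  Hit.new(5, 'william', 80)
].freeze

# Illustrative "full documents", keyed by id.
SAMPLE_STORE = SAMPLE_HITS.map { |h| [h.id, { id: h.id, author_id: h.author_id }] }.to_h.freeze

# Query 1 (stubbed): the `limit` best hits, carrying only id and author_id.
def fetch_candidates(hits, limit = 300)
  hits.sort_by { |hit| -hit.score }.first(limit)
end

# Keep the best-scoring hit per author, then cut out the requested page.
# Array#uniq keeps the first occurrence, which is the best one because
# the candidates are already sorted by descending score.
def distinct_page_ids(candidates, page:, per_page:)
  candidates.uniq(&:author_id)
            .slice((page - 1) * per_page, per_page)
            .to_a # slice returns nil past the end; nil.to_a is []
            .map(&:id)
end

# Query 2 (stubbed): load the full documents for the page, keeping the order.
def fetch_full_docs(store, ids)
  ids.map { |id| store.fetch(id) }
end

candidates = fetch_candidates(SAMPLE_HITS)
page_ids = distinct_page_ids(candidates, page: 1, per_page: 2)
p page_ids # => [1, 3]
p fetch_full_docs(SAMPLE_STORE, page_ids)
```

Note that the number of distinct pages is only known after deduplication, which is why a "show more" button fits this approach better than numbered pagination.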

score 0 · Accepted Answer

The 'group_by' issue has since been addressed: you can use this feature as of Elasticsearch 1.3.0 (#6124).

If you search with the following query,

{
    "aggs": {
        "user_count": {
            "terms": {
                "field": "author_id",
                "size": 0
            }
        }
    }
}

you will get this result:

{
  "took" : 123,
  "timed_out" : false,
  "_shards" : { ... },
  "hits" : { ... },
  "aggregations" : {
    "user_count" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [ {
        "key" : "mark",
        "doc_count" : 87350
      }, {
        "key" : "pierre",
        "doc_count" : 41809
      }, {
        "key" : "william",
        "doc_count" : 24476
      } ]
    }
  }
}
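
The terms aggregation above returns only the author keys and counts, not the best document for each author. Assuming the 1.3.0 feature referred to is the `top_hits` aggregation, nesting it under the terms aggregation should return one best-scoring hit per author. The sketch below only builds the request body and reads a hand-written response of the expected shape; the names `distinct_by_author_body`, `best_hits`, `by_author` and `best_doc` are hypothetical, and `SAMPLE_RESPONSE` is illustrative data, not real output.

```ruby
require 'json'

# Build a search body: one bucket per author, one best hit per bucket.
def distinct_by_author_body(query, authors: 20)
  {
    query: query,
    size: 0, # skip the flat hit list; only the aggregation is needed
    aggs: {
      by_author: {
        terms: { field: 'author_id', size: authors },
        aggs: { best_doc: { top_hits: { size: 1 } } }
      }
    }
  }
end

# Pull the single best hit out of each author bucket of a parsed response.
def best_hits(response)
  buckets = response.fetch('aggregations').fetch('by_author').fetch('buckets')
  buckets.map { |bucket| bucket.fetch('best_doc').fetch('hits').fetch('hits').first }
end

# Hand-written response fragment (illustrative, not captured from a cluster).
SAMPLE_RESPONSE = {
  'aggregations' => {
    'by_author' => {
      'buckets' => [
        { 'key' => 'mark',
          'best_doc' => { 'hits' => { 'hits' => [{ '_id' => '1', '_score' => 100.0 }] } } },
        { 'key' => 'pierre',
          'best_doc' => { 'hits' => { 'hits' => [{ '_id' => '3', '_score' => 90.0 }] } } }
      ]
    }
  }
}.freeze

puts JSON.pretty_generate(distinct_by_author_body({ match: { content: 'ruby' } }))
p best_hits(SAMPLE_RESPONSE).map { |hit| hit['_id'] }
```

By default the terms buckets are ordered by document count, not by score, so the hits may still need sorting by `_score` on the client (or an explicit bucket order) to match the ranking asked for in the question.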
