elasticsearch - Distinct count in Hive does not match cardinality count in Elasticsearch

Question

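I am using Elasticsearch with elasticsearch-hadoop from Elastic.

I need to get the count of unique account numbers. I wrote the following queries in HQL and in the query DSL, but they return different counts.

Hive query: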

select count(distinct account) from <tableName> where capacity="550";

// Returns --> 71132

Similarly, in Elasticsearch, the query looks like this:

{
    "query": {
        "bool": {
            "must": [
                {"match": {"capacity": "550"}}
            ]
        }
    },
    "aggs": {
        "unique_account": {
            "cardinality": {
                "field": "account"
            }
        }
    }
}

// Returns --> 71607

Am I doing something wrong? What can I do to make the two queries match?

Note: the number of records in Hive and Elasticsearch is exactly the same.

score 1 · Accepted Answer

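“The first approximate aggregation provided by Elasticsearch is the cardinality metric ...
As mentioned at the top of this chapter, the cardinality metric is an approximate algorithm. It is based on the HyperLogLog++ (HLL) algorithm.”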

https://www.elastic.co/guide/en/elasticsearch/guide/current/cardinality.html
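
To see why a HyperLogLog-style count is only approximate, here is a minimal sketch of the plain HyperLogLog estimator in Python. It is a simplification for illustration, not the HyperLogLog++ variant that Elasticsearch uses (no bias correction, no sparse representation), and the function name `hll_estimate`, the SHA-1 hash, and the register count are all assumptions made for this example.

```python
import hashlib
import math


def hll_estimate(values, p=14):
    """Estimate the number of distinct values with plain HyperLogLog.

    p is the number of index bits, so there are 2**p registers.
    """
    m = 1 << p
    registers = [0] * m
    for v in values:
        # 64-bit hash of the value (SHA-1 is an arbitrary choice here).
        h = int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:8], "big")
        idx = h >> (64 - p)                 # first p bits pick a register
        rest = h & ((1 << (64 - p)) - 1)    # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - p) - rest.bit_length() + 1
        registers[idx] = max(registers[idx], rank)

    alpha = 0.7213 / (1 + 1.079 / m)        # standard constant for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)

    # Small-range correction: fall back to linear counting.
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        return m * math.log(m / zeros)
    return raw


# 71132 distinct synthetic values, the same size as the Hive result.
estimate = hll_estimate(range(71132))
print(round(estimate))
```

The estimate lands close to the true count but is generally not equal to it, which is the same kind of gap the question shows between Hive and Elasticsearch.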

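To the OP

Precision threshold

“precision_threshold accepts a number from 0–40,000. Larger values are treated as equivalent to 40,000. ...
Although not guaranteed by the algorithm, if a cardinality is under the threshold, it is almost always 100% accurate. Cardinalities above this will begin to trade accuracy for memory savings, and a little error will creep into the metric.”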

https://www.elastic.co/guide/en/elasticsearch/guide/current/cardinality.html
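
As a rough sketch, the query from the question could be built with `precision_threshold` raised to the documented maximum like this. The helper name `build_cardinality_query` is hypothetical, and `"size": 0` is added here only to suppress the hits; neither comes from the original post.

```python
import json


def build_cardinality_query(capacity, field="account", precision_threshold=40000):
    """Build the search body from the question, with precision_threshold set."""
    return {
        "size": 0,  # assumption: only the aggregation result is needed
        "query": {
            "bool": {
                "must": [
                    {"match": {"capacity": capacity}}
                ]
            }
        },
        "aggs": {
            "unique_account": {
                "cardinality": {
                    "field": field,
                    "precision_threshold": precision_threshold,
                }
            }
        },
    }


body = build_cardinality_query("550")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a search request. With more than 40,000 distinct accounts the result is still an estimate, so this narrows the gap rather than guaranteeing a match.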

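You may also want to look at “Support for precise cardinality aggregation #15876”.

To the OP, 2

“I tried several numbers ...”

You have 71,132 distinct values, while the precision threshold is capped at 40,000. The cardinality therefore exceeds the threshold, which means accuracy is being traded for memory savings.
This is how the chosen implementation (based on the HyperLogLog++ algorithm) works.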
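
For a sense of scale, the gap between the two counts reported in the question can be worked out directly. The variable names are just for illustration.

```python
hive_count = 71132   # count(distinct account) from Hive
es_count = 71607     # cardinality aggregation from Elasticsearch

abs_err = es_count - hive_count
rel_err = abs_err / hive_count

print(f"absolute difference: {abs_err}")      # absolute difference: 475
print(f"relative difference: {rel_err:.2%}")  # relative difference: 0.67%
```

A deviation of well under one percent is consistent with an approximate estimate rather than with missing or duplicated records.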

score 0 · Accepted Answer

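Even with a precision_threshold of 40000, the cardinality aggregation does not guarantee an exact count. There is another way to get an exact distinct count for a field.

The article “Accurate Distinct Count and Values from Elasticsearch” explains the solution in detail and how its accuracy compares to cardinality.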
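
The article's own method is not reproduced here. As one possible way to get an exact count, the sketch below pages through every distinct value with a composite aggregation and counts the buckets, assuming an Elasticsearch version that provides that aggregation. The `search` argument is a hypothetical callable that takes a request body and returns the response as a dict, and `make_fake_search` is an in-memory stand-in so the example runs without a cluster.

```python
def exact_distinct_count(search, field="account", capacity="550", page_size=1000):
    """Count distinct values of `field` by paging a composite aggregation."""
    after = None
    total = 0
    while True:
        composite = {
            "size": page_size,
            "sources": [{"by_field": {"terms": {"field": field}}}],
        }
        if after is not None:
            composite["after"] = after  # resume after the last key seen
        body = {
            "size": 0,
            "query": {"bool": {"must": [{"match": {"capacity": capacity}}]}},
            "aggs": {"distinct": {"composite": composite}},
        }
        agg = search(body)["aggregations"]["distinct"]
        buckets = agg["buckets"]
        total += len(buckets)           # one bucket per distinct value
        after = agg.get("after_key")
        if not buckets or after is None:
            return total


def make_fake_search(values):
    """Hypothetical stand-in for a real client: serves sorted keys page by page."""
    keys = sorted(set(values))

    def search(body):
        comp = body["aggs"]["distinct"]["composite"]
        after = comp.get("after", {}).get("by_field")
        remaining = [k for k in keys if after is None or k > after]
        page = remaining[: comp["size"]]
        agg = {"buckets": [{"key": {"by_field": k}, "doc_count": 1} for k in page]}
        if page:
            agg["after_key"] = {"by_field": page[-1]}
        return {"aggregations": {"distinct": agg}}

    return search


fake = make_fake_search(["a1", "a2", "a2", "a3", "a4", "a5", "a5"])
print(exact_distinct_count(fake, page_size=2))  # 5
```

This trades speed and request count for exactness: it issues one request per page instead of a single aggregation, so it suits occasional verification more than interactive queries.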
