r - Is n_distinct an exact calculation for disk.frame?

Question

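I am running n_distinct on a large file (>30GB), and it does not seem to produce an exact result.

I have another reference point for the same data, and the output of the disk.frame aggregation is off.

The documentation says that n_distinct is an exact calculation, not an estimate.

Is that correct?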

score 1 · Accepted Answer

The implementation of n_distinct can be found on this page: https://github.com/xiaodaigh/disk.frame/blob/master/R/one-stage-verbs.R

#' @export
#' @rdname one-stage-group-by-verbs
n_distinct_df.chunk_agg.disk.frame <- function(x, na.rm = FALSE, ...) {
  if(na.rm) {
    setdiff(unique(x), NA)
  } else {
    unique(x)
  }
}

#' @export
#' @importFrom dplyr n_distinct
#' @rdname one-stage-group-by-verbs
n_distinct_df.collected_agg.disk.frame <- function(listx, ...) {
  n_distinct(unlist(listx))
}

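Now, this looks to me like an exact calculation. The logic is simple: it computes the unique values within each chunk, then computes n_distinct over the combined results from all chunks.

But I can't rule out a bug somewhere else.

Do you have a test case showing that it is not exactly right? Perhaps you could contribute a PR with a test?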
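To illustrate why the two-stage approach should match a direct count, here is a minimal base R sketch. It does not use disk.frame itself: `chunk_stage` and `collect_stage` are hypothetical stand-ins for the two functions above, and the chunks are simulated with `split` rather than read from disk.

```r
# Hypothetical stand-ins for the chunk-level and collected-level steps;
# these names are illustrative and are not part of disk.frame.
chunk_stage <- function(x, na.rm = FALSE) {
  u <- unique(x)
  if (na.rm) u[!is.na(u)] else u
}

collect_stage <- function(listx) {
  # Count distinct values across the per-chunk unique vectors
  length(unique(unlist(listx)))
}

set.seed(1)
x <- sample(c(1:50, NA), 1000, replace = TRUE)

# Simulate 4 chunks by dealing the values out round-robin
chunks <- split(x, rep(1:4, length.out = length(x)))

# NA counted as a distinct value
two_stage <- collect_stage(lapply(chunks, chunk_stage))
direct    <- length(unique(x))

# NA removed at the chunk stage
two_stage_narm <- collect_stage(lapply(chunks, chunk_stage, na.rm = TRUE))
direct_narm    <- length(unique(x[!is.na(x)]))

stopifnot(two_stage == direct, two_stage_narm == direct_narm)
```

If a check like this passes on a sample of the real data but the disk.frame result still differs from the reference, the discrepancy is more likely to come from somewhere other than this logic, for example how the column is read or typed in each chunk.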
