r - pmin and pmax each take na.rm, so why is there no psum?

Question

R seems to be missing an obvious, simple function: psum. Does it exist under a different name, or in a package somewhere?

x = c(1,3,NA,5)
y = c(2,NA,4,1)

min(x,y,na.rm=TRUE)    # ok
[1] 1
max(x,y,na.rm=TRUE)    # ok
[1] 5
sum(x,y,na.rm=TRUE)    # ok
[1] 16

pmin(x,y,na.rm=TRUE)   # ok
[1] 1 3 4 1
pmax(x,y,na.rm=TRUE)   # ok
[1] 2 3 4 5
psum(x,y,na.rm=TRUE)
[1] 3 3 4 6                             # expected result
Error: could not find function "psum"   # actual result

I realise that + already behaves much like psum, but what about NA?

x+y                      
[1]  3 NA NA  6        # can't supply `na.rm=TRUE` to `+`

Is there a case for adding psum? Or have I missed something?

This question is a follow-up to this one:
Using := in data.table to sum the values of two columns in R, ignoring NAs

score 21 · Accepted Answer

Following @JoshUlrich's comment on the previous question,

psum <- function(...,na.rm=FALSE) { 
    rowSums(do.call(cbind,list(...)),na.rm=na.rm) }

Edit: from Sven Hohenstein:

psum2 <- function(...,na.rm=FALSE) { 
    dat <- do.call(cbind,list(...))
    res <- rowSums(dat, na.rm=na.rm) 
    idx_na <- !rowSums(!is.na(dat))
    res[idx_na] <- NA
    res 
}

x = c(1,3,NA,5,NA)
y = c(2,NA,4,1,NA)
z = c(1,2,3,4,NA)

psum(x,y,na.rm=TRUE)
## [1] 3 3 4 6 0
psum2(x,y,na.rm=TRUE)
## [1] 3 3 4 6 NA
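
To see why psum2 returns NA in the last position while psum returns 0, it may help to look at the intermediate values. This is only an illustrative walk-through of the same steps, reusing the x and y above; the name n_present is introduced here and is not part of the original function.

```r
x <- c(1, 3, NA, 5, NA)
y <- c(2, NA, 4, 1, NA)

dat <- cbind(x, y)                  # one column per input vector
n_present <- rowSums(!is.na(dat))   # number of non-NA values in each row
n_present
## [1] 2 1 1 2 0

idx_na <- !n_present                # TRUE only where that count is 0
idx_na
## [1] FALSE FALSE FALSE FALSE  TRUE

res <- rowSums(dat, na.rm = TRUE)   # the all-NA row sums to 0 at this point
res[idx_na] <- NA                   # put NA back for the all-NA row
res
## [1]  3  3  4  6 NA
```

The second rowSums call is the extra pass over the data that psum2 pays for compared with psum.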

n = 1e7
x = sample(c(1:10,NA),n,replace=TRUE)
y = sample(c(1:10,NA),n,replace=TRUE)
z = sample(c(1:10,NA),n,replace=TRUE)

library(rbenchmark)
benchmark(psum(x,y,z,na.rm=TRUE),
          psum2(x,y,z,na.rm=TRUE),
          pmin(x,y,z,na.rm=TRUE), 
          pmax(x,y,z,na.rm=TRUE), replications=20)

##                          test replications elapsed relative 
## 4  pmax(x, y, z, na.rm = TRUE)           20  26.114    1.019 
## 3  pmin(x, y, z, na.rm = TRUE)           20  25.632    1.000 
## 2 psum2(x, y, z, na.rm = TRUE)           20 164.476    6.417
## 1  psum(x, y, z, na.rm = TRUE)           20  63.719    2.486

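Sven's version (which is arguably the correct one) is quite a bit slower, although whether that matters obviously depends on the application. Does anyone want to hack up an inline/Rcpp version?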
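
This is not the Rcpp version asked for above, but a pure-R sketch of the same idea: accumulate the sum and the count of non-NA values one argument at a time, without building the cbind matrix. The name psum3 is hypothetical, the inputs are assumed to have equal length (or identical dimensions), and it has not been benchmarked against the versions above.

```r
psum3 <- function(..., na.rm = FALSE) {
  args <- list(...)
  if (!na.rm) return(Reduce(`+`, args))  # plain element-wise sum, NA propagates

  total  <- 0   # running element-wise sum
  n_seen <- 0   # running count of non-NA values per position
  for (a in args) {
    ok <- !is.na(a)
    a[!ok] <- 0              # treat NA as 0 for the sum
    total  <- total + a
    n_seen <- n_seen + ok
  }
  total[n_seen == 0] <- NA   # positions that were NA in every input stay NA
  total
}

x <- c(1, 3, NA, 5, NA)
y <- c(2, NA, 4, 1, NA)

psum3(x, y, na.rm = TRUE)
## [1]  3  3  4  6 NA
psum3(x, y)
## [1]  3 NA NA  6 NA
```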

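As for why it doesn't exist: I don't know, but good luck getting R-core to make additions like this ... I can't offhand think of a sufficiently widespread *misc package into which this could go ...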

Matthew's follow-up thread on r-devel is here (which seems to confirm this):
r-devel: There is pmin and pmax each taking na.rm, how about psum?

score 7 · Accepted Answer

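After a quick search on CRAN, there are at least 3 packages that have a psum function: rccmisc, incadata and kit. kit seems to be the fastest. Ben Bolker's example is reproduced below.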

benchmark(
  rccmisc::psum(x,y,z,na.rm=TRUE),
  incadata::psum(x,y,z,na.rm=TRUE),
  kit::psum(x,y,z,na.rm=TRUE), 
  psum(x,y,z,na.rm=TRUE),
  psum2(x,y,z,na.rm=TRUE),
  replications=20
)
#                                    test replications elapsed relative
# 2 incadata::psum(x, y, z, na.rm = TRUE)           20   20.05   14.220
# 3      kit::psum(x, y, z, na.rm = TRUE)           20    1.41    1.000
# 4           psum(x, y, z, na.rm = TRUE)           20    8.04    5.702
# 5          psum2(x, y, z, na.rm = TRUE)           20   20.44   14.496
# 1  rccmisc::psum(x, y, z, na.rm = TRUE)           20   23.24   16.482
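
Before switching to a package version, it may be worth checking on a small input that it agrees with the base-R definition. The sketch below assumes the kit package is installed and skips the comparison otherwise; base_psum is a hypothetical name for the rowSums-based function from the first answer.

```r
x <- c(1, 3, NA, 5)
y <- c(2, NA, 4, 1)

# rowSums-based reference, as in the first answer
base_psum <- function(..., na.rm = FALSE) {
  rowSums(do.call(cbind, list(...)), na.rm = na.rm)
}
expected <- base_psum(x, y, na.rm = TRUE)

if (requireNamespace("kit", quietly = TRUE)) {
  got <- kit::psum(x, y, na.rm = TRUE)
  # all.equal() returns TRUE or a description of the differences
  print(all.equal(as.numeric(got), expected))
} else {
  message("kit is not installed; skipping the comparison")
}
```

Every position in this input has at least one non-NA value, so the check deliberately says nothing about how each implementation treats positions that are NA in all inputs.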

score 1 · Accepted Answer

Another approach, which has the advantage of also working with matrices, just like pmin and pmax.

psum <- function(..., na.rm = FALSE) {
  plus_na_rm <- function(x, y) ifelse(is.na(x), 0, x) + ifelse(is.na(y), 0, y)
  Reduce(if(na.rm) plus_na_rm else `+`, list(...))
}

x = c(1,3,NA,5)
y = c(2,NA,4,1)

psum(x, y)
#> [1]  3 NA NA  6
psum(x, y, na.rm = TRUE)
#> [1] 3 3 4 6

# With matrices
A <- matrix(1:9, nrow = 3)
B <- matrix(c(NA, 2:8, NA), nrow = 3)

psum(A, B)
#>      [,1] [,2] [,3]
#> [1,]   NA    8   14
#> [2,]    4   10   16
#> [3,]    6   12   NA
psum(A, B, na.rm = TRUE)
#>      [,1] [,2] [,3]
#> [1,]    1    8   14
#> [2,]    4   10   16
#> [3,]    6   12    9

Created on 2020-03-09 by the reprex package (v0.3.0)

One caveat: if an element is NA in all of the objects being summed and na.rm = TRUE, the result will be 0 (instead of NA).

For example:

psum(NA, NA, na.rm = TRUE)
#> [1] 0
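
If NA is preferred in that case, one possible adjustment is to track which positions are NA in every input and reset them afterwards. This is only a sketch of that idea; psum_na is a hypothetical name, and the inputs are assumed to have matching lengths or dimensions.

```r
psum_na <- function(..., na.rm = FALSE) {
  args <- list(...)
  plus_na_rm <- function(x, y) ifelse(is.na(x), 0, x) + ifelse(is.na(y), 0, y)
  res <- Reduce(if (na.rm) plus_na_rm else `+`, args)
  if (na.rm) {
    # TRUE where every input is NA at that position
    all_na <- Reduce(`&`, lapply(args, is.na))
    res[all_na] <- NA
  }
  res
}

psum_na(NA, NA, na.rm = TRUE)
#> [1] NA
psum_na(c(1, 3, NA, 5, NA), c(2, NA, 4, 1, NA), na.rm = TRUE)
#> [1]  3  3  4  6 NA
```

Because the reset uses a logical index with the same shape as the result, matrix inputs keep their dimensions.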
