r - 加速函数：在计算平均值之前检查 NA 计数

Question

下面的函数计算向量的平均值。NA但是，它首先检查向量中 ' 的比例，如果高于给定阈值，则返回 NA 而不是均值。

我的问题是我目前的实现效率很低。比简单地运行要长 7 倍以上mean(vec, na.rm=TRUE)

我尝试了另一种使用的方法na.omit，但速度更慢。

考虑到我的数据量，执行单曲lapply需要 40 多分钟。

关于如何更快地完成相同任务的任何建议？

更新 - 回复：@thelatemail 的解决方案和@Arun 的评论：

我正在数百组中执行此功能，每组的大小各不相同。这个问题中提供的样本数据（最初）是作为一个整洁的数据框提供的，只是为了便于创建人工数据。

替代样本数据以避免混淆

# Sample Data 
# ------------
  set.seed(1)
  # slightly different sizes for each group
  N1 <- 5e3
  N2 <- N1 + as.integer(rnorm(1, 0, 100))

  # One group has only a moderate amount of NA's
  SAMP1 <- rnorm(N1)
  SAMP1[sample(N1, .25 * N1, FALSE)] <- NA  # add in NA's

  # Another group has many NA's
  SAMP2 <- rnorm(N2)
  SAMP2[sample(N2, .95 * N2, FALSE)] <- NA  # add in large number of NA's

  # put them all in a list
  SAMP.NEW <- list(SAMP1, SAMP2)

  # keep it clean
  rm(SAMP1, SAMP2)

# Execute 
# -------    
  lapply(SAMP.NEW, meanIfThresh)

原始样本数据、功能等

# Sample Data 
# ------------
  set.seed(1)
  rows <- 20000  # actual data has more than 7M rows
  cols <-  1000  

  SAMP <- replicate(cols, rnorm(rows))
  SAMP[sample(length(SAMP), .25 * length(SAMP), FALSE)] <- NA  # add in NA's

  # Select 5 random rows, and have them be 90% NA
  tooSparse <- sample(rows, 5)
  for (r in tooSparse)
    SAMP[r, sample(cols, cols * .9, FALSE)] <- NA

# Function 
# ------------
    meanIfThresh <- function(vec, thresh=12/15) { 
      # Calculates the mean of vec, however, 
      #   if the number of non-NA values of vec is less than thresh, returns NA 

      # thresh : represents how much data must be PRSENT. 
      #          ie, if thresh is 80%, then there must be at least 


      len <- length(vec)

      if( (sum(is.na(vec)) / len) > thresh)
        return(NA_real_)
      # if the proportion of NA's is greater than the threshold, return NA
      # example:  if I'm looking at 14 days, and I have 12 NA's,
      #            my proportion is 85.7 % = (12 / 14)
      #            default thesh is  80.0 % = (12 / 15)
      #            Thus, 12 NAs in a group of 14 would be rejected


    # else, calculate the mean, removing NA's
    return(mean(vec, na.rm=TRUE))       
  }


  # Execute
  # -----------------
  apply(SAMP, 1, meanIfThresh)

  # Compare with `mean`
  #----------------
  plain    <- apply(SAMP, 1, mean, na.rm=TRUE)
  modified <- apply(SAMP, 1, meanIfThresh)

  # obviously different
  identical(plain, modified)
  plain[tooSparse]
  modified[tooSparse]


  microbenchmark( "meanIfThresh"   = apply(SAMP, 1, meanIfThresh)
                , "mean (regular)" = apply(SAMP, 1, mean, na.rm=TRUE)
                , times = 15L)

 #  With the actual data, the penalty is sevenfold
 #  Unit: seconds
 #           expr      min       lq   median       uq      max neval
 #   meanIfThresh 1.658600 1.677472 1.690460 1.751913 2.110871    15
 # mean (regular) 1.422478 1.485320 1.503468 1.532175 1.547450    15

score 5 · Accepted Answer

难道你不能像这样替换高 NA 行的平均值吗？：

# changed `result <- apply(SAMP,1,mean,na.rm=TRUE)`
result <- rowMeans(SAMP, na.rm=TRUE)
NArows <- rowSums(is.na(SAMP))/ncol(SAMP) > 0.8
result[NArows] <- NA

一些基准测试：

Ricardo <- function(vec, thresh=12/15) {
    len <- length(vec)
    if( (sum(is.na(vec)) / len) > thresh)
        return(NA_real_)
    return(mean(vec, na.rm=TRUE))       
}

DanielFischer <- function(vec, thresh=12/15) {

    len <- length(vec)
    nas <- is.na(vec)
    Nna <- sum(nas)
    if( (Nna / len) > thresh)
        return(NA_real_)
    return(sum(vec[!nas])/(len-Nna))
}

thelatemail <- function(mat) {
    result <- rowMeans(mat, na.rm=TRUE)
    NArows <- rowSums(is.na(mat))/ncol(mat) > 0.8
    result[NArows] <- NA
    result
}

require(microbenchmark)
microbenchmark(m1 <- apply(SAMP, 1, Ricardo), 
               m2 <- apply(SAMP, 1, DanielFischer), 
               m3 <- thelatemail(SAMP), times = 5L)

Unit: milliseconds
                                expr       min        lq    median        uq       max neval
       m1 <- apply(SAMP, 1, Ricardo) 2923.7260 2944.2599 3066.8204 3090.8127 3105.4283     5
 m2 <- apply(SAMP, 1, DanielFischer) 2643.4883 2683.1034 2755.7032 2799.5155 3089.6015     5
                m3 <- latemail(SAMP)  337.1862  340.6339  371.6148  376.5517  383.4436     5

all.equal(m1, m2) # TRUE
all.equal(m1, m3) # TRUE

score 1 · Accepted Answer

是不是你必须在你的函数中通过你的向量两次vec？如果你可以存储你的NA第一个，也许它可以加快你的计算速度：

meanIfThresh2 <- function(vec, thresh=12/15) { 

  len <- length(vec)
  nas <- is.na(vec)
  Nna <- sum(nas)
  if( (Nna / len) > thresh)
    return(NA_real_)

  return(sum(vec[!nas])/(len-Nna))
}

编辑：我执行了类似的基准测试，以查看对此更改的影响：

> microbenchmark(  "meanIfThresh"   = apply(SAMP, 1, meanIfThresh)
+                 , "meanIfThresh2"   = apply(SAMP, 1, meanIfThresh2)
+                 , "mean (regular)" = apply(SAMP, 1, mean, na.rm=TRUE)
+                 , times = 15L)
Unit: seconds
           expr      min       lq   median       uq      max neval
   meanIfThresh 2.009858 2.156104 2.158372 2.166092 2.192493    15
  meanIfThresh2 1.825470 1.828273 1.829424 1.834407 1.872028    15
 mean (regular) 1.868568 1.882526 1.889852 1.893564 1.907495    15

r - 加速函数：在计算平均值之前检查 NA 计数

更新 - 回复：@thelatemail 的解决方案和@Arun 的评论：

替代样本数据以避免混淆

原始样本数据、功能等

2 回答 2

Related

Reference