r - Apply a function to groups within groups (tapply inside ddply)

Question

My dataset looks like this:

d = data.frame(year=rep(2000:2002,each=40),month=rep(c(rep(1:12,3),5,6,7,8),3),species=rep(c(rep(letters[1:12],3),"a","b","g","l"),3),species_group=NA,kg=round(rnorm(120,15,6),digits=2))
d$species_group=ifelse(d$species %in% letters[1:5],"A","B")

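For each year and each species group (so not at the level of month and species), I want the mean weight and the number of species included. This works with ddply. However, I would also like to include a value for the "quality" of my data, that is, whether the number of species per month is balanced or whether, for example, more species are included in the summer months. So I thought I could simply compute the yearly standard deviation of the number of unique species per month. I tried to do this with tapply inside ddply, as follows: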

s=ddply(d,c("year","species_group"),function(x) cbind(n_species=length(unique(x$species)),
                                                   quality=tapply(x,x$month,sd(length(unique(x$species)))),
                                                   kg=sum(x$kg,na.rm=T)))

But this gives me an error:

Error in match.fun(FUN) : 'sd(length(unique(x$species)))' is not a function, character or symbol

What I would like to obtain is something like this:

output=data.frame(year=rep(2000:2002,each=2),species_group=rep(c("A","B"),3),n_species=rep(c(7,9),3),quality=round(rnorm(6,2,0.3),digits=2),kg=round(rnorm(6,15,6),digits=2))

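I cannot first use ddply by month, year and species group, because then I would no longer know the number of unique species per year. I suppose I could also compute n_species and quality separately and combine them afterwards, but that would be a cumbersome approach.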

How can I make my function work, or how can I do this more correctly?

Answers:

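The simplest solution came from shadow, who spotted my mistake in the use of tapply: its third argument must be a function, not the result of a call. In addition, the standard error is probably more appropriate than the standard deviation, which gives the following: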

 s=ddply(d,c("year","species_group"),function(x) cbind(n_species=length(unique(x$species)),
                                                   quality=sd(tapply(x$species,x$month, function(y) length(unique(y))))/sqrt(length(tapply(x$species,x$month, function(y) length(unique(y))))),
                                                   kg=sum(x$kg,na.rm=T)))
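
The expression above computes the per-month counts twice. As a minimal sketch, the same standard error can be wrapped in a small helper that computes them once. The name `se_species_per_month` and the toy vectors are assumptions made for illustration; they are not part of the original answer.

```r
# Hypothetical helper (not from the original answer): standard error of the
# number of unique species per month.
se_species_per_month <- function(species, month) {
  # tapply needs a *function* as its third argument; it is called once per month
  n_per_month <- tapply(species, month, function(y) length(unique(y)))
  sd(n_per_month) / sqrt(length(n_per_month))
}

# Toy data: month 1 has 2 species, month 2 has 1, month 3 has 3
species <- c("a", "b", "a", "a", "b", "c")
month   <- c(1, 1, 2, 3, 3, 3)
se_species_per_month(species, month)
```

Inside the ddply call, this could then be used as `quality=se_species_per_month(x$species,x$month)`.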

score 1 · Accepted Answer

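It is not clear how you define your quality criterion, so here is how I would do it. First, I define the quality criterion in a separate function. Note that this function should return a single value rather than a vector (in your attempt, tapply returns a vector with one element per month).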

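## Returns the mean, across months, of the per-month sd of the species codes.
## Assumes `species` is a factor (the data.frame() default before R 4.0.0);
## as.integer() on a character vector would give NA.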
get_quality <- 
  function(species,month)
     mean(tapply(species,month,
              FUN=function(s)sd(as.integer(s))),
     na.rm=TRUE)
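
To see why the result has to be collapsed with `mean(..., na.rm=TRUE)`, here is a small sketch on toy data (the vectors are made up for illustration, and `species` is built explicitly as a factor, which this function relies on). `tapply` returns one value per month, and a month with a single row yields `NA` because the standard deviation of one value is undefined.

```r
# Toy data: month 5 has species e, e, e, a; month 6 has a single row
species <- factor(c("e", "e", "e", "a", "b"), levels = letters[1:12])
month   <- c(5, 5, 5, 5, 6)

# One sd per month, computed on the integer codes of the factor levels
per_month <- tapply(species, month, function(s) sd(as.integer(s)))
per_month

# Collapse to a single value, dropping months where sd is undefined
quality <- mean(per_month, na.rm = TRUE)
quality
```

With data created in R 4.0.0 or later, where `data.frame()` no longer converts strings to factors by default, the `species` column would need to be converted with `factor()` first for `as.integer()` to return level codes.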

Then I call it inside ddply. To simplify the code, I also create a function that is applied to each group.

ff <- 
function(x) {
    cbind(n_species=length(unique(x$species)),
        quality= get_quality(x$species,x$month),
        kg=sum(x$kg,na.rm=TRUE))
}
library(plyr)

s=ddply(d,.(year,species_group),ff)


  year species_group n_species   quality     kg
1 2000             A         5 0.4000000 259.68
2 2000             B         7 0.2857143 318.24
3 2001             A         5 0.4000000 285.07
4 2001             B         7 0.2857143 351.54
5 2002             A         5 0.4000000 272.46
6 2002             B         7 0.2857143 331.45
