r - Subsetting data in a loop and writing the results to a list

Question

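I have a data frame with five variables. Two of them are metric measurements and the other three contain groups stored as factors. I am trying to subset this data frame three times in a loop, once per grouping variable, and calculate the mean of each metric measurement for every group. The results can be stored as new data frames in a new list. Right now I am using subset and ldply from the plyr package. A single subset works without problems, but when I try to store the results of the loop in a vector, I get a warning message saying "number of items to replace is not a multiple of replacement length". Sample code can be found below. Any help would be much appreciated!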

library(plyr)  # provides llply() and ldply()

df<-data.frame(a=c(1:5),b=c(21:25),group1=c("a","b","a","a","b"),group2=c("b","a","c","b","c"),group3=c("a","b","c","d","c"))

# single subset
llply(subset(df,group1=="a")[1:2],mean)

# subset for all groups
# create grouplist
grouplist<-colnames(df[3:5])
# create vector to store results
output.vector<-vector()

# create loop
for (i in grouplist)output.vector[i]<-ldply(subset(df,grouplist=="a")[1:2],mean)

output.vector

Warning messages:
1: In output.vector[i] <- ldply(subset(df, grouplist == "a")[1:2],  :
  number of items to replace is not a multiple of replacement length

So the output for each item in the list should look like this:

output.vector$group1
      a      b
a  2.67    3.5
b  22.7   23.5

output.vector$group2
     a      b     c
a    2    2.5     4
b   22   22.5    24

output.vector$group3
     a     b     c     d
a    1     2     4     4
b   21    22    24    24

score 3 · Accepted Answer

Another option, from base R, uses by and colMeans and loops over the group columns:

 id.group <- grepl('group',colnames(df))
 lapply(df[,id.group],
       function(x){
         res <- by(df[,!id.group],x,colMeans)
         do.call(rbind,res)
       })
$group1
         a        b
a 2.666667 22.66667
b 3.500000 23.50000

$group2
    a    b
a 2.0 22.0
b 2.5 22.5
c 4.0 24.0

$group3
  a  b
a 1 21
b 2 22
c 4 24
d 4 24
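
For comparison with the loop in the question, the same by/colMeans idea can also be written as an explicit for loop. The sketch below is only an illustration, not part of the original answer; the names output.list and g are made up here. It assumes the two likely causes of the warning are that a whole table of means was assigned into a single vector slot with [ ], and that grouplist=="a" compares the column names rather than the column values.

```r
# Same data as in the question.
df <- data.frame(a = 1:5, b = 21:25,
                 group1 = c("a", "b", "a", "a", "b"),
                 group2 = c("b", "a", "c", "b", "c"),
                 group3 = c("a", "b", "c", "d", "c"))

grouplist <- colnames(df)[3:5]

# Use a list, not an atomic vector: each element holds a whole matrix of means.
output.list <- vector("list", length(grouplist))
names(output.list) <- grouplist

for (g in grouplist) {
  # df[[g]] picks the grouping column by name; by() splits the two
  # measurement columns on it and colMeans() averages each piece.
  res <- by(df[, c("a", "b")], df[[g]], colMeans)
  # [[ ]] assigns exactly one list element, whatever the size of the result.
  output.list[[g]] <- do.call(rbind, res)
}

output.list$group1
```

Each element can then be reached by name, for example output.list$group3 or output.list[["group3"]].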

Edit: adding some benchmarks

library(microbenchmark)
library(plyr)      # daply(), used by dr()
library(reshape2)  # melt(), used by an()
microbenchmark(ag(),dr(),an())

Unit: milliseconds
 expr       min        lq    median        uq      max neval
 ag()  4.717987  4.936251  5.072595  5.394017 27.13639   100
 dr() 14.676580 15.244331 15.689392 16.252781 43.76198   100
 an() 14.691750 15.159945 15.625107 16.312705 46.01326   100

It looks like the agstudy solution is the winner, about 3 times faster than the other 2 solutions!

The functions used here:

ag <- function(){
id.group <- grepl('group',colnames(df))
lapply(df[,id.group],
       function(x){
         res <- by(df[,!id.group],x,colMeans)
         do.call(rbind,res)
       })
}
dr <- function(){

grouplist<-colnames(df[3:5])
lapply(grouplist, function(n) 
  daply(df, n, function(d) colMeans(d[, 1:2])))
}


an <- function(){
temp <- melt(df, id.vars=1:2)
setNames(
  lapply(unique(temp$variable), function(x) {
    aggregate(. ~ value, temp[temp$variable == x, c(1, 2, 4)], mean)
  }), unique(temp$variable))
}

score 2 · Accepted Answer

One approach is to first convert your data to long format, and then use lapply and aggregate.

Here is the data in long format.

library(reshape2)
temp <- melt(df, id.vars=1:2)
temp
#    a  b variable value
# 1  1 21   group1     a
# 2  2 22   group1     b
# 3  3 23   group1     a
# 4  4 24   group1     a
# 5  5 25   group1     b
# 6  1 21   group2     b
# 7  2 22   group2     a
# 8  3 23   group2     c
# 9  4 24   group2     b
# 10 5 25   group2     c
# 11 1 21   group3     a
# 12 2 22   group3     b
# 13 3 23   group3     c
# 14 4 24   group3     d
# 15 5 25   group3     c

Here is the calculation. I believe all the results you are interested in are there.

setNames(
  lapply(unique(temp$variable), function(x) {
    aggregate(. ~ value, temp[temp$variable == x, c(1, 2, 4)], mean)
  }), unique(temp$variable))
# $group1
#   value        a        b
# 1     a 2.666667 22.66667
# 2     b 3.500000 23.50000
# 
# $group2
#   value   a    b
# 1     a 2.0 22.0
# 2     b 2.5 22.5
# 3     c 4.0 24.0
# 
# $group3
#   value a  b
# 1     a 1 21
# 2     b 2 22
# 3     c 4 24
# 4     d 4 24
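
If reshape2 is not available, a similar result can probably be obtained without melting first, by calling aggregate once per grouping column. This is a minimal base R sketch, not part of the original answer; the name means.by.group is made up for illustration. One difference to note: the first column of each result is named after the grouping variable instead of "value".

```r
# Same data as in the question.
df <- data.frame(a = 1:5, b = 21:25,
                 group1 = c("a", "b", "a", "a", "b"),
                 group2 = c("b", "a", "c", "b", "c"),
                 group3 = c("a", "b", "c", "d", "c"))

grouplist <- colnames(df)[3:5]

# sapply(..., simplify = FALSE) returns a list named after grouplist.
means.by.group <- sapply(grouplist, function(g) {
  # Average a and b within each level of grouping column g.
  # df[g] is a one-column data frame, which aggregate() accepts as `by`.
  aggregate(df[c("a", "b")], by = df[g], FUN = mean)
}, simplify = FALSE)

means.by.group$group2
```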

score 2 · Accepted Answer

This can be done using a combination of lapply and daply from the plyr package:

library(plyr)  # provides daply()

grouplist<-colnames(df[3:5])
lapply(grouplist, function(n) daply(df, n, function(d) colMeans(d[, 1:2])))

# [[1]]
#       
# group1        a        b
#      a 2.666667 22.66667
#      b 3.500000 23.50000
# 
# [[2]]
#       
# group2   a    b
#      a 2.0 22.0
#      b 2.5 22.5
#      c 4.0 24.0
# 
# [[3]]
#       
# group3 a  b
#      a 1 21
#      b 2 22
#      c 4 24
#      d 4 24
