performance - Most efficient way to replace lowest list values in dataframe in R

Question

I have a dataframe, df, with a list/vector of numbers recorded for each subject for two repetitions of a test item.

subj item rep vec
s1 1 1 [2,1,4,5,8,4,7]
s1 1 2 [1,1,3,4,7,5,3]
s1 2 1 [6,5,4,1,2,5,5]
s1 2 2 [4,4,4,0,1,4,3]
s2 1 1 [4,6,8,7,7,5,8]
s2 1 2 [2,5,4,5,8,1,4]
s2 2 1 [9,3,2,6,6,8,5]
s2 2 2 [7,1,2,3,2,7,3]

For each item, I want to find 50% of the mean of rep 1 and then replace the lowest numbers in the rep 2 vector with 0, until the mean of rep 2 is less than or equal to that scaled-down mean of rep 1. For example, for s1 item 1:

mean(c(2,1,4,5,8,4,7))*0.5 = 2.2 #rep1 scaled down
mean(c(1,1,3,4,7,5,3)) = 3.4 #rep2
mean(c(0,0,0,0,7,5,0)) = 1.7 #new rep2 such that mean(rep2) <= mean(rep1)*0.5

After removing the lowest numbers in rep 2 vector, I want to correlate the rep1 and rep2 vectors and perform some other minor arithmetic functions and append the results to another (length initialized) dataframe. For now, I'm doing this with loops similar to this pseudo code:

for subj in subjs:
  for item in items:
     while mean(rep2) > mean(rep1)*0.5:
       rep2 = replace(lowest(rep2),0)
     newDataFrame[i] = correl(rep1,rep2)

Doing this with loops seems really inefficient; in R, is there a more efficient way to find and replace the lowest values in a list/vector until the means are less than or equal to a value that depends on that specific item? And what's the best way to append correlations and other results to other dataframes?

Additional info:

> dput(df)
structure(list(subj = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 
 2L), .Label = c("s1", "s2"), class = "factor"), item = c(1L, 
 1L, 2L, 2L, 1L, 1L, 2L, 2L), rep = c(1L, 2L, 1L, 2L, 1L, 2L, 
 1L, 2L), vec = list(c(2, 1, 4, 5, 8, 4, 7), c(1, 1, 3, 4, 7, 
 5, 3), c(6, 5, 4, 1, 2, 5, 5), c(4, 4, 4, 0, 1, 4, 3), c(4, 6, 
 8, 7, 7, 5, 8), c(2, 5, 4, 5, 8, 1, 4), c(9, 3, 2, 6, 6, 8, 5
 ), c(7, 1, 2, 3, 2, 7, 3))), .Names = c("subj", "item", "rep", 
 "vec"), row.names = c(NA, -8L), class = "data.frame")

I want this dataframe as the output (with rep1 vs. rep2 correlation and rep1 vs new rep2 correlation).

subj item origCorrel newCorrel
s1 1 .80 .51
s1 2 .93 .34
s2 1 .56 .40
s2 2 .86 .79

score 1 · Accepted Answer

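A typical strategy for getting rid of loops is to put all the calculations on the subsetted data into their own function, then call that function from an aggregate or apply function.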

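two.cors <- function(x, ratio = .5) {
  # x: the two rows for one subj/item pair, rep 1 first and rep 2 second
  rep1 <- unlist(x[1, ][["vec"]])
  rep2 <- unlist(x[2, ][["vec"]])
  orig.cor <- cor(rep1, rep2)
  # zero the lowest non-zero value, one element at a time (first on ties);
  # zeroing all tied values at once can overshoot, e.g. for s1 item 2
  while (mean(rep2) > mean(rep1) * ratio) {
    rep2[which(rep2 == min(rep2[rep2 != 0]))[1]] <- 0
  }
  c(orig.cor = orig.cor, weird.cor = cor(rep1, rep2))
}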
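
If the while loop itself becomes the bottleneck, the zeroing step can probably be written without iteration. The following is only a sketch: zero_lowest is a hypothetical helper name, and it assumes non-negative values and a non-negative target, as in the question's data. It zeroes values one at a time from the lowest upward (ties in their original order), which is the reading that reproduces the question's worked example.

```r
# Hypothetical helper (not from the original answer): zero the k lowest
# values of x, with k the smallest count that brings mean(x) down to target.
# Assumes x >= 0 and target >= 0.
zero_lowest <- function(x, target) {
  if (mean(x) <= target) return(x)
  ord <- order(x)                 # indices from lowest to highest value; ties stay in order
  s <- x[ord]
  # remaining[k] = sum of what is left after zeroing the k lowest values
  remaining <- c(rev(cumsum(rev(s)))[-1], 0)
  k <- which(remaining / length(x) <= target)[1]
  x[ord[seq_len(k)]] <- 0
  x
}

rep1 <- c(2, 1, 4, 5, 8, 4, 7)
rep2 <- c(1, 1, 3, 4, 7, 5, 3)
zero_lowest(rep2, mean(rep1) * 0.5)
# 0 0 0 0 7 5 0
```

Whether this is actually faster than the loop for vectors of length 7 is untested; it mainly matters for long vectors.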

I like to use daply, so load plyr; you could also use aggregate or the base *apply functions.

library(plyr)

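Then call the function on your data set (the call below passes ratio=.4; use ratio=.5, the default, for the 50% threshold in the question):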

 daply(df,c("subj","item"), .fun=function(x) two.cors(x,ratio=.4) )

This output can be reformatted, but I'll leave that to you, since I think you'll need additional statistics from the two.cors function.
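
For readers who want the data frame layout from the question without plyr, here is one possible base R sketch. The helper name cors_for_pair is hypothetical, the data are rebuilt from the question's dput output, and the zeroing is done one element at a time with ratio = 0.5, which is an assumption about what the asker intended.

```r
# Data from the question's dput() output, rebuilt by hand
df <- data.frame(
  subj = rep(c("s1", "s2"), each = 4),
  item = rep(c(1L, 1L, 2L, 2L), times = 2),
  rep  = rep(1:2, times = 4)
)
df$vec <- list(
  c(2, 1, 4, 5, 8, 4, 7), c(1, 1, 3, 4, 7, 5, 3),
  c(6, 5, 4, 1, 2, 5, 5), c(4, 4, 4, 0, 1, 4, 3),
  c(4, 6, 8, 7, 7, 5, 8), c(2, 5, 4, 5, 8, 1, 4),
  c(9, 3, 2, 6, 6, 8, 5), c(7, 1, 2, 3, 2, 7, 3)
)

# Hypothetical helper: one-row data frame of both correlations for a subj/item pair
cors_for_pair <- function(x, ratio = 0.5) {
  rep1 <- x$vec[[which(x$rep == 1)]]
  rep2 <- x$vec[[which(x$rep == 2)]]
  orig <- cor(rep1, rep2)
  target <- mean(rep1) * ratio
  # zero the lowest non-zero value, one element at a time (first on ties)
  while (mean(rep2) > target) {
    rep2[which(rep2 == min(rep2[rep2 != 0]))[1]] <- 0
  }
  data.frame(subj = x$subj[1], item = x$item[1],
             origCorrel = orig, newCorrel = cor(rep1, rep2))
}

# Split into subj/item groups, compute per group, and stack the rows
groups <- split(df, list(df$subj, df$item), drop = TRUE)
result <- do.call(rbind, lapply(groups, cors_for_pair))
result <- result[order(result$subj, result$item), ]
rownames(result) <- NULL
result
```

Rounded to two decimals, the four rows should agree with the table in the question. Any extra statistics can be added as further columns in the data.frame() call inside the helper.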
