5

I have a data frame made by row binding many data frames, each identified with a unique key. I wish to calculate the correlation coefficients for columns in each subset (using the unique key) of the big data frame. For example, using the mtcars data I might want to calculate the correlation between columns hp and wt for each unique value in column cyl. I could do it in a loop

data("mtcars")
for(i in c(4,6,8)){
temp = subset(mtcars,mtcars$cyl==i)
cor(temp$hp,temp$wt)
}

I think aggregate would be better, but this code doesn't work:

data("mtcars")
aggregate(mtcars,by=mycars$cyl,cor)
4

2 回答 2

9

You could use

 data("mtcars")
 library(plyr)
 ddply(mtcars, "cyl", function(x) cor(x$hp, x$wt))

This splits the data in mtcars by cyl, applies for each subset x the function cor(x$hp, x$wt) and then aggregates the results for each of the subsets in a data.frame.

I can highly recommend the plyr package. It's one of the packages I use most in R.


Edit: As per request, here a dplyr version. I have to say that I am not a big dplyr user, but the code should be ok.

library(dplyr)
mtcars %>% group_by(cyl) %>% summarise(V1=cor(hp, wt))
于 2013-04-24T01:20:17.527 回答
8

baseR 中,它的工作是splitand lapplyorsapply

lapply(split(mtcars, mtcars$cyl), function(X) cor(X$hp, X$wt))
## $`4`
## [1] 0.1598761
## 
## $`6`
## [1] -0.3062284
## 
## $`8`
## [1] 0.01761795
## 


sapply(split(mtcars, mtcars$cyl), function(X) cor(X$hp, X$wt))
##           4           6           8 
##  0.15987614 -0.30622844  0.01761795 
于 2013-04-24T02:12:55.213 回答