r - calculate differences in dataframe

Question

I have a dataframe that looks like this:

set.seed(50)
data.frame(distance=c(rep("long", 5), rep("short", 5)),
           year=rep(2002:2006),
           mean.length=rnorm(10))

   distance year mean.length
1      long 2002  0.54966989
2      long 2003 -0.84160374
3      long 2004  0.03299794
4      long 2005  0.52414971
5      long 2006 -1.72760411
6     short 2002 -0.27786453
7     short 2003  0.36082844
8     short 2004 -0.59091244
9     short 2005  0.97559055
10    short 2006 -1.44574995

I need to calculate the difference in mean.length between long and short for each year. What's the fastest way of doing this?

score 5 · Accepted Answer

Here is one way, using plyr:

set.seed(50)
df <- data.frame(distance=c(rep("long", 5),rep("short", 5)),
                 year=rep(2002:2006),
                 mean.length=rnorm(10))

library(plyr)
aggregation.fn <- function(df) {
  data.frame(year=df$year[1],
             diff=(df$mean.length[df$distance == "long"] -
                   df$mean.length[df$distance == "short"]))}
new.df <- ddply(df, "year", aggregation.fn)

which gives you

> new.df
  year       diff
1 2002  0.8275344
2 2003 -1.2024322
3 2004  0.6239104
4 2005 -0.4514408
5 2006 -0.2818542

A second way, in base R:

df <- df[order(df$year, df$distance), ]
n <- dim(df)[1]
df$new.year <- c(1, df$year[2:n] != df$year[1:(n-1)])
df$diff <- c(-diff(df$mean.length), NA)
df$diff[!df$new.year] <- NA
new.df.2 <- df[!is.na(df$diff), c("year", "diff")]

all(new.df.2 == new.df)  # TRUE

score 3 · Accepted Answer

Use tapply() together with apply(), like this (x is the data frame from the question):

apply(
  with(x, tapply(mean.length, list(year, distance), FUN=mean)),
  1, 
  diff
)

      2002       2003       2004       2005       2006 
-0.8275344  1.2024322 -0.6239104  0.4514408  0.2818542

This works because tapply creates a tabular summary by year and distance:

with(x, tapply(mean.length, list(year, distance), FUN=mean))

            long      short
2002  0.54966989 -0.2778645
2003 -0.84160374  0.3608284
2004  0.03299794 -0.5909124
2005  0.52414971  0.9755906
2006 -1.72760411 -1.4457499
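
Note that diff() applied to each row of this table returns short minus long, which is the opposite sign from the other answers. If you want long minus short, one option is to subtract the table's columns by name. A minimal sketch, assuming the data frame is called x and is rebuilt from the question's code (the names tab and long.minus.short are only illustrative):

```r
# Rebuild the question's data; the name x follows this answer's usage
set.seed(50)
x <- data.frame(distance = c(rep("long", 5), rep("short", 5)),
                year = rep(2002:2006),
                mean.length = rnorm(10))

# Year-by-distance table of means (one value per cell in this data)
tab <- with(x, tapply(mean.length, list(year, distance), FUN = mean))

# Subtract by column name, so the result does not depend on column order
long.minus.short <- tab[, "long"] - tab[, "short"]
long.minus.short
```

The result is a numeric vector named by year, like the apply() output but with the sign flipped.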

score 2 · Accepted Answer

Since you seem to have paired values and the data.frame is ordered, you can do this:

res <- with(DF, mean.length[distance=="long"]-mean.length[distance=="short"])
names(res) <- unique(DF$year)

#     2002       2003       2004       2005       2006 
#0.8275344 -1.2024322  0.6239104 -0.4514408 -0.2818542

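This should be fast, but it is less safe than the other answers because it relies on those assumptions (every year has exactly one long and one short value, and both groups are in the same year order).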
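
If you want to keep the vectorized subtraction but guard against those assumptions, one possible approach is to sort each group by year and check the pairing before subtracting. This is only a sketch, assuming the data frame is called DF as in this answer; the helper names long and short are illustrative:

```r
# Rebuild the question's data under the name DF used in this answer
set.seed(50)
DF <- data.frame(distance = c(rep("long", 5), rep("short", 5)),
                 year = rep(2002:2006),
                 mean.length = rnorm(10))

# Split into the two groups and sort each by year so rows line up
long  <- DF[DF$distance == "long", ]
short <- DF[DF$distance == "short", ]
long  <- long[order(long$year), ]
short <- short[order(short$year), ]

# Fail loudly if a year is repeated or the two groups cover different years
stopifnot(!anyDuplicated(long$year),
          !anyDuplicated(short$year),
          identical(long$year, short$year))

res <- long$mean.length - short$mean.length
names(res) <- long$year
res
```

The extra sorting and checks cost a little time, so this trades some of the speed for safety.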

score 1 · Accepted Answer

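You have already received some good answers for the specific problem at hand. It may also make sense for you to consider reshaping the data into wide format. Here are two options: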

reshape(df, direction = "wide", idvar = "year", timevar = "distance")
#---
  year mean.length.long mean.length.short
1 2002       0.54966989        -0.2778645
2 2003      -0.84160374         0.3608284
3 2004       0.03299794        -0.5909124
4 2005       0.52414971         0.9755906
5 2006      -1.72760411        -1.4457499

#package reshape2 is probably easier to use.
library(reshape2)
dcast(df, year ~ distance, value.var = "mean.length")
#---
  year        long      short
1 2002  0.54966989 -0.2778645
2 2003 -0.84160374  0.3608284
3 2004  0.03299794 -0.5909124
4 2005  0.52414971  0.9755906
5 2006 -1.72760411 -1.4457499

You can now easily compute new statistics.
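
For example, the per-year difference becomes a simple column subtraction on the wide data. A minimal sketch using the base reshape() call from above, assuming df is the question's data frame (the names wide and diff are illustrative):

```r
# Rebuild the question's data
set.seed(50)
df <- data.frame(distance = c(rep("long", 5), rep("short", 5)),
                 year = rep(2002:2006),
                 mean.length = rnorm(10))

# One row per year, one mean.length column per distance
wide <- reshape(df, direction = "wide", idvar = "year", timevar = "distance")

# Difference between long and short for each year
wide$diff <- wide$mean.length.long - wide$mean.length.short
wide[, c("year", "diff")]
```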
