r - Finding the mean of values grouped by the distinct entries in a column

Question

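I am new to R (I have used MATLAB before) and have been trying for quite a while to find a solution, but I cannot find one for this (seemingly) very simple problem. Here is the problem:

In the first column I have time values spanning several days (simplified in this example), and in the second column I have the values I want to average. What I want to do is take all values that belong to the same time and average them. I am doing this on a fairly large data set, so automating it would help a lot.

Let's set up: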

time = rep( c("00:00", "00:10", "00:20", "00:30", "00:40", "00:50", "01:00", "01:10"), 5)
values = c(sample(1:100, 40))
data = cbind(time, values)

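So now I have my matrix containing times and values, and I want to group all values that have (for example) "00:00" and compute their mean. After some searching I found that the aggregate() function could help a lot, so I did the following: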

aggregate(as.numeric(data[,-1]), by = list(sort(data[,1])), mean)

which gives the output

    Group.1    x
1   00:00 77.2
2   00:10 59.2
3   00:20 51.0
4   00:30 49.4
5   00:40 51.4
6   00:50 33.4
7   01:00 33.8
8   01:10 51.6

So it seems to work fine, but when I compute the means by hand, they are all different (e.g. for 00:00: (56+3+91+71+8)/5 = 45.8 rather than 77.2). Can anyone tell me what I am doing wrong?

score 2 · Accepted Answer

@joran's suggestion (don't scramble the by variable by sorting it) seems to work:

set.seed(101) ## for reproducibility
time = rep( c("00:00", "00:10", "00:20", "00:30", 
      "00:40", "00:50", "01:00", "01:10"), 5)
values = c(sample(1:100, 40))
data = cbind(time, values)
aggregate(as.numeric(data[,2]),by=list(factor(data[,1])), mean)
##   Group.1    x
## 1   00:00 50.0
## 2   00:10 29.0
## 3   00:20 45.0
## 4   00:30 60.2
## 5   00:40 48.8
## 6   00:50 57.2
## 7   01:00 37.2
## 8   01:10 56.2
##

Checking the first group:

mean(as.numeric(data[data[,1]=="00:00","values"]))
## [1] 50
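To see why sorting the by variable gives wrong means, here is a minimal sketch on made-up data (the values below are hypothetical, chosen so the result is easy to check by hand). sort() reorders the labels but leaves the values where they were, so each value ends up paired with a label that no longer belongs to it:

```r
# Hypothetical toy data: two time labels, alternating
time   <- rep(c("00:00", "00:10"), 3)
values <- c(1, 100, 2, 200, 3, 300)

# Aligned grouping: "00:00" collects 1, 2, 3 and "00:10" collects 100, 200, 300
right <- aggregate(values, by = list(time), FUN = mean)

# Sorted grouping: the first three values (1, 100, 2) are now labelled "00:00",
# the last three (200, 3, 300) are labelled "00:10"
wrong <- aggregate(values, by = list(sort(time)), FUN = mean)

print(right)
print(wrong)
```

The group labels in both results look identical, which is why the problem is easy to miss; only the means differ.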

As a further recommendation, I would strongly suggest using data.frame rather than cbind()ing your columns -- this allows you to put times and numeric values together without getting them all coerced to the same type.

(It would be nice to use a built-in times object too: I tried times from the chron package but didn't quite get the hang of it)

dat <- data.frame(time,values)  ## avoid using "data" as a variable name
aggregate(values~time, data=dat, mean)

is much easier to read.
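The coercion mentioned above can be checked directly. The following is a small sketch on hypothetical data, not the data from the question: cbind() on a character vector and a numeric vector yields a character matrix, whereas data.frame() keeps the numeric column numeric, so no as.numeric() call is needed before aggregating.

```r
# Hypothetical toy data
time   <- c("00:00", "00:10", "00:00", "00:10")
values <- c(10, 20, 30, 40)

# cbind() builds a matrix, and a matrix holds a single type:
# the numbers are silently converted to strings
m <- cbind(time, values)
print(is.character(m))       # the whole matrix is character
print(m[, "values"])         # the numbers are now strings

# data.frame() keeps one type per column
dat <- data.frame(time, values)
print(is.numeric(dat$values))

# The formula interface then works without any conversion
res <- aggregate(values ~ time, data = dat, FUN = mean)
print(res)
```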

By the way, there are a lot of posts on Stack Overflow comparing various solutions for aggregation (by, aggregate, ddply and friends from the plyr package, and the data.table package): e.g. "Elegant way to solve ddply task with aggregate (hoping for better performance)", "R: speeding up "group by" operations", "How to speed up summarise and ddply?" ...

score 1 · Accepted Answer

by is your friend:

by(as.numeric(data[,"values"]),data[,"time"],mean)

answered 2012-11-28T15:31:49.787
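A closely related base-R option is tapply(), which the R documentation describes by() as a wrapper around for data frames. The sketch below uses hypothetical toy data in the same character-matrix layout as the question. It returns a named array rather than a "by" object, which can be handier if the means are reused later.

```r
# Hypothetical toy data in the same layout as the question's matrix
time   <- rep(c("00:00", "00:10"), 3)
values <- c(1, 100, 2, 200, 3, 300)
data   <- cbind(time, values)   # character matrix, so values need as.numeric()

# One mean per distinct time label; the labels become the names of the result
group_means <- tapply(as.numeric(data[, "values"]), data[, "time"], mean)

print(group_means)
print(group_means["00:00"])     # look up a single group by its label
```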

score 0 · Accepted Answer

I would suggest converting the index variable (time) to a factor using as.factor().

Then use that as the index, i.e.: aggregate(data$values,by=list(data$time.factor),FUN=mean)
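The call above uses $, which works on a data frame but not on the matrix built with cbind() in the question, and it expects a time.factor column to exist already. A minimal runnable sketch follows. It assumes the data is stored in a data frame, here called dat, with hypothetical values; time.factor is simply the column name taken from the answer.

```r
# Hypothetical toy data held in a data frame rather than a matrix
dat <- data.frame(
  time   = rep(c("00:00", "00:10"), 3),
  values = c(1, 100, 2, 200, 3, 300),
  stringsAsFactors = FALSE
)

# Create the factor column that the aggregate() call refers to
dat$time.factor <- as.factor(dat$time)

# Group by the factor, in its original (unsorted) row order
res <- aggregate(dat$values, by = list(dat$time.factor), FUN = mean)
print(res)
```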
