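I'm new to R (I've used MATLAB before) and have spent a long time looking for a solution, but I can't find one for this (seemingly) very simple problem. Here is the problem:

In the first column I have time values spanning several days (simplified in this example), and in the second column I have the values I want to average. What I want to do is take all values that belong to the same time of day and average them. I'm doing this on a fairly large dataset, so automating it would help a lot.

Let's set it up: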

time = rep( c("00:00", "00:10", "00:20", "00:30", "00:40", "00:50", "01:00", "01:10"), 5)
values = c(sample(1:100, 40))
data = cbind(time, values)

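So now I have my matrix containing times and values, and I want to group all values that have (for example) "00:00" and compute their mean. After some searching I found that the aggregate() function could help, so I did the following: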

aggregate(as.numeric(data[,-1]), by = list(sort(data[,1])), mean) 

with the output

    Group.1    x
1   00:00 77.2
2   00:10 59.2
3   00:20 51.0
4   00:30 49.4
5   00:40 51.4
6   00:50 33.4
7   01:00 33.8
8   01:10 51.6

So it seems to work fine, but when I compute the means by hand they are all different (e.g. for 00:00: (56+3+91+71+8)/5 = 45.8 rather than 77.2). Can anyone tell me what I'm doing wrong?

3 Answers

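@joran's suggestion (don't scramble the by variable by sorting it; sort() reorders the time labels but leaves the values where they are, so the two no longer line up) seems to work: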

set.seed(101) ## for reproducibility
time = rep( c("00:00", "00:10", "00:20", "00:30", 
      "00:40", "00:50", "01:00", "01:10"), 5)
values = c(sample(1:100, 40))
data = cbind(time, values)
aggregate(as.numeric(data[,2]),by=list(factor(data[,1])), mean)
##   Group.1    x
## 1   00:00 50.0
## 2   00:10 29.0
## 3   00:20 45.0
## 4   00:30 60.2
## 5   00:40 48.8
## 6   00:50 57.2
## 7   01:00 37.2
## 8   01:10 56.2
##

Checking the first group:

mean(as.numeric(data[data[,1]=="00:00","values"]))
## [1] 50
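
To make the effect of sort() on the grouping variable concrete, here is a minimal sketch. The values are made up for illustration (they are not the question's random data) and are chosen so that the correct group means are obvious by eye; `right` and `wrong` are hypothetical names.

```r
# Made-up data: two time labels repeating in a cycle, as in the question.
time   <- rep(c("00:00", "00:10"), 3)
values <- c(1, 100, 2, 200, 3, 300)

# Grouping vector left in its original order: each label stays attached
# to its own value.
right <- aggregate(values, by = list(time), FUN = mean)

# Grouping vector sorted: the labels become
# "00:00" "00:00" "00:00" "00:10" "00:10" "00:10",
# but the values are not reordered, so the first three values
# (1, 100, 2) are all treated as "00:00".
wrong <- aggregate(values, by = list(sort(time)), FUN = mean)

print(right)  # means of (1, 2, 3) and (100, 200, 300)
print(wrong)  # means of (1, 100, 2) and (200, 3, 300)
```

The same mismatch, applied to 40 random values, is what would produce the unexpected means in the question.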

As a further recommendation, I would strongly suggest using data.frame rather than cbind()ing your columns -- this allows you to put times and numeric values together without getting them all coerced to the same type.

(It would be nice to use a built-in times object too: I tried times from the chron package but didn't quite get the hang of it)

dat <- data.frame(time,values)  ## avoid using "data" as a variable name
aggregate(values~time, data=dat, mean)

is much easier to read.
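
As a rough illustration of the type-coercion point, the sketch below builds both the matrix and the data frame from the same vectors and checks the column types. The name `m` is a hypothetical stand-in for the matrix that the question calls `data`, and the tapply() cross-check is only one possible way to verify the aggregate() result.

```r
set.seed(101)  # for reproducibility within one R version
time <- rep(c("00:00", "00:10", "00:20", "00:30",
              "00:40", "00:50", "01:00", "01:10"), 5)
values <- sample(1:100, 40)

m   <- cbind(time, values)       # matrix: every cell is coerced to character
dat <- data.frame(time, values)  # data frame: each column keeps its own type

is.character(m[, "values"])  # TRUE, hence the as.numeric() calls earlier
is.numeric(dat$values)       # TRUE, no conversion needed

# Formula interface: average `values` within each level of `time`.
res <- aggregate(values ~ time, data = dat, FUN = mean)

# Cross-check with tapply(), which groups by the same labels.
chk <- tapply(dat$values, dat$time, mean)
print(res)
print(chk)
```

Because the data frame keeps `values` numeric, the formula call needs no as.numeric() wrapper, and there is no separately handled grouping vector that could drift out of order.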

By the way, there are a lot of posts on Stack Overflow comparing various solutions for aggregation (by, aggregate, ddply and friends from the plyr package, and the data.table package): e.g. Elegant way to solve ddply task with aggregate (hoping for better performance) , R: speeding up "group by" operations , How to speed up summarise and ddply? ...

answered 2012-11-28T15:55:29.317

by is your friend:

by(as.numeric(data[,"values"]),data[,"time"],mean)
answered 2012-11-28T15:31:49.787

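I suggest converting your index variable (time) to a factor using as.factor().

Then use that as the index (this assumes data is a data frame with a time.factor column, since $ does not work on a matrix), i.e.: aggregate(data$values,by=list(data$time.factor),FUN=mean)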
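
A minimal sketch of this approach, under the assumption that the data are held in a data frame (here called `dat`, with made-up values) rather than in the cbind() matrix from the question, where `$` would fail:

```r
# Made-up data: four time labels repeated over three "days".
time   <- rep(c("00:00", "00:10", "00:20", "00:30"), 3)
values <- c(10, 20, 30, 40,
            11, 21, 31, 41,
            12, 22, 32, 42)
dat <- data.frame(time, values)

# Store the grouping variable as a factor in its own column.
dat$time.factor <- as.factor(dat$time)

# Group by the factor, in its original row order (no sorting).
res <- aggregate(dat$values, by = list(dat$time.factor), FUN = mean)
print(res)  # one row per time label; x holds the group means 11, 21, 31, 41
```

The factor is created from the column in place, so it stays aligned row by row with the values it labels.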

answered 2012-11-28T14:36:40.577