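I'm trying to run some basic statistics (and later, more in-depth ones) on a data frame of sales with categorical variables. Besides the sales amount, it tracks area (where the merchant is located), day of the week, time of day (Lunch, After Work, etc.), and various other information.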

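Here is a small random subset of the data. (Note that this is a stripped-down representation: the actual data frame has 38 columns, and I removed most of the ones that don't apply.)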

    structure(list(dayofweek = structure(c(4L, 7L, 3L, 7L, 3L, 2L,
    2L, 7L, 3L, 3L, 2L, 7L, 5L, 5L, 2L, 5L, 1L, 3L, 7L, 3L, 4L, 1L,
    3L, 5L, 7L), .Label = c("Friday", "Monday", "Saturday", "Sunday",
    "Thursday", "Tuesday", "Wednesday"), class = "factor"), timeofday = structure(c(6L,
    4L, 5L, 5L, 2L, 6L, 6L, 5L, 6L, 3L, 6L, 3L, 5L, 4L, 1L, 3L, 5L,
    6L, 5L, 4L, 6L, 6L, 3L, 2L, 5L), .Label = c("After Work", "Early AM",
    "Evening", "Late AM", "Lunch", "MidAfternoon", "Overnight"), class = "factor"),
    area = c(6L, 4L, 4L, 5L, 5L, 1L, 4L, 2L, 3L, 2L, 7L, 3L,
    7L, 5L, 7L, 4L, 1L, 4L, 1L, 4L, 5L, 7L, 1L, 3L, 7L), totsales = c(40,
    6, 5, 10, 1, 0, 0, 3, 5, 3, 10, 30, 2, 1, 2, 22, 8, 1, 1,
    5, 11, 20, 0, 1, 5)), .Names = c("dayofweek", "timeofday",
    "area", "totsales"), class = "data.frame", row.names = c(192278L,
    140773L, 121051L, 157984L, 154299L, 258034L, 108031L, 43760L,
    78005L, 42103L, 95603L, 98431L, 30252L, 165303L, 40713L, 108252L,
    304549L, 137041L, 268473L, 124599L, 161253L, 12897L, 240815L,
    89439L, 21032L))

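The first thing I want to do is get the mean and median sales for each area and each time of day. I want R to loop through each list and return all the values. I tried this: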

    vallist <- list(a = c("Early AM", "Late AM", "Lunch", "MidAfternoon", "After Work",
                          "Evening", "Overnight"),
                    b = c(1, 2, 3, 4, 5, 6, 7))

    sapply(vallist[['b']], function(x)
        mapply(function(a, b) mean(data$totsales[which(data$timeofday == a & data$area == b)]),
               vallist[['a']], vallist[['b']])
    )

However, it only applies the mean to each time of day in area 1, not to each time of day in areas 1-7. So my results look like this:

                      [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]
    Early AM      9.192847  9.192847  9.192847  9.192847  9.192847  9.192847  9.192847
    Late AM       8.020678  8.020678  8.020678  8.020678  8.020678  8.020678  8.020678
    Lunch        10.096277 10.096277 10.096277 10.096277 10.096277 10.096277 10.096277
    MidAfternoon 11.503961 11.503961 11.503961 11.503961 11.503961 11.503961 11.503961
    After Work    8.206124  8.206124  8.206124  8.206124  8.206124  8.206124  8.206124
    Evening      11.457599 11.457599 11.457599 11.457599 11.457599 11.457599 11.457599
    Overnight    11.415667 11.415667 11.415667 11.415667 11.415667 11.415667 11.415667

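These are the correct answers for area 1, but as you can see the values are the same for every area. How can I get R to apply the function across multiple lists and return the values for all combinations?

The next steps will be to apply the median and to evaluate at the area level and across the different weekdays, but I assume the same idea will apply to all the different combinations.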
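
A side note on why every column comes out identical: in the code above the outer function(x) never uses x, and mapply() steps through its two vectors in parallel (first time slot with first area, second with second, and so on), so each of the seven sapply() iterations returns the same vector. Below is a minimal sketch of one way to visit every time/area pair with two nested sapply() calls. The data frame is a tiny made-up one (hypothetical values, not the real sales data), and the names times, areas and res are only for illustration.

```r
# Toy data: hypothetical values, used only to illustrate the looping pattern
data <- data.frame(
  timeofday = c("Lunch", "Lunch", "Evening", "Lunch", "Evening", "Evening"),
  area      = c(1, 1, 1, 2, 2, 2),
  totsales  = c(8, 1, 0, 3, 3, 5)
)

times <- c("Lunch", "Evening")
areas <- c(1, 2)

# Outer loop over areas, inner loop over times. The inner function uses both
# the current time (a) and the current area (b), so every pair is evaluated.
res <- sapply(areas, function(b)
  sapply(times, function(a)
    mean(data$totsales[data$timeofday == a & data$area == b])))

# Rows are already named after the time slots; label the columns by area
colnames(res) <- areas
res
```

The same pattern should carry over to median by swapping the function called in the innermost line.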

2 Answers

For this particular case, you can reproduce your results with:

    library(reshape2)
    dcast(data[-1], timeofday ~ area, fun.aggregate = mean, fill = 0)

which produces:

         timeofday   1 2  3    4  5  6    7
    1   After Work 0.0 0  0  0.0  0  0  2.0
    2     Early AM 0.0 0  1  0.0  1  0  0.0
    3      Evening 0.0 3 30 22.0  0  0  0.0
    4      Late AM 0.0 0  0  5.5  1  0  0.0
    5        Lunch 4.5 3  0  5.0 10  0  3.5
    6 MidAfternoon 0.0 0  5  0.5 11 40 15.0

I'm fairly sure the differences from your results are because the data you posted is a subset of the whole.
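
If reshape2 is not available, the same kind of table can be sketched in base R with tapply(). This is only an illustration, rebuilt from three columns of the sample posted in the question; the names tod_levels, mean_tab and median_tab are assumptions made for the example. Unlike fill=0, empty combinations stay NA here, so a cell with no sales records is not confused with a true mean of 0.

```r
# Rebuild timeofday, area and totsales from the sample in the question
tod_levels <- c("After Work", "Early AM", "Evening", "Late AM",
                "Lunch", "MidAfternoon", "Overnight")
data <- data.frame(
  timeofday = factor(tod_levels[c(6, 4, 5, 5, 2, 6, 6, 5, 6, 3, 6, 3, 5,
                                  4, 1, 3, 5, 6, 5, 4, 6, 6, 3, 2, 5)],
                     levels = tod_levels),
  area = c(6, 4, 4, 5, 5, 1, 4, 2, 3, 2, 7, 3, 7,
           5, 7, 4, 1, 4, 1, 4, 5, 7, 1, 3, 7),
  totsales = c(40, 6, 5, 10, 1, 0, 0, 3, 5, 3, 10, 30, 2,
               1, 2, 22, 8, 1, 1, 5, 11, 20, 0, 1, 5)
)

# One cell per timeofday/area combination; combinations with no rows are NA
mean_tab   <- with(data, tapply(totsales, list(timeofday, area), mean))
median_tab <- with(data, tapply(totsales, list(timeofday, area), median))

mean_tab
median_tab
```

Swapping timeofday for another grouping column in the list() call would give the corresponding table for that variable.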

Answered 2014-03-03T17:09:07.443

Converting my comment to an answer....

It looks like you might be interested in aggregate() (although there are many ways to aggregate data in R).

    out <- aggregate(totsales ~ timeofday + area, data, mean)
    out
    #       timeofday area totsales
    # 1       Evening    1      0.0
    # 2         Lunch    1      4.5
    # 3  MidAfternoon    1      0.0
    # 4       Evening    2      3.0
    # 5         Lunch    2      3.0
    # 6      Early AM    3      1.0
    # 7       Evening    3     30.0
    # 8  MidAfternoon    3      5.0
    # 9       Evening    4     22.0
    # 10      Late AM    4      5.5
    # 11        Lunch    4      5.0
    # 12 MidAfternoon    4      0.5
    # 13     Early AM    5      1.0
    # 14      Late AM    5      1.0
    # 15        Lunch    5     10.0
    # 16 MidAfternoon    5     11.0
    # 17 MidAfternoon    6     40.0
    # 18   After Work    7      2.0
    # 19        Lunch    7      3.5
    # 20 MidAfternoon    7     15.0

If you want to go from there to a wide format, you can use reshape(), for example: reshape(out, direction = "wide", idvar = "timeofday", timevar = "area")
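
A sketch of how the aggregate step and the reshape step might fit together, here with median instead of mean since that was the stated next step in the question. It rebuilds three columns of the sample from the question; tod_levels, out_median and wide are illustrative names, not anything from the original data.

```r
# Rebuild timeofday, area and totsales from the sample in the question
tod_levels <- c("After Work", "Early AM", "Evening", "Late AM",
                "Lunch", "MidAfternoon", "Overnight")
data <- data.frame(
  timeofday = factor(tod_levels[c(6, 4, 5, 5, 2, 6, 6, 5, 6, 3, 6, 3, 5,
                                  4, 1, 3, 5, 6, 5, 4, 6, 6, 3, 2, 5)],
                     levels = tod_levels),
  area = c(6, 4, 4, 5, 5, 1, 4, 2, 3, 2, 7, 3, 7,
           5, 7, 4, 1, 4, 1, 4, 5, 7, 1, 3, 7),
  totsales = c(40, 6, 5, 10, 1, 0, 0, 3, 5, 3, 10, 30, 2,
               1, 2, 22, 8, 1, 1, 5, 11, 20, 0, 1, 5)
)

# Long format: one row per timeofday/area combination that actually occurs
out_median <- aggregate(totsales ~ timeofday + area, data, median)

# Wide format: one row per timeofday, one totsales.<area> column per area;
# combinations that never occur are filled with NA
wide <- reshape(out_median, direction = "wide",
                idvar = "timeofday", timevar = "area")
wide
```

Replacing timeofday with dayofweek in the formula (on the full data frame) would presumably give the per-weekday version in the same way.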

Answered 2014-03-03T17:20:33.343