r - Summarizing data that is already grouped in R

Question

I have the following dataset in R, where ID = Custid:

ID Geo Channel Brand Neworstream RevQ112 RevQ212 RevQ312
1  NA  On-line  1      New         5         0       1
1  NA  On-line  1      Stream      5         0       1
3  EU  Tele     2       Stream     5         1       0

I would like to convert the dataset so that its columns are in this format:

ID Geo Brand Neworstream OnlineRevQ112 TeleRevQ112 OnlineRevQ212 TeleRevQ212

What is the best way to do this? I cannot figure out the right command in R.

Thanks in advance.

score 4 · Accepted Answer

You can use the reshape2 package and its functions melt and dcast to restructure the data.

data <- structure(list(ID = c(1L, 1L, 3L), Geo = structure(c(NA, NA, 
1L), .Label = "EU", class = "factor"), Channel = structure(c(1L, 
1L, 2L), .Label = c("On-line", "Tele"), class = "factor"), Brand = c(1L, 
1L, 2L), Neworstream = structure(c(1L, 2L, 2L), .Label = c("New", 
"Stream"), class = "factor"), RevQ112 = c(5L, 5L, 5L), RevQ212 = c(0L, 
0L, 1L), RevQ312 = c(1L, 1L, 0L)), .Names = c("ID", "Geo", "Channel", 
"Brand", "Neworstream", "RevQ112", "RevQ212", "RevQ312"), class = "data.frame", row.names = c(NA, 
-3L)) 

library(reshape2)
## melt data
df_long<-melt(data,id.vars=c("ID","Geo","Channel","Brand","Neworstream"))

## recast in combinations of channel and time frame
dcast(df_long,... ~Channel+variable,sum)
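
For reference, here is the same melt/dcast idea with the ID columns spelled out instead of `...`. This is only a sketch: it assumes reshape2 is installed and that the "NA" in Geo is kept as a plain string. The names `sales`, `quarter`, and `revenue` are my own labels, not something from the question.

```r
library(reshape2)

# Same three rows as the question; Geo is kept as the literal string "NA"
# (an assumption about what the abbreviation means).
sales <- data.frame(
  ID = c(1L, 1L, 3L),
  Geo = c("NA", "NA", "EU"),
  Channel = c("On-line", "On-line", "Tele"),
  Brand = c(1L, 1L, 2L),
  Neworstream = c("New", "Stream", "Stream"),
  RevQ112 = c(5L, 5L, 5L),
  RevQ212 = c(0L, 0L, 1L),
  RevQ312 = c(1L, 1L, 0L),
  stringsAsFactors = FALSE
)

# Long format: one row per original row and revenue column.
sales_long <- melt(sales,
                   id.vars = c("ID", "Geo", "Channel", "Brand", "Neworstream"),
                   variable.name = "quarter", value.name = "revenue")

# Wide format: one column per Channel/quarter combination, summing revenue.
sales_wide <- dcast(sales_long,
                    ID + Geo + Brand + Neworstream ~ Channel + quarter,
                    value.var = "revenue", fun.aggregate = sum)
names(sales_wide)
```

Because `sum` is used as the aggregation function, Channel/quarter combinations that do not occur for a given ID row come out as 0 rather than NA, which may or may not be what you want.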

score 2 · Accepted Answer

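Update/facepalm

The "NA" in your dataset is probably not an NA value but "NA" as an abbreviation for North America, or something similar.

If you set na.strings when reading in your data, there should be no problem using reshape as I originally suggested: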

mydf <- read.table(header = TRUE, na.strings = "", 
text = 'ID Geo Channel Brand Neworstream RevQ112 RevQ212 RevQ312
1  NA  On-line  1      New         5         0       1
1  NA  On-line  1      Stream      5         0       1
3  EU  Tele     2       Stream     5         1       0')

reshape(mydf, direction = "wide",
        idvar = c("ID", "Geo", "Brand", "Neworstream"),
        timevar = "Channel")

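(However, I would suggest changing the abbreviation to improve readability and reduce confusion!)

Original answer (because there is still some interesting stuff about `reshape` in there)

This should do it: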
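
To make the difference concrete, here is a small base R sketch comparing the default reading with `na.strings = ""`, followed by one possible relabeling. The label "NorthAm" is purely hypothetical, and the whole thing assumes that "NA" really does stand for North America.

```r
txt <- "ID Geo Channel
1 NA On-line
3 EU Tele"

# Default: the token "NA" is read as a missing value.
geo_default <- read.table(text = txt, header = TRUE,
                          stringsAsFactors = FALSE)$Geo

# With na.strings = "", "NA" is kept as an ordinary string.
geo_literal <- read.table(text = txt, header = TRUE, na.strings = "",
                          stringsAsFactors = FALSE)$Geo

is.na(geo_default)   # TRUE for the first row, FALSE for the second
geo_literal          # both values are plain strings, none missing

# Hypothetical relabeling to a code that cannot be mistaken for a missing value.
geo_clear <- ifelse(geo_literal == "NA", "NorthAm", geo_literal)
geo_clear
```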

reshape(mydf, direction = "wide", 
        idvar = c("ID", "Geo", "Brand", "Neworstream"), 
        timevar = "Channel")
#   ID  Geo Brand Neworstream RevQ112.On-line RevQ212.On-line RevQ312.On-line
# 1  1 <NA>     1         New               5               0               1
# 3  3   EU     2      Stream              NA              NA              NA
#   RevQ112.Tele RevQ212.Tele RevQ312.Tele
# 1           NA           NA           NA
# 3            5            1            0

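Update (an attempt to salvage the answer)

As @Arun points out, the above is not entirely correct. The culprit is interaction(), which reshape() uses to create a new temporary ID variable when more than one ID variable is specified.

Here is the relevant line from reshape(), and what it produces when applied to our "mydf" object (read with the default na.strings, so Geo contains real NA values):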

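## Line from the source of reshape() (not meant to be run on its own):
# data[, tempidname] <- interaction(data[, idvar], drop = TRUE)
## The same call applied to the ID columns of mydf: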
interaction(mydf[c(1, 2, 4, 5)], drop = TRUE)
# [1] <NA>          <NA>          3.EU.2.Stream
# Levels: 3.EU.2.Stream

Hmm. This seems to collapse to just two IDs, NA and 3.EU.2.Stream.

What happens if we replace NA with ""?
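
One way to spot this kind of collapse before reshaping is to compare the number of distinct ID rows with the number of usable levels that interaction() returns. This is a rough sketch in base R; the object names (`ids`, `combined`) are my own.

```r
# ID columns from the question, with Geo as a real missing value.
ids <- data.frame(
  ID = c(1L, 1L, 3L),
  Geo = c(NA, NA, "EU"),
  Brand = c(1L, 1L, 2L),
  Neworstream = c("New", "Stream", "Stream"),
  stringsAsFactors = FALSE
)

# interaction() yields NA for any row where at least one column is NA.
combined <- interaction(ids, drop = TRUE)

nrow(unique(ids))      # distinct ID rows we expect to keep
nlevels(combined)      # distinct IDs that interaction() can actually represent
sum(is.na(combined))   # rows whose combined ID got lost
```

If the first two numbers differ, or the last one is not zero, reshape() will not see every ID combination as distinct.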

mydf$Geo <- as.character(mydf$Geo)
mydf$Geo[is.na(mydf$Geo)] <- ""
interaction(mydf[c(1, 2, 4, 5)], drop = TRUE)
# [1] 1..1.New      1..1.Stream   3.EU.2.Stream
# Levels: 1..1.New 1..1.Stream 3.EU.2.Stream

Ahh. That's a little better. We now have three unique IDs... and reshape() seems to work.

reshape(mydf, direction = "wide", 
        idvar=names(mydf)[c(1, 2, 4, 5)], 
        timevar="Channel")
#   ID Geo Brand Neworstream RevQ112.On-line RevQ212.On-line
# 1  1         1         New               5               0
# 2  1         1      Stream               5               0
# 3  3  EU     2      Stream              NA              NA
#   RevQ312.On-line RevQ112.Tele RevQ212.Tele RevQ312.Tele
# 1               1           NA           NA           NA
# 2               1           NA           NA           NA
# 3              NA            5            1            0
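
To get closer to the column names asked for in the question (OnlineRevQ112, TeleRevQ112, ...), the NA replacement and a renaming step could be wrapped up roughly like this. Both `fill_id_na()` and `channel_first()` are hypothetical helpers written for this sketch, not functions from any package, and the regular expression assumes the measure columns all look like RevQ followed by digits.

```r
# Same rows as the question, with Geo read as a real missing value.
sales <- data.frame(
  ID = c(1L, 1L, 3L),
  Geo = c(NA, NA, "EU"),
  Channel = c("On-line", "On-line", "Tele"),
  Brand = c(1L, 1L, 2L),
  Neworstream = c("New", "Stream", "Stream"),
  RevQ112 = c(5L, 5L, 5L),
  RevQ212 = c(0L, 0L, 1L),
  RevQ312 = c(1L, 1L, 0L),
  stringsAsFactors = FALSE
)
idvar <- c("ID", "Geo", "Brand", "Neworstream")

# Hypothetical helper: replace NA in ID columns with a placeholder string,
# leaving columns without NA untouched.
fill_id_na <- function(df, cols, placeholder = "") {
  for (col in cols) {
    if (anyNA(df[[col]])) {
      values <- as.character(df[[col]])
      values[is.na(values)] <- placeholder
      df[[col]] <- values
    }
  }
  df
}

# Hypothetical helper: "RevQ112.On-line" becomes "OnlineRevQ112".
channel_first <- function(x) {
  swapped <- sub("^(RevQ[0-9]+)\\.(.+)$", "\\2\\1", x)
  gsub("-", "", swapped, fixed = TRUE)
}

wide <- reshape(fill_id_na(sales, idvar), direction = "wide",
                idvar = idvar, timevar = "Channel")
names(wide) <- channel_first(names(wide))
names(wide)
```

Note that `channel_first()` strips hyphens from every column name it is given, which is harmless here but worth checking if your ID columns have hyphenated names.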
