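(Sorry for the odd title, but I couldn't think of a short way to put this.)

Since I managed to oversimplify my problem in my last question, this time I'll give you the actual problem.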

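The data frame provided contains the columns "usr", "usrMsgCnt" and "isRefound", where usr is a name, usrMsgCnt is a number, and isRefound is binary.

A new column is to be added, with its value calculated as follows:

usrMsgCnt / (number of rows where usr equals this row's usr and isRefound equals 1)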

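For the first row of the sample data, the new value would be:

9 / 5, where 5 is given by length(data$usr[data$usr=="Jan.Schrader" & data$isRefound==1])

Given the size of the original dataset, looping over the rows is not an option.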

Here is the dput() output for a small part of the data:

structure(list(usr = structure(c(21L, 21L, 21L, 21L, 6L, 5L, 
6L, 6L, 6L, 21L, 20L, 21L, 6L, 20L, 21L, 21L, 21L, 6L, 6L, 6L
), .Label = c("alsmith", "Amanda.Coles", "Andrew.Coles", "babsimieth", 
"Bernd.Ludwig", "Bernhard.Schiemann", "bfueck", "Bram.Ridder", 
"brian.tripney", "carlosgardeazabal", "christine.elsweiler", 
"cmfinner", "daniel.goncalves", "david", "de56", "eko.ma", "freundlu", 
"gmcphail", "ian.ferguson", "Ian.Ruthven", "Jan.Schrader", "jearmour", 
"jyang", "Laura.Schnall", "Marc.Roper", "marek.maleika", "Martin.Hacker", 
"martin.scholz", "maziminke", "mclanger", "Michael.Cashmore", 
"morgan.harvey", "mrussell", "msherrif", "murray.wood", "Nadine.Mahrholz", 
"noam.ascher", "pburns", "Peter.Gregory", "raina", "robertnm", 
"ronald.teijeira", "ronaldtf", "sbenus", "starmstr", "steve.neely", 
"Sven.Friedemann", "tinchen"), class = "factor"), usrMsgCnt = c(9L, 
9L, 9L, 9L, 5L, 0L, 5L, 5L, 5L, 9L, 0L, 9L, 5L, 0L, 9L, 9L, 9L, 
37L, 37L, 37L), isRefound = c(0L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 
1L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 0L)), .Names = c("usr", 
"usrMsgCnt", "isRefound"), row.names = c(NA, 20L), class = "data.frame")

1 Answer


Assuming isRefound really is binary (so that its sum per usr equals the number of rows with isRefound == 1):

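library(data.table)
# DF is the data frame from the question; key = "usr" sorts the table by usr
DT <- data.table(DF, key = "usr")

# Divide each row's usrMsgCnt by the number of refound rows for that usr
DT[, newvar := usrMsgCnt / sum(isRefound), by = usr]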

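Edit: If the row order is essential, you should not set a key (setting one sorts the data.table), and you should create an index variable (to be on the safe side).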

DT <- data.table(DF)
DT[,id:=.I]
DT[,newvar:=usrMsgCnt/sum(isRefound),by=usr]
print(DT)

#                    usr usrMsgCnt isRefound id newvar
#  1:       Jan.Schrader         9         0  1    1.8
#  2:       Jan.Schrader         9         1  2    1.8
#  3:       Jan.Schrader         9         1  3    1.8
#  4:       Jan.Schrader         9         1  4    1.8
#  5: Bernhard.Schiemann         5         1  5    1.0
#  6:       Bernd.Ludwig         0         0  6    NaN
#  7: Bernhard.Schiemann         5         0  7    1.0
#  8: Bernhard.Schiemann         5         1  8    1.0
#  9: Bernhard.Schiemann         5         1  9    1.0
# 10:       Jan.Schrader         9         1 10    1.8
# 11:        Ian.Ruthven         0         0 11    NaN
# 12:       Jan.Schrader         9         0 12    1.8
# 13: Bernhard.Schiemann         5         1 13    1.0
# 14:        Ian.Ruthven         0         0 14    NaN
# 15:       Jan.Schrader         9         0 15    1.8
# 16:       Jan.Schrader         9         0 16    1.8
# 17:       Jan.Schrader         9         1 17    1.8
# 18: Bernhard.Schiemann        37         0 18    7.4
# 19: Bernhard.Schiemann        37         1 19    7.4
# 20: Bernhard.Schiemann        37         0 20    7.4
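
The NaN rows above are users with no refound rows and a usrMsgCnt of 0, so the division is 0/0. A user with messages but no refound rows would give Inf instead. If neither is wanted, one possible way to map both to NA is sketched below in base R; the helper name safe_ratio is hypothetical and not part of the question or of any package.

```r
# Hypothetical helper: divide, then replace NaN/Inf results with NA
safe_ratio <- function(num, den) {
  out <- num / den
  out[!is.finite(out)] <- NA_real_
  out
}

# 9/5 is finite; 0/0 (NaN) and 5/0 (Inf) both become NA
res <- safe_ratio(c(9, 0, 5), c(5, 0, 0))
res
# 1.8 NA NA
```

With the data.table approach this would presumably be used as newvar := safe_ratio(usrMsgCnt, sum(isRefound)) inside the same by = usr call.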

The same conceptual approach can be used with the base R and plyr approaches demonstrated in your last question:

within(DF, {
  newvar <- usrMsgCnt/ave(isRefound, usr, FUN = sum)
})
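
To see what ave() contributes here, a minimal sketch on a made-up data frame (toy and refound_per_usr are illustrative names, not from the question). ave() returns a vector as long as its input, with each element replaced by its group's sum, so the division stays aligned with the original row order.

```r
# Toy data, invented for illustration only
toy <- data.frame(
  usr       = c("a", "a", "b", "a", "c"),
  usrMsgCnt = c(6L, 6L, 4L, 6L, 0L),
  isRefound = c(1L, 0L, 1L, 1L, 0L)
)

# For every row, the number of refound rows belonging to that row's usr
refound_per_usr <- ave(toy$isRefound, toy$usr, FUN = sum)
refound_per_usr
# 2 2 1 2 0

# Row-wise division; usr "c" has 0/0 and therefore NaN
toy$newvar <- toy$usrMsgCnt / refound_per_usr
toy$newvar
# 3 3 4 3 NaN
```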

library(plyr)
ddply(DF, .(usr), transform,
      newvar = usrMsgCnt/sum(isRefound))

However, for large datasets, the data.table package will perform considerably better.

Answered 2013-03-21T19:35:36.183