
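I have a set of measurements taken over a period of days. The number of measurements is typically 4. The range of values that can be captured in any measurement is 1-5 (in real life, depending on the test set, the range can be as high as 100 or as low as 20).

For each day, I want to count how many times each value occurred before that day.

Let me explain with some sample data: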

# test data creation
library(data.table)

d1 = list(as.Date("2013-5-4"),  4,2)
d2 = list(as.Date("2013-5-9"),  2,5)
d3 = list(as.Date("2013-5-16"), 3,2)
d4 = list(as.Date("2013-5-30"), 1,4)

d = rbind(d1,d2,d3,d4)
colnames(d) <- c("Date", "V1", "V2")

tt = as.data.table(d)

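I want to run a function that adds 5 columns (one for each value in the range of possible values). In each column, I want the COUNT of occurrences of that value before the test date.

For example, for 2013-5-30 the output of the function would be C1=0, C2=3, C3=1, C4=1, C5=1.

It counts how many times:

1 occurred before, and not including, 5/30, which is zero
2 occurred before, and not including, 5/30, which is three
3 occurred before, and not including, 5/30, which is one, and so on

In addition, it should include a column showing each value's share of the total measurements. For example, for 5/30 there were 6 measurements before 5/30, so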

pc1=(0/6), pc2=3/6, pc3=1/6, pc4=1/6, pc5= 1/6
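To make the arithmetic concrete, here is a minimal base R sketch that checks the 5/30 example by hand. The names prior, counts and pc are introduced only for illustration, and the value range 1-5 is assumed from the sample data.

```r
# Measurements taken before 2013-5-30 (rows d1..d3 of the sample data)
prior <- c(4, 2,   # 2013-5-4
           2, 5,   # 2013-5-9
           3, 2)   # 2013-5-16

# tabulate() counts how often each of 1..5 occurs
counts <- tabulate(prior, nbins = 5)
names(counts) <- paste0("C", 1:5)
counts
# C1 C2 C3 C4 C5
#  0  3  1  1  1

# Share of all prior measurements (6 in total)
pc <- counts / length(prior)
names(pc) <- paste0("PC", 1:5)
pc
```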

I would like to use the data.table assignment operator ( := ) to add these multiple columns in one go. The output format I am looking for is:

Date V1 V2 C1 PC1 C2 PC2 C3 PC3 C4 PC4 C5 PC5

3 Answers

5

1. data.table

First, replace the odd construct in the question with the more usual construct t:

library(data.table)
t <- data.table(
  Date = as.Date(c("2013-5-4", "2013-5-9", "2013-5-16", "2013-5-30")),
  V1 = c(4, 2, 3, 1),
  V2 = c(2, 5, 2, 4)
)

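Now tabulate each row and use cumsum to accumulate the preceding rows. perm is a permutation vector that interleaves the column numbers of the C columns (nc + 1:n) and the PC columns (nc + n + 1:n).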

nc <- ncol(t) # 3
n <- t[, max(V1, V2)] # 5

Cnames <- paste0("C", 1:n)
PCnames <- paste0("PC", 1:n)

perm <- c(1:nc, rbind(nc + 1:n, nc + n + 1:n))

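# 1) per-row counts of each value, 2) counts strictly before each row,
# 3) shares of the nc - 1 measurements per earlier row, 4) interleave C/PC columns
t[, (Cnames) := as.list(tabulate(c(V1, V2), n)), by = 1:nrow(t)
  ][, (Cnames) := lapply(.SD, function(x) cumsum(x) - x), .SDcols = Cnames
    ][, (PCnames) := lapply(.SD, function(x) x / seq(0, length.out = .N, by = nc - 1)), .SDcols = Cnames
      ][, ..perm]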

The last line gives:

         Date V1 V2 C1 PC1 C2 PC2 C3       PC3 C4       PC4 C5       PC5
1: 2013-05-04  4  2  0 NaN  0 NaN  0       NaN  0       NaN  0       NaN
2: 2013-05-09  2  5  0   0  1 0.5  0 0.0000000  1 0.5000000  0 0.0000000
3: 2013-05-16  3  2  0   0  2 0.5  0 0.0000000  1 0.2500000  1 0.2500000
4: 2013-05-30  1  4  0   0  3 0.5  1 0.1666667  1 0.1666667  1 0.1666667
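The same tabulate-then-cumsum idea can be sketched in base R on a plain matrix, which may help in seeing what each step of the chain does. The names m, per_day, before, n_prior and share are illustrative and not part of the solution above.

```r
# Sample measurements, one row per date (V1, V2)
m <- cbind(V1 = c(4, 2, 3, 1), V2 = c(2, 5, 2, 4))
n <- max(m)  # largest observed value, 5 here

# Per-day counts: row i holds how often each of 1..n occurs on day i
per_day <- t(apply(m, 1, tabulate, nbins = n))

# Running totals minus the current day = counts strictly before each day
before <- apply(per_day, 2, cumsum) - per_day
colnames(before) <- paste0("C", 1:n)

# Number of measurements taken before each day: 0, 2, 4, 6
n_prior <- (seq_len(nrow(m)) - 1) * ncol(m)

# Shares; the first row is 0/0 = NaN because nothing precedes it
share <- before / n_prior
colnames(share) <- paste0("PC", 1:n)
```

Subtracting the current row from the running total is what excludes the test date itself, matching the "before and not including" requirement in the question.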

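1a. data.table alternative

If it is acceptable to omit the row for the first date (which is not very useful anyway, since there are no dates before the first date), then we can perform the following tedious but straightforward self-join: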

t <- data.table(
  Date = as.Date(c("2013-5-4", "2013-5-9", "2013-5-16", "2013-5-30")),
  V1 = c(4, 2, 3, 1),
  V2 = c(2, 5, 2, 4)
)
tt <- t[, one := 1]
setkey(tt, one)
tt[tt,,allow.cartesian=TRUE][Date > Date.1, list(
    C1 = sum(.SD == 1), PC1 = mean(.SD == 1), 
    C2 = sum(.SD == 2), PC2 = mean(.SD == 2), 
    C3 = sum(.SD == 3), PC3 = mean(.SD == 3), 
    C4 = sum(.SD == 4), PC4 = mean(.SD == 4), 
    C5 = sum(.SD == 5), PC5 = mean(.SD == 5)
), by = list(Date, V1, V2), .SDcols = c("V1.1", "V2.1")]

1b. data.table alternative

Alternatively, we can rewrite 1a more compactly (where tt, n, Cnames and PCnames are from above):

tt[tt,,allow.cartesian=TRUE][Date > Date.1, setNames(as.list(rbind(
   sapply(1:n, function(i, .SD) sum(.SD==i), .SD=.SD),
   sapply(1:n, function(i, .SD) mean(.SD==i), .SD=.SD)
  )), c(rbind(Cnames, PCnames))),
  by = list(Date, V1, V2), .SDcols = c("V1.1", "V2.1")]

2. sqldf

An alternative to data.table is to use SQL with this similarly tedious but straightforward self-join:

library(sqldf)
sqldf("select a.Date, a.V1, a.V2, 
sum(((b.V1 = 1) + (b.V2 = 1)) * (a.Date > b.Date)) C1,
sum(((b.V1 = 1) + (b.V2 = 1)) * (a.Date > b.Date)) / 
cast (2 * count(*) - 2 as real) PC1,
sum(((b.V1 = 2) + (b.V2 = 2)) * (a.Date > b.Date)) C2,
sum(((b.V1 = 2) + (b.V2 = 2)) * (a.Date > b.Date)) / 
cast (2 * count(*) - 2 as real) PC2,
sum(((b.V1 = 3) + (b.V2 = 3)) * (a.Date > b.Date)) C3,
sum(((b.V1 = 3) + (b.V2 = 3)) * (a.Date > b.Date)) / 
cast (2 * count(*) - 2 as real) PC3,
sum(((b.V1 = 4) + (b.V2 = 4)) * (a.Date > b.Date)) C4,
sum(((b.V1 = 4) + (b.V2 = 4)) * (a.Date > b.Date)) / 
cast (2 * count(*) - 2 as real) PC4,
sum(((b.V1 = 5) + (b.V2 = 5)) * (a.Date > b.Date)) C5,
sum(((b.V1 = 5) + (b.V2 = 5)) * (a.Date > b.Date)) / 
cast (2 * count(*) - 2 as real) PC5
from t a, t b where a.Date >= b.Date
group by a.Date")

2a. sqldf alternative

Another approach is to use string manipulation to create the SQL string above, like this:

f <- function(i) {
    s <- fn$identity("sum(((b.V1 = $i) + (b.V2 = $i)) * (a.Date > b.Date))")
    fn$identity("$s C$i,\n $s /\ncast (2 * count(*) - 2 as real) PC$i")
}
s <- fn$identity("select a.Date, a.V1, a.V2, `toString(sapply(1:5, f))`
    from t a, t b where a.Date >= b.Date
    group by a.Date")

sqldf(s)

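2b. Second sqldf alternative

If we are willing to have no output row for the first date, the SQL solution can be simplified considerably. This may make sense, since the first date has no earlier dates to tabulate: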

# avg() averages over the rows of b, each of which holds two measurements,
# so it is halved to give the share of all earlier measurements
sqldf("select a.Date, a.V1, a.V2, 
sum((b.V1 = 1) + (b.V2 = 1)) C1,
avg((b.V1 = 1) + (b.V2 = 1)) / 2 PC1,
sum((b.V1 = 2) + (b.V2 = 2)) C2,
avg((b.V1 = 2) + (b.V2 = 2)) / 2 PC2,
sum((b.V1 = 3) + (b.V2 = 3)) C3,
avg((b.V1 = 3) + (b.V2 = 3)) / 2 PC3,
sum((b.V1 = 4) + (b.V2 = 4)) C4,
avg((b.V1 = 4) + (b.V2 = 4)) / 2 PC4,
sum((b.V1 = 5) + (b.V2 = 5)) C5,
avg((b.V1 = 5) + (b.V2 = 5)) / 2 PC5
from t a, t b where a.Date > b.Date
group by a.Date")

As before, the SQL string can be built programmatically to avoid the repetition, in the same way as shown in the previous solution.
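As a rough sketch of that idea, the query for 2b could be assembled with base R's sprintf and paste instead of fn$identity. The helper name term and the variable sql are hypothetical. Note the division by 2: avg() averages over rows of b, and each row carries two measurements, so the per-row average is halved to match the PC definition in the question.

```r
# Build the C<i> / PC<i> select terms for one value i.
# avg() is divided by 2.0 because each row of b holds two measurements (V1, V2).
term <- function(i) {
  hit <- sprintf("(b.V1 = %d) + (b.V2 = %d)", i, i)
  sprintf("sum(%s) C%d,\n avg(%s) / 2.0 PC%d", hit, i, hit, i)
}

# Join the five terms into one select statement
sql <- paste0(
  "select a.Date, a.V1, a.V2,\n ",
  paste(vapply(1:5, term, character(1)), collapse = ",\n "),
  "\nfrom t a, t b where a.Date > b.Date\ngroup by a.Date"
)

cat(sql)
```

The resulting string can then be passed to sqldf(sql), assuming the sqldf package is loaded and t is the data table defined earlier.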

Update: added the PC columns and some simplifications

Update 2: added additional solutions

answered 2013-06-15T18:13:29.770
1

Here is a start. I see no reason to do this "in one go". It is probably possible. Try it yourself.

library(data.table)
DT = as.data.table(d)

DT[,i:=as.numeric(Date)]
setkey(DT,"i")

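uv <- 1:max(unlist(DT[,2:3]))
# count occurrences per day; sum(... == x) also handles a value repeated within a day
DT[,paste0("C",uv):=lapply(uv,function(x) sum(unlist(.SD) == x)),.SDcols=2:3,by=i]
# shift the running totals down one row so each row only sees earlier days
DT[,paste0("C",uv):=lapply(.SD,function(x) c(NA,head(cumsum(x),-1))),.SDcols=paste0("C",uv)]
# divide by the number of earlier measurements (2 per row)
DT[,paste0("PC",uv):=lapply(.SD,function(x) x/(2*.I-2)),.SDcols=paste0("C",uv)]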

#          Date V1 V2     i C1 C2 C3 C4 C5 PC1 PC2       PC3       PC4       PC5
# 1: 2013-05-04  4  2 15829 NA NA NA NA NA  NA  NA        NA        NA        NA
# 2: 2013-05-09  2  5 15834  0  1  0  1  0   0 0.5 0.0000000 0.5000000 0.0000000
# 3: 2013-05-16  3  2 15841  0  2  0  1  1   0 0.5 0.0000000 0.2500000 0.2500000
# 4: 2013-05-30  1  4 15855  0  3  1  1  1   0 0.5 0.1666667 0.1666667 0.1666667
answered 2013-06-15T17:43:23.310
0

You probably want the %in% operator.

> foo<-sample(1:10,4)
> bar<-sample(1:10,3)
> foo
[1] 5 3 9 6
> bar
[1] 1 7 2
> bar2<-sample(1:10,5)
> bar2
[1] 2 9 4 8 5
> which(bar2%in%foo)
[1] 2 5   #those are the indices of the values in bar2 which appear in foo

> which(bar%in%foo)
 integer(0)
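Note that %in% only tests membership and does not count repeats. A small sketch of the difference, using made-up values (x is illustrative only):

```r
x <- c(2, 5, 2, 3)       # the value 2 occurs twice

2 %in% x                 # TRUE: membership only
sum(x == 2)              # 2: number of occurrences
tabulate(x, nbins = 5)   # 0 2 1 0 1: counts for each of 1..5
```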
answered 2013-06-15T17:05:12.720