这是一个快速的解决方法,它在很大程度上依赖于内部实际发生的事情(使代码有点脆弱 imo)。因为内部NaN
只是一个非常非常负的数字,所以它总是在你的data.table
前面setkey
。我们可以使用该属性来隔离这些条目,如下所示:
# this will give the index of the first element that is *not* NaN
my.dt[J(-.Machine$double.xmax), roll = -Inf, which = T]
# this is equivalent to my.dt[!is.nan(x)], but much faster
my.dt[seq_len(my.dt[J(-.Machine$double.xmax), roll = -Inf, which = T] - 1)]
这是里卡多样本数据的基准:
my.dt <- as.data.table(replicate(20, sample(100, 1e5, TRUE)))
setnames(my.dt, 1, "ID")
my.dt[sample(1e5, 1e3), ID := NA]
setkey(my.dt, ID)
# NOTE: I have to use integer max here - because this example has integers
# instead of doubles, so I'll just add simple helper function (that would
# likely need to be extended for other cases, but I'm just dealing with the ones here)
minN = function(x) if (is.integer(x)) -.Machine$integer.max else -.Machine$double.xmax
library(microbenchmark)
microbenchmark(normalJ = my.dt[J(1)],
naJ = my.dt[seq_len(my.dt[J(minN(ID)), roll = -Inf, which = T] - 1)])
#Unit: milliseconds
# expr min lq median uq max neval
# normalJ 1.645442 1.864812 2.120577 2.863497 5.431828 100
# naJ 1.465806 1.689350 2.030425 2.600720 10.436934 100
在我的测试中,以下minN
函数还涵盖了字符和逻辑向量:
minN = function(x) {
if (is.integer(x)) {
-.Machine$integer.max
} else if (is.numeric(x)) {
-.Machine$double.xmax
} else if (is.character(x)) {
""
} else if (is.logical(x)) {
FALSE
} else {
NA
}
}
你会想要添加mult = 'first'
,例如:
my.dt[seq_len(my.dt[J(minN(colname)), roll = -Inf, which = T, mult = 'first'] - 1)]