r - Exclude data based on the number of non-NA observations per key value

Question

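I have a dataset containing monthly observations of returns for US companies. I am trying to exclude from my sample all companies that have fewer than a certain number of non-NA observations.

I managed to do what I want with foreach, but my dataset is very large and this takes a long time. Here is a working example that shows how I accomplished what I want and hopefully makes my goal clear: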

#load required packages
library(data.table)
library(foreach)

#example data
myseries <- data.table(
 X = sample(letters[1:6],30,replace=TRUE),
 Y = sample(c(NA,1,2,3),30,replace=TRUE))

setkey(myseries,"X") #so X is the company identifier

#here I create another data table with each company identifier and its number 
#of non NA observations
nobsmyseries <- myseries[,list(NOBSnona = length(Y[complete.cases(Y)])),by=X]

# then I select the companies which have less than 3 non NA observations
comps <- nobsmyseries[NOBSnona <3,]

#finally I exclude all companies which are in the list "comps", 
#that is, I exclude companies which have less than 3 non NA observations
#but I do for each of the companies in the list, one by one, 
#and this is what makes it slow.

# seq_len() avoids iterating over 1:0 when no company is below the cutoff
for (i in seq_len(nrow(comps))) {
  myseries <- myseries[X != comps$X[i], ]
}

How can I do this more efficiently? Is there a data.table way to get the same result?

score 2 · Accepted Answer

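If you wish to consider more than one column for NA values, then you can use complete.cases(.SD). However, since you want to test a single column, I would suggest something like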

naCases <- myseries[,list(totalNA  = sum(!is.na(Y))),by=X]

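You can then join on the groups whose count of non-NA values meets a given threshold

e.g.

threshold <- 3
# totalNA holds the count of non-NA values, despite its name;
# >= keeps companies with exactly `threshold` of them, matching the question's `NOBSnona < 3` exclusion
myseries[naCases[totalNA >= threshold]]

You can also use a not join to get the cases you have excluded

myseries[!naCases[totalNA >= threshold]]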
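To make the join and the not join concrete, here is a minimal sketch on a small deterministic table. The names toy, counts, kept and dropped are made up for illustration, and the sketch assumes the cutoff rule from the question (keep companies with at least 3 non-NA observations).

```r
library(data.table)

# Hypothetical toy data: "a" has 4 non-NA values, "b" has 2, "c" has 3
toy <- data.table(
  X = rep(c("a", "b", "c"), each = 4),
  Y = c(1, 2, 3, 1,  NA, 2, NA, 1,  1, NA, 2, 3)
)
setkey(toy, X)

# Count non-NA observations per company (column name kept from the answer)
counts <- toy[, list(totalNA = sum(!is.na(Y))), by = X]

threshold <- 3
kept    <- toy[counts[totalNA >= threshold]]   # join: rows of companies a and c
dropped <- toy[!counts[totalNA >= threshold]]  # not join: rows of company b
```

Note that the join result also carries the totalNA column from counts, whereas the not join returns only the columns of toy. Together the two results cover every row of toy exactly once.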

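As noted in the comments, something like

myseries[,totalNA  := sum(!is.na(Y)),by=X][totalNA >= 3]

will also work. In this case, however, you are performing a vector scan over the whole data.table, whereas the previous solution performs the vector scan on a data.table with only length(unique(myseries[['X']])) rows.

Given that this is a single vector scan, it will be efficient regardless (a binary join plus a small vector scan may even be slower than one larger vector scan), and I doubt there will be much difference either way.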
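For the multi-column case mentioned at the start of this answer, a sketch along the following lines may help. It assumes a second value column Z, which is not part of the question, and the names toy2, completeCounts and kept2 are hypothetical.

```r
library(data.table)

# Hypothetical data with two value columns; Z does not exist in the question
toy2 <- data.table(
  X = rep(c("a", "b"), each = 4),
  Y = c(1, 2, 3, 4,   1, NA, 2, 3),
  Z = c(1, 2, 3, NA,  NA, 1, 2, NA)
)
setkey(toy2, X)

# A row counts as complete only if both Y and Z are non-NA
completeCounts <- toy2[, list(nComplete = sum(complete.cases(.SD))),
                       by = X, .SDcols = c("Y", "Z")]

threshold <- 3
# Keep companies with at least `threshold` complete rows (only "a" here)
kept2 <- toy2[completeCounts[nComplete >= threshold]]
```

Restricting .SD with .SDcols keeps the completeness check limited to the columns of interest, which matters if the table has other columns that may legitimately contain NA.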

score 2 · Accepted Answer

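How about aggregating the number of non-NA values in Y by X, and then subsetting?

# Aggregate the number of non-NA values per company.
# Note: the formula interface drops NA rows first (na.action = na.omit),
# so companies with no non-NA values do not appear in num_nas at all.
num_nas <- as.data.table(aggregate(Y ~ X, data = myseries, FUN = function(x) sum(!is.na(x))))

# Subset: keep only companies with at least 3 non-NA observations
myseries[X %in% num_nas$X[num_nas$Y >= 3], ]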
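One detail of the aggregate approach is easy to miss: the formula method of aggregate uses na.action = na.omit by default, so rows with NA are removed before FUN is called and a company with no non-NA values is absent from the result instead of being reported with a count of 0. The sketch below illustrates this on made-up data (toy3 and count_non_na are hypothetical names) and shows na.action = na.pass as one way to keep such companies.

```r
# Hypothetical toy data: company "b" has no non-NA observations at all
toy3 <- data.frame(
  X = rep(c("a", "b"), each = 3),
  Y = c(1, 2, 3, NA, NA, NA)
)

count_non_na <- function(x) sum(!is.na(x))

# Default na.action = na.omit removes NA rows first, so "b" disappears
default_counts <- aggregate(Y ~ X, data = toy3, FUN = count_non_na)

# na.action = na.pass keeps NA rows, so "b" is reported with a count of 0
passed_counts <- aggregate(Y ~ X, data = toy3, FUN = count_non_na,
                           na.action = na.pass)
```

For filtering by a minimum count this difference is harmless when you select the companies to keep, since a missing company can never meet the cutoff. It does matter if you instead build the list of companies to exclude from the aggregated table.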
