r - 'R', 'mice', missing variable imputation - how to only do one column in sparse matrix

Question

I have a data set that is half empty: half of all cells are missing (NA), so when I run 'mice' it tries to impute all of them. I'm only interested in a subset of the columns.

Question: In the following code, how do I make "mice" only operate on the first two columns? Is there a clean way to do this using row-lag or row-lead, so that the content of the previous row can help patch holes in the current row?

library(mice)

set.seed(1)

#domain
x <- seq(from=0,to=10,length.out=1000)

#ranges
y <- sin(x) +sin(x/2) + rnorm(n = length(x))
y2 <- sin(x) +sin(x/2) + rnorm(n = length(x))

#kill 50% of cells
idx_na1 <- sample(x=1:length(x),size = length(x)/2)
y[idx_na1] <- NA

#kill more cells
idx_na2 <- sample(x=1:length(x),size = length(x)/2)
y2[idx_na2] <- NA

#assemble base data
my_data <- data.frame(x,y,y2)

#make the rest of the data
for (i in 4:50){  #start at column 4 so that y2 (column 3) is not overwritten


     my_data[,i] <- rnorm(n = length(x))
     idx_na2 <- sample(x=1:length(x),size = length(x)/2)
     my_data[idx_na2,i] <- NA

}

#imputation
est <- mice(my_data)

data2 <- complete(est)

str(data2[,1:3])

Places that I have looked for answers:

help document (link)
google of course...
https://stats.stackexchange.com/questions/99334/fast-missing-data-imputation-in-r-for-big-data-that-is-more-sophisticated-than-s

score 8 · Accepted Answer

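I think what you are looking for can be done by modifying the "where" argument of the mice function. The "where" argument is a matrix (or data frame) with the same dimensions as the dataset you are imputing. By default, "where" equals is.na(data): a cell is TRUE when the value is missing in the dataset and FALSE otherwise. This means that, by default, every missing value in the dataset is imputed. If you want to change this and impute only the values in a specific column (column 2 in my example), you can do the following: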

# Define arbitrary matrix with TRUE values when data is missing and FALSE otherwise
A <- is.na(data)
# Replace all the other columns which are not the one you want to impute (let say column 2)
A[,-2] <- FALSE 
# Run the mice function
imputed_data <- mice(data, where = A)
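
As a small self-contained sketch of the same idea (the data frame `toy` and its column names are made up for illustration, and `printFlag` and `seed` are only set to keep the run quiet and reproducible), the following restricts imputation to one column and counts the missing values that remain. One caveat worth checking on your version of mice: cells of the target column whose predictors are themselves missing and not imputed may be left as NA.

```r
library(mice)

set.seed(1)
n <- 200

# Hypothetical toy data: x is complete, y is the column of interest,
# z1 and z2 are incomplete columns we do not want imputed
toy <- data.frame(x = seq(0, 10, length.out = n))
toy$y  <- sin(toy$x) + rnorm(n)
toy$z1 <- rnorm(n)
toy$z2 <- rnorm(n)
toy$y[sample(n, n / 2)]  <- NA
toy$z1[sample(n, n / 2)] <- NA
toy$z2[sample(n, n / 2)] <- NA

# TRUE marks a cell that mice is allowed to impute
where_mat <- is.na(toy)
where_mat[, colnames(toy) != "y"] <- FALSE

imp <- mice(toy, where = where_mat, printFlag = FALSE, seed = 1)
filled <- complete(imp)

# Missing values per column, before and after
colSums(is.na(toy))
colSums(is.na(filled))
```

Selecting the column by name rather than by position keeps the `where` matrix correct if the column order changes.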

score 3 · Accepted Answer

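Instead of the where argument, a faster approach may be to use the method argument. You can set this argument to "" for the columns/variables you want to skip. The drawback is that the method is then no longer determined automatically. So: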

imp <- mice(data,
            method = ifelse(colnames(data) == "your_var", "logreg", ""))

But you can get the default methods from the documentation:

defaultMethod

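... By default, the method uses pmm, predictive mean matching (numeric data); logreg, logistic regression imputation (binary data, factor with 2 levels); polyreg, polytomous regression imputation for unordered categorical data (factor > 2 levels); polr, proportional odds model (ordered, > 2 levels).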
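
One way to avoid looking up the defaults by hand, sketched here under the assumption that your mice version is 3.0 or later and so exports `make.method()`, is to let mice build the default method vector and then blank out the entries you want skipped. The data frame `toy` and its columns are hypothetical.

```r
library(mice)

set.seed(2)
n <- 200

# Hypothetical toy data: x complete, y and z numeric with missing values
toy <- data.frame(x = seq(0, 10, length.out = n),
                  y = rnorm(n),
                  z = rnorm(n))
toy$y[sample(n, 60)] <- NA
toy$z[sample(n, 60)] <- NA

# Default method per column, chosen by mice from the column types
meth <- make.method(toy)

# Keep the default for y only; "" tells mice to skip a column
meth[names(meth) != "y"] <- ""

imp <- mice(toy, method = meth, printFlag = FALSE, seed = 2)
filled <- complete(imp)

meth
colSums(is.na(filled))
```

This keeps the automatic choice of method for the column you care about while still skipping the rest.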

score 1 · Accepted Answer

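Your question isn't entirely clear to me. Are you saying you want to operate on only two columns? In that case mice(my_data[,1:2]) will work. Or do you want to use all of the data but only fill in the missing values for certain columns? For that, I would just create an indicator matrix along the following lines: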

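library(mice)

# Logical indicator of the originally missing cells
isNA <- as.data.frame(is.na(my_data))
est <- mice(my_data)
data2 <- complete(est)

# Restore the original NAs in every column except the ones of interest (1:2 here)
other <- setdiff(seq_along(data2), 1:2)
data2[other] <- Map(function(x, isna) {
  x[isna] <- NA
  x
}, data2[other], isNA[other])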

As for your last question, "Can I use mice for rolling data imputation?", I believe the answer is no. But you should double-check the documentation.
