我目前正在尝试通过高斯混合模型来估算丢失的数据。我的参考论文来自这里: http: //mlg.eng.cam.ac.uk/zoubin/papers/nips93.pdf
我目前专注于具有 2 个高斯分量的双变量数据集。这是定义每个高斯分量权重的代码:
myData = faithful[,1:2]; # the data matrix
for (i in (1:N)) {
prob1 = pi1*dmvnorm(na.exclude(myData[,1:2]),m1,Sigma1); # probabilities of sample points under model 1
prob2 = pi2*dmvnorm(na.exclude(myData[,1:2]),m2,Sigma2); # same for model 2
Z<-rbinom(no,1,prob1/(prob1 + prob2 )) # Z is latent variable as to assign each data point to the particular component
pi1<-rbeta(1,sum(Z)+1/2,no-sum(Z)+1/2)
if (pi1>1/2) {
pi1<-1-pi1
Z<-1-Z
}
}
这是我定义缺失值的代码:
> whichMissXY<-myData[ which(is.na(myData$waiting)),1:2]
> whichMissXY
eruptions waiting
11 1.833 NA
12 3.917 NA
13 4.200 NA
14 1.750 NA
15 4.700 NA
16 2.167 NA
17 1.750 NA
18 4.800 NA
19 1.600 NA
20 4.250 NA
我的限制是,如何根据特定组件在“等待”变量中估算缺失的数据。这段代码是我第一次尝试使用条件均值插补来插补缺失数据。我知道,这绝对是错误的方式。结果不会对特定组件说谎并产生异常值。
miss.B2 <- which(is.na(myData$waiting))
for (i in miss.B2) {
myData[i, "waiting"] <- m1[2] + ((rho * sqrt(Sigma1[2,2]/Sigma1[1,1])) * (myData[i, "eruptions"] - m1[1] ) + rnorm(1,0,Sigma1[2,2]))
#print(miss.B[i,])
}
如果有人可以就如何改进可以通过高斯混合模型处理潜在/隐藏变量的插补技术提供任何建议,我将不胜感激。先感谢您