gaussian - 具有高斯混合模型的潜在变量来估算缺失数据

Question

我目前正在尝试通过高斯混合模型来估算丢失的数据。我的参考论文来自这里： http: //mlg.eng.cam.ac.uk/zoubin/papers/nips93.pdf

我目前专注于具有 2 个高斯分量的双变量数据集。这是定义每个高斯分量权重的代码：

myData = faithful[,1:2];    # the data matrix  
for (i in (1:N)) {
        prob1 = pi1*dmvnorm(na.exclude(myData[,1:2]),m1,Sigma1);   # probabilities of sample points under model 1
        prob2 = pi2*dmvnorm(na.exclude(myData[,1:2]),m2,Sigma2);   # same for model 2
        Z<-rbinom(no,1,prob1/(prob1 + prob2 ))    # Z is latent variable as to assign each data point to the particular component 

        pi1<-rbeta(1,sum(Z)+1/2,no-sum(Z)+1/2)
        if (pi1>1/2) {
          pi1<-1-pi1
          Z<-1-Z
        }
      }

这是我定义缺失值的代码：

> whichMissXY<-myData[ which(is.na(myData$waiting)),1:2]
> whichMissXY
   eruptions waiting
11     1.833      NA
12     3.917      NA
13     4.200      NA
14     1.750      NA
15     4.700      NA
16     2.167      NA
17     1.750      NA
18     4.800      NA
19     1.600      NA
20     4.250      NA

我的限制是，如何根据特定组件在“等待”变量中估算缺失的数据。这段代码是我第一次尝试使用条件均值插补来插补缺失数据。我知道，这绝对是错误的方式。结果不会对特定组件说谎并产生异常值。

miss.B2 <- which(is.na(myData$waiting))
for (i in miss.B2) {
    myData[i, "waiting"] <- m1[2] + ((rho * sqrt(Sigma1[2,2]/Sigma1[1,1])) * (myData[i, "eruptions"] - m1[1] ) + rnorm(1,0,Sigma1[2,2]))
    #print(miss.B[i,])  
  }

如果有人可以就如何改进可以通过高斯混合模型处理潜在/隐藏变量的插补技术提供任何建议，我将不胜感激。先感谢您

score 0 · Accepted Answer

这是一种协方差结构的解决方案。

devtools::install_github("alexwhitworth/emclustr")
library(emclustr)
data(faithful)
set.seed(23414L)
ff <- apply(faithful, 2, function(j) {
  na_idx <- sample.int(length(j), 50, replace=F)
  j[na_idx] <- NA
  return(j)
})
ff2 <- em_clust_mvn_miss(ff, nclust=2)

# hmm... seems I don't return the imputed values. 
# note to self to update the code    
plot(faithful, col= ff2$mix_est)

和参数输出

$it
[1] 27

$clust_prop
[1] 0.3955708 0.6044292

$clust_params
$clust_params[[1]]
$clust_params[[1]]$mu
[1]  2.146797 54.833431

$clust_params[[1]]$sigma
[1] 13.41944


$clust_params[[2]]
$clust_params[[2]]$mu
[1]  4.317408 80.398192

$clust_params[[2]]$sigma
[1] 13.71741

gaussian - 具有高斯混合模型的潜在变量来估算缺失数据

1 回答 1

Related

Reference