r - Generate samples from a mixture of two Gaussians in R (code given in MATLAB)

Question

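I'm trying to create (in R) the equivalent of the following MATLAB function, which generates n samples from a mixture of N(m1,(s1)^2) and N(m2,(s2)^2), with a fraction, alpha, coming from the first Gaussian.

I have a start, but the results differ notably between MATLAB and R (i.e., the MATLAB results occasionally give values of +-8, but the R version never even gives values of +-5). Please help me sort out what is wrong here. Thanks :-)

For example: plot 1000 samples from a mixture of N(0,1) and N(0,36), with 95% of the samples from the first Gaussian. Normalize the samples to mean zero and standard deviation one.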

MATLAB

Function

function y = gaussmix(n,m1,m2,s1,s2,alpha)
y = zeros(n,1);
U = rand(n,1);
I = (U < alpha)
y = I.*(randn(n,1)*s1+m1) + (1-I).*(randn(n,1)*s2 + m2);

Execution

P = gaussmix(1000,0,0,1,6,.95)
P = (P-mean(P))/std(P)
plot(P)
axis([0 1000 -15 15])
hist(P)
axis([-15 15 0 1000])

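Resulting plot

[Image: plot of samples randomly generated from a mixture of two Gaussians in MATLAB]

Resulting histogram

[Image: histogram of samples randomly generated from a mixture of two Gaussians in MATLAB]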

R

yn <- rbinom(1000, 1, .95)
s <- rnorm(1000, 0 + 0*yn, 1 + 36*yn)
sn <- (s-mean(s))/sd(s)
plot(sn, xlim=range(0,1000), ylim=range(-15,15))
hist(sn, xlim=range(-15,15), ylim=range(0,1000))

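Resulting plot

[Image: plot of samples randomly generated from a mixture of two Gaussians in R]

Resulting histogram

[Image: histogram of samples randomly generated from a mixture of two Gaussians in R]

Thanks, as always!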

Solution

gaussmix <- function(nsim,mean_1,mean_2,std_1,std_2,alpha){
   U <- runif(nsim)
   I <- as.numeric(U<alpha)
   y <- I*rnorm(nsim,mean=mean_1,sd=std_1)+
       (1-I)*rnorm(nsim,mean=mean_2,sd=std_2)
   return(y)
}

z1 <- gaussmix(1000,0,0,1,6,0.95)
z1_standardized <- (z1-mean(z1))/sqrt(var(z1))
z2 <- gaussmix(1000,0,3,1,1,0.80)
z2_standardized <- (z2-mean(z2))/sqrt(var(z2))
z3 <- rlnorm(1000)
z3_standardized <- (z3-mean(z3))/sqrt(var(z3))

par(mfrow=c(2,3))
hist(z1_standardized,xlim=c(-10,10),ylim=c(0,500),
   main="Histogram of 95% of N(0,1) and 5% of N(0,36)",
   col="blue",xlab=" ")
hist(z2_standardized,xlim=c(-10,10),ylim=c(0,500),
   main="Histogram of 80% of N(0,1) and 20% of N(3,1)",
   col="blue",xlab=" ")
hist(z3_standardized,xlim=c(-10,10),ylim=c(0,500),
   main="Histogram of samples of LN(0,1)",col="blue",xlab=" ")
##
plot(z1_standardized,type='l',
   main="1000 samples from a mixture N(0,1) and N(0,36)",
   col="blue",xlab="Samples",ylab="Mean",ylim=c(-10,10))
plot(z2_standardized,type='l',
   main="1000 samples from a mixture N(0,1) and N(3,1)",
   col="blue",xlab="Samples",ylab="Mean",ylim=c(-10,10))
plot(z3_standardized,type='l',
  main="1000 samples from LN(0,1)",
   col="blue",xlab="Samples",ylab="Mean",ylim=c(-10,10))

score 6 · Accepted Answer

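I think there are two problems: (1) your R code is creating a mixture of normal distributions with standard deviations 1 and 37, because the sd argument is 1 + 36*yn; (2) by setting prob equal to alpha in your rbinom() call, you get a fraction alpha in the second mode rather than the first. So what you are getting is a distribution that is mostly a Gaussian with sd 37, contaminated by a 5% mixture of a Gaussian with sd 1, rather than a Gaussian with sd 1 contaminated by a 5% mixture of a Gaussian with sd 6. Scaling by the standard deviation of the mixture (about 36.1) basically reduces it to a standard Gaussian with a slight bump near the origin.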
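The diagnosis can be checked numerically. This is a minimal sketch that reuses the expressions from the question; the variable names (`sd_used`, `mix_sd`, `intended_sd`) and the seed are illustrative assumptions, not part of the original post.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
yn <- rbinom(1000, 1, 0.95)      # yn == 1 for about 95% of the draws

# The sd argument used in the question: 1 when yn == 0, 37 when yn == 1
sd_used <- 1 + 36 * yn
table(sd_used)                   # most draws get sd 37, only a few get sd 1

# Theoretical sd of that unintended mixture (both means are 0)
mix_sd <- sqrt(0.95 * 37^2 + 0.05 * 1^2)
mix_sd                           # roughly 36

# What was intended: sd 1 with probability 0.95, sd 6 with probability 0.05
intended_sd <- sqrt(0.95 * 1^2 + 0.05 * 6^2)
intended_sd                      # sqrt(2.75)
```

Dividing by a standard deviation of roughly 36 shrinks almost every draw to the scale of a standard normal, which would explain why the R version never reaches the +-8 seen in MATLAB.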

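(The other answers posted here solve your problem perfectly well, but I thought you might be interested in a diagnosis.)

A more compact (and perhaps more idiomatic) version of your Matlab gaussmix function (I think runif(n)<alpha is slightly more efficient than rbinom(n,size=1,prob=alpha)):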

gaussmix <- function(n,m1,m2,s1,s2,alpha) {
    I <- runif(n)<alpha
    rnorm(n,mean=ifelse(I,m1,m2),sd=ifelse(I,s1,s2))
}
set.seed(1001)
s <- gaussmix(1000,0,0,1,6,0.95)
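As a sanity check on a sampler like this, the sample moments can be compared with the theoretical moments of the mixture. The sketch below repeats the function so that it runs on its own; the helper name `mix_moments` is hypothetical, and the large sample size is chosen only so the moments settle.

```r
gaussmix <- function(n, m1, m2, s1, s2, alpha) {
    I <- runif(n) < alpha
    rnorm(n, mean = ifelse(I, m1, m2), sd = ifelse(I, s1, s2))
}

# Hypothetical helper: theoretical mean and sd of the two-component mixture
mix_moments <- function(m1, m2, s1, s2, alpha) {
    mu <- alpha * m1 + (1 - alpha) * m2
    v  <- alpha * (s1^2 + m1^2) + (1 - alpha) * (s2^2 + m2^2) - mu^2
    c(mean = mu, sd = sqrt(v))
}

set.seed(1001)
s <- gaussmix(100000, 0, 0, 1, 6, 0.95)
theory <- mix_moments(0, 0, 1, 6, 0.95)   # mean 0, sd sqrt(2.75)

c(sample_mean = mean(s), sample_sd = sd(s))
theory

# Standardize as in the question; scale() centers and divides by sd()
sn <- as.numeric(scale(s))
range(sn)   # the wide component shows up as occasional large values
```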

score 2 · Accepted Answer

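Not that you asked for it, but the mclust package offers a way to generalize your problem to more dimensions and different covariance structures. See ?mclust::sim. The example task would be done this way: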

library(mclust)
simdata = sim(modelName = "V",
              parameters = list(pro = c(0.95, 0.05),
                                mean = c(0, 0),
                                variance = list(modelName = "V", 
                                                d = 1, 
                                                G = 2,
                                                sigmasq = c(1, 36))),  # variances of N(0,1) and N(0,36)
              n = 1000)
plot(scale(simdata[,2]), type = "h")

score 1 · Accepted Answer

I recently wrote density and sampling functions for a multinomial mixture of normal distributions:

dmultiNorm <- function(x,means,sds,weights)
{
  if (length(means)!=length(sds)) stop("Length of means must be equal to length of standard deviations")
  N <- length(x)
  n <- length(means)
  if (missing(weights))
  {
    weights <- rep(1,n)  
  }
  if (length(weights)!=n) stop ("Length of weights not equal to length of means and sds")
  weights <- weights/sum(weights)
  dens <- numeric(N)
  for (i in 1:n)
  {
    dens <- dens + weights[i] * dnorm(x,means[i],sds[i])
  }
  return(dens)
}

rmultiNorm <- function(N,means,sds,weights,scale=TRUE)
{
  if (length(means)!=length(sds)) stop("Length of means must be equal to length of standard deviations")
  n <- length(means)
  if (missing(weights))
  {
    weights <- rep(1,n)  
  }
  if (length(weights)!=n) stop ("Length of weights not equal to length of means and sds")

  Res <- numeric(N)
  for (i in 1:N)
  {
    s <- sample(1:n,1,prob=weights)
    Res[i] <- rnorm(1,means[s],sds[s])  
  }
  return(Res)
}

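Here means is a vector of means, sds is a vector of standard deviations, and weights is a vector of proportional probabilities of sampling from each distribution. Is this useful to you?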
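For large N, the per-draw loop in a sampler of this kind can be replaced by a single vectorized call. This is a minimal sketch under the same argument conventions; the name `rmix_vec` is hypothetical, and it is one possible variant rather than the author's code.

```r
rmix_vec <- function(N, means, sds, weights = rep(1, length(means))) {
  stopifnot(length(means) == length(sds), length(weights) == length(means))
  # Draw all component labels at once; sample() rescales prob internally
  comp <- sample(seq_along(means), size = N, replace = TRUE, prob = weights)
  # rnorm() is vectorized over mean and sd, so each draw uses its own component
  rnorm(N, mean = means[comp], sd = sds[comp])
}

set.seed(42)   # arbitrary seed
x <- rmix_vec(1000, means = c(0, 0), sds = c(1, 6), weights = c(0.95, 0.05))
xs <- (x - mean(x)) / sd(x)   # normalize to mean 0 and sd 1
range(xs)
```

Note that sds holds standard deviations, so the N(0,36) component from the question is passed as 6.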

score 1 · Accepted Answer

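Here is code that does this task:

"For example: plot 1000 samples from a mixture of N(0,1) and N(0,36), with 95% of the samples from the first Gaussian. Normalize the samples to mean zero and standard deviation one."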

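 # rnorm()'s third argument is the standard deviation, so the second
 # component here has sd 36; use rnorm(50, 0, 6) for variance 36.
 plot(multG <- c( rnorm(950), rnorm(50, 0, 36))[sample(1000)] , type="h")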
 scmulG <- scale(multG)
 summary(scmulG)
 #-----------    
   V1          
 Min.   :-9.01845  
 1st Qu.:-0.06544  
 Median : 0.03841  
 Mean   : 0.00000  
 3rd Qu.: 0.13940  
 Max.   :12.33107
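The snippet above fixes the component counts at exactly 950 and 50. In a true mixture the counts are themselves random, so a variant can draw the count first. This is a sketch only, with an arbitrary seed and illustrative variable names; it takes N(0,36) to mean variance 36, hence sd 6.

```r
set.seed(7)
n  <- 1000
n1 <- rbinom(1, n, 0.95)            # number of draws from the first component

# Draw each component separately, then shuffle the combined vector
x <- sample(c(rnorm(n1, 0, 1), rnorm(n - n1, 0, 6)))

xs <- as.numeric(scale(x))          # mean 0, sd 1
summary(xs)
```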
