r - Simulating vectors conditional on a custom distribution

Question

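I am measuring the duration of episodes per day (ep.dur, a vector in minutes) over an observation period of T=364 days. The vector ep.dur has length(ep.dur) equal to T=364, is zero on days when no episode occurred, and range(ep.dur) lies between 0 and 1440.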

The sum of the episode durations over the period T is a<-sum(ep.dur)

Now I have a vector den with length(den)=99. The vector den shows how many days are needed for each 1% (1%, 2%, 3%, ...) of a to develop.
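To make the definition concrete, here is a minimal sketch of how den could be derived from a known ep.dur. The helper name den_from_durations and the tolerance tol are hypothetical, and the sketch assumes that time is accumulated from day 1 onwards, which is the reading both answers on this page work with.

```r
# Hypothetical helper (not from the original post): for each whole percent p,
# return the first day on which the cumulative duration reaches p% of a.
# Assumption: durations are accumulated from day 1 onwards.
den_from_durations <- function(ep.dur, a = sum(ep.dur), tol = 1e-9) {
  cum_perc <- cumsum(ep.dur) / a * 100
  vapply(1:99, function(p) which(cum_perc >= p - tol)[1], integer(1))
}

# Toy vector: half of the total on day 1, the other half on day 3
toy <- c(50, 0, 50)
den_toy <- den_from_durations(toy)
den_toy[c(1, 50, 51, 99)]   # days on which 1%, 50%, 51% and 99% are reached
```

The simulation problem is the reverse direction: many different ep.dur vectors map to the same den.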

Now, given den and a, I would like to simulate multiple ep.dur vectors.

Is this possible?

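Clarification 1 (first comment by danas.zuokas): The elements of den represent durations, not exact days. This means that, for Example 1, 1% of a (=1195.08) develops in 1 day, 2% in 2 days, 3% in 3 days, 4% in 4 days, 5% in 5 days, 6% in 5 days, and so on. The episodes can occur anywhere in T.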

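Clarification 2 (second comment by danas.zuokas): Unfortunately, no assumption can be made about how the duration develops. That is why I have to simulate a large number of ep.dur vectors. However, if it is of any help, I can expand the den vector to a finer resolution (i.e., 0.1% jumps instead of 1% jumps).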

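Algorithm description: The algorithm should satisfy all the information provided by the den vector. I imagine the algorithm working as follows (Example 3): Each 1% jump of a is 335.46 minutes. den[1] tells us that 1% of a develops within 1 day. So say we generate ep.dur[1]=335.46. Fine. We move on to den[2]: 2% of a develops within den[2]=1 day. So ep.dur[1] cannot be 335.46 and is rejected (2% of a must also occur within one day). Say ep.dur[1]=1440 is generated. den[1] is satisfied, den[2] is satisfied (at least 2% of the total duration develops within den[2]=1 day), and den[3]=1 is satisfied as well. A keeper? No: den[4]=2 is not satisfied if ep.dur[1]=1440, because it states that 4% of a (=1341.84) should take 2 days to develop. So ep.dur[1] is rejected. Now say ep.dur[1]=1200. den[1:3] are accepted. Then we generate ep.dur[2] and so on, making sure that the generated ep.dur satisfies all the information provided by den.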
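The accept/reject reasoning for the first day can be sketched as follows. The function day1_ok and its tolerance are hypothetical, and the check again assumes that den counts days from day 1; it only looks at the first four entries of den from Example 3.

```r
a <- 33546
den_head <- c(1, 1, 1, 2)            # first four entries of den in Example 3

# Hypothetical check: does a candidate for ep.dur[1] reach exactly those
# percent steps on day 1 that den assigns to day 1?
day1_ok <- function(x, a, den_head, tol = 1e-9) {
  perc <- x / a * 100                             # share of a covered by day 1
  reached <- perc >= seq_along(den_head) - tol    # which 1% steps are reached
  all(reached == (den_head == 1))
}

# The three candidates discussed above: 335.46 and 1440 fail, 1200 passes
candidates <- c(335.46, 1440, 1200)
accepted <- sapply(candidates, day1_ok, a = a, den_head = den_head)
accepted
```

A full simulation would repeat this kind of check for every subsequent day, which is what the answers below avoid by sampling directly within feasible bounds.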

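Is this feasible programmatically? I really do not know where to start with this problem. I will offer a generous bounty once the bounty waiting period is over.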

Example 1:

a<-119508

den<-c(1, 2, 3, 4, 5, 5, 6, 7, 8, 9, 10, 10, 11, 12, 13, 14, 15, 15, 
                16, 17, 18, 19, 20, 20, 21, 22, 23, 24, 25, 25, 26, 27, 28, 29, 
                30, 30, 31, 32, 33, 34, 35, 35, 36, 37, 38, 39, 40, 40, 41, 42, 
                43, 44, 45, 45, 46, 47, 48, 49, 50, 50, 51, 52, 53, 54, 55, 55, 
                56, 57, 58, 59, 60, 60, 61, 62, 63, 64, 65, 65, 66, 67, 68, 69, 
                70, 70, 71, 72, 73, 74, 75, 75, 76, 77, 78, 79, 80, 80, 81, 82, 
                83)

Example 2:

   a<-78624
    den<-c(1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 
    11, 12, 13, 13, 14, 14, 15, 15, 16, 16, 17, 18, 19, 21, 22, 23, 
    28, 32, 35, 36, 37, 38, 43, 52, 55, 59, 62, 67, 76, 82, 89, 96, 
    101, 104, 115, 120, 126, 131, 134, 139, 143, 146, 153, 160, 165, 
    180, 193, 205, 212, 214, 221, 223, 227, 230, 233, 234, 235, 237, 
    239, 250, 253, 263, 269, 274, 279, 286, 288, 296, 298, 302, 307, 
    309, 315, 320, 324, 333, 337, 342, 347, 352)

Example 3:

a<-33546
den<-c(1, 1, 1, 2, 4, 6, 8, 9, 12, 15, 17, 21, 25, 29, 31, 34, 37, 
42, 45, 46, 51, 52, 56, 57, 58, 59, 63, 69, 69, 71, 76, 80, 81, 
87, 93, 95, 102, 107, 108, 108, 112, 112, 118, 123, 124, 127, 
132, 132, 132, 135, 136, 137, 150, 152, 162, 166, 169, 171, 174, 
176, 178, 184, 189, 190, 193, 197, 198, 198, 201, 202, 203, 214, 
218, 219, 223, 225, 227, 238, 240, 246, 248, 251, 254, 255, 257, 
259, 260, 277, 282, 284, 285, 287, 288, 290, 294, 297, 321, 322, 
342)

Example 4:

    a<-198132

den<-c(2, 3, 5, 6, 7, 9, 10, 12, 13, 14, 16, 17, 18, 20, 21, 23, 24, 
    25, 27, 28, 29, 31, 32, 34, 35, 36, 38, 39, 40, 42, 43, 45, 46, 
    47, 49, 50, 51, 53, 54, 56, 57, 58, 60, 61, 62, 64, 65, 67, 68, 
    69, 71, 72, 74, 75, 76, 78, 79, 80, 82, 83, 85, 86, 87, 89, 90, 
    91, 93, 94, 96, 97, 98, 100, 101, 102, 104, 105, 107, 108, 109, 
    111, 112, 113, 115, 116, 120, 123, 130, 139, 155, 165, 172, 176, 
    178, 181, 185, 190, 192, 198, 218)

score 3 · Accepted Answer

As I understand what you are after, I would start by converting den to an rle object. (The data from Example 3 are used here.)

Edit: add 100% at day 364 to den

if(max(den)!=364) den <- c(den, 364)
(rleDen <- rle(den))
# Run Length Encoding
#   lengths: int [1:92] 3 1 1 1 1 1 1 1 1 1 ...    # 92 intervals
#   values : num [1:92] 1 2 4 6 8 9 12 15 17 21 ...
percDur <- rleDen$lengths            # Percentage of total duration in each interval
atDay <- rleDen$values               # What day that percentage was reached
intWidth <- diff(c(0, atDay), k = 1) # Interval width
durPerDay <- 1440                    # Max observation time per day
percPerDay <- durPerDay/a*100        # Max percentage per day
cumPercDur <- cumsum(percDur)        # Cumulative percentage in each interval
maxPerInt <- pmin(percPerDay * diff(c(0, atDay), 1),
  percDur + 1)                       # Max percent observation per interval

set.seed(1)
nsims <- 10                          # Desired number of simulations
sampMat <- matrix(0, ncol = length(percDur), nrow = nsims) # Matrix to hold sim results
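To see what the run-length encoding yields, here is a small illustration on the first six entries of den from Example 3 only. The variable names den_head, int_width and perc_per_day are chosen for this sketch and are not part of the answer's code.

```r
a <- 33546
den_head <- c(1, 1, 1, 2, 4, 6)      # first six entries of den in Example 3

r <- rle(den_head)
r$lengths                            # number of 1% steps reached on each listed day
r$values                             # the days on which those steps are reached

int_width <- diff(c(0, r$values))    # number of days spanned by each interval
perc_per_day <- 1440 / a * 100       # largest share of a one day can hold (%)
perc_per_day
```

Here three percent steps fall on day 1, which is feasible because a single day can hold roughly 4.29% of a.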

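To allow for randomness given the constraint of at most 1440 minutes of observation per day, check whether there are any long intervals (i.e., any interval within which the jump in percentage cannot be fully achieved).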

if(any(percDur > maxPerInt)){
  longDays <- percDur > maxPerInt
  morePerInt <- maxPerInt - percDur
  perEnd <- c(which(diff(longDays,1) < 0), length(longDays))
# Group intervals into periods bounded by "long" days
# and determine if there are any long periods (i.e., where
# the jump in percentage can't be achieved in that period)
  perInd <- rep(seq_along(perEnd), diff(c(0, perEnd)))
  perSums <- tapply(percDur, perInd, sum)
  maxPerPer <- tapply(maxPerInt, perInd, sum)
  longPers <- perSums > maxPerPer
# If there are long periods, determine, starting with the last period, when the
# excess can be covered. Each group of periods is recorded in the persToWatch
# object
  if(any(longPers)) {
    maxLongPer <- perEnd[max(which(longPers))]
    persToWatch <- rep(NA, length(maxLongPer))
    for(kk in rev(seq_len(maxLongPer))) {
      if(kk < maxLongPer && min(persToWatch, na.rm = TRUE) <= kk) next
        theSums <- cumsum(morePerInt[order(seq_len(kk),
          decreasing = TRUE)])
        above0 <- which(rev(theSums) > 0)
        persToWatch[kk] <- max(above0[which(!perInd[above0] %in% c(perInd[kk],
          which(longPers)) & !above0 %in% which(longDays))])
    }
  }
}

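Now we can start the randomness. The first component of the sampling determines the overall proportion of a occurring in each interval. How much? Let runif decide. The upper and lower bounds must reflect the maximum observation time per day as well as the excess from any long days and periods.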

  for(jj in seq_along(percDur[-1])) {
    upperBound <- pmin(sampMat[, jj] + maxPerInt[jj],
      cumPercDur[jj] + 1)
    lowerBound <- cumPercDur[jj]
# If there are long days, determine the interval over which the
# excess observation time may be spread
    if(any(percDur > maxPerInt) && any(which(longDays) >= jj)) {
      curLongDay <- max(which(perInd %in% perInd[jj]))
      prevLongDay <- max(0, min(which(!longDays)[which(!longDays) <= jj]))
      curInt <- prevLongDay : curLongDay
# If there are also long periods, determine how much excess observation time there is
      if(any(longPers) && maxLongPer >= jj) {
        curLongPerHigh <- min(which(!is.na(persToWatch))[
          which(!is.na(persToWatch)) >= jj])
        curLongPerLow <- persToWatch[curLongPerHigh]
        longInt <- curLongPerLow : curLongPerHigh
        curExtra <- max(0,
          cumPercDur[curLongPerHigh] - 
          sum(maxPerInt[longInt[longInt > jj]]) - 
          sampMat[, jj, drop = FALSE])
      } else {
        curExtra <- cumPercDur[curLongDay] - 
          (sum(maxPerInt[curInt[curInt > jj]]) +
          sampMat[, jj, drop = FALSE])
      }
# Set the lower limit for runif appropriately
      lowerBound <- sampMat[, jj, drop = FALSE] + curExtra
    }
# There may be tolerance errors when the observations are tightly
# packed
    if(any(lowerBound - upperBound > 0)) { 
      if(all((lowerBound - upperBound) <= .Machine$double.eps*2*32)) {
        upperBound <- pmax(lowerBound, upperBound)
      } else {
        stop("\nUpper and lower bounds are on the wrong side of each other\n",
          jj,max(lowerBound - upperBound))
      }
    }
    sampMat[, jj + 1] <- runif(nsims, lowerBound, upperBound)
  }
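The core idea of the loop above can be shown in a stripped-down form. This sketch is not the answer's code: it assumes there are no long days or long periods (so no look-ahead is needed), handles a single simulation, and uses only the start of den from Example 3. All variable names are chosen for the illustration.

```r
set.seed(1)
a <- 33546
den_head <- c(1, 1, 1, 2, 4, 6, 8, 9)          # start of den in Example 3
r <- rle(den_head)
perc_dur <- r$lengths                           # 1% steps per interval
cum_perc <- cumsum(perc_dur)                    # cumulative % required per interval
max_per_int <- pmin(1440 / a * 100 * diff(c(0, r$values)), perc_dur + 1)

# Simplified sampler (assumes no "long" intervals): draw the cumulative
# percentage reached at the end of each interval, one interval at a time.
cum_draw <- numeric(length(perc_dur))
prev <- 0
for (j in seq_along(perc_dur)) {
  lower <- max(cum_perc[j], prev)                        # required steps must be reached
  upper <- min(prev + max_per_int[j], cum_perc[j] + 1)   # next step must not be reached
  stopifnot(lower <= upper)
  cum_draw[j] <- runif(1, lower, upper)
  prev <- cum_draw[j]
}
round(cum_draw, 2)
```

When long intervals are present, the lower bound has to be raised so that later intervals are not left with more than they can hold, which is what the persToWatch bookkeeping in the answer is for.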

Then add 100% to the end of the results and calculate the interval-specific percentages

  sampMat2 <- cbind(sampMat[, seq_along(percDur)], 100)
  sampPercDiff <- t(apply(sampMat2, 1, diff, k = 1))

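The second component of the randomness determines how sampPercDiff is distributed over the interval widths intWidth. In my opinion this still needs more thought. For example, how long does a typical episode last compared with the unit of time under consideration?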

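For each interval, determine whether the random percentage needs to be distributed over multiple time units (days, in this case). Edit: changed the following code to add a restriction for the case intWidth > 1.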

library(foreach)
  ep.dur<-foreach(ii = seq_along(intWidth),.combine=cbind)%do%{
    if(intWidth[ii]==1){
      ret<-sampPercDiff[, ii, drop = FALSE] * a / 100
      dimnames(ret)<-list(NULL,atDay[ii])
      ret
    } else {
      theDist<-matrix(numeric(0), ncol = intWidth[ii], nrow = nsims)
      for(jj in seq_len(intWidth[ii]-1)){
        theDist[, jj] <- floor(runif(nsims, 0, pmax(0,
          min(sampPercDiff[, ii], floor(sampMat2[,ii + 1])-.Machine$double.eps -
          sampMat2[,ii]) * a / 100 - rowSums(theDist, na.rm = TRUE))))
      }
      theDist[, intWidth[ii]] <- sampPercDiff[, ii] * a / 100 - rowSums(theDist,
        na.rm = TRUE)
      distOrder <- replicate(nsims, c(sample.int(intWidth[ii] - 1),
        intWidth[ii]), simplify = FALSE)
      ret <- lapply(seq_len(nrow(theDist)), function(x) {
        theDist[x, order(distOrder[[x]])]
      })
      ans <- do.call(rbind, ret)
      dimnames(ans) <- list(NULL, atDay[ii]-((intWidth[ii]:1)-1))
      ans
    }
  }

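A duration is sampled at random for each time unit (day) in the interval over which it is to be distributed. After the total duration has been broken down into daily observation times, these are then randomly assigned to the days in the interval.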
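A minimal sketch of this splitting step for a single interval is shown below. The function split_over_days is hypothetical; it follows the same pattern as the code above (sequential floored uniform draws, remainder on the closing day, shuffle of the other days) but it does not enforce the 1440-minute cap or the percentage restriction from the answer.

```r
# Hypothetical helper: spread `total` minutes over the k days of one interval.
# Note: this simplified version does NOT enforce the 1440-minute daily cap.
split_over_days <- function(total, k) {
  if (k == 1) return(total)
  parts <- numeric(k)
  for (d in seq_len(k - 1)) {
    remaining <- total - sum(parts)
    parts[d] <- floor(runif(1, 0, max(0, remaining)))  # whole minutes for day d
  }
  parts[k] <- total - sum(parts)            # remainder goes to the closing day
  c(parts[sample.int(k - 1)], parts[k])     # shuffle all days except the last
}

set.seed(1)
x <- split_over_days(500, 4)
x
sum(x)
```

Keeping the remainder on the last day of the interval mirrors the answer's distOrder, where the final position is fixed so that the percentage step is still reached on the day den says it is.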

Then multiply the sampled and distributed percentages by a and divide by 100

ep.dur[1, 1 : 6]
#         1         2         3         4         5         6 
# 1095.4475  315.4887    1.0000  578.9200   13.0000  170.6224 

ncol(ep.dur)
# [1] 364

apply(ep.dur, 1, function(x) length(which(x == 0)))
# [1] 131 133 132 117 127 116 139 124 124 129

rowSums(ep.dur)/a
# [1] 1 1 1 1 1 1 1 1 1 1

plot(ep.dur[1, ], type = "h", ylab = "obs time")

(Figure: even newer samples)

score 3 · Accepted Answer

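I would most probably do this with a ruby script, but it can be done in R as well. I am not sure whether this is a homework problem of yours. As for answering your question: can this be done programmatically? Yes, of course!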

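Based on your problem, my solution is to define minimum and maximum limits within which I can randomly pick a percentage that satisfies the conditions given by the den vector and the value of a.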

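Since the den vector only contains values up to 99%, we cannot be sure when 100% will occur. This splits my solution into 3 parts: 1) for the given den vector up to 98%, 2) for 99%, 3) beyond 99%. I could define another function and put the code common to all 3 parts in it, but I have not done so.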

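Since I use the runif command to generate random numbers, it is unlikely to generate the exact lower limit value for a given lower limit. I therefore defined a threshold value to check against: if the draw falls below it, I set it to 0. You can keep it or remove it. Also, when you consider Example 4, the first 1% occurs on the second day. That means day 1 can contain at most 0.999999% of the episodes, and 1% is then reached on day 2. That is why the maximum limit is defined by subtracting smallestdiff, a value that can be changed.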

FindMinutes=function(a,den){
  if (a>1440*364){
    print("Invalid value for a")
    return("Invalid value for a")
  }
  threshold=1E-7
  smallestdiff=1E-6
  sum_perc=0.0
  start=1 #day 1
  min=0 #minimum percentage value for a day
  max=0 #maximum percentage value for a day
  days=rep(c(0),364) #day vector with percentage of minutes - initialized to 0

  maxperc=1440*100/a #maximum percentage wrto 1440 minutes/day

  #############################################################
  #############################################################
  ############ For the length of den vector ###################
  for (i in 1:length(den)){
    if (den[i]>start){   
      min=(i-1)-sum_perc
      for(j in start:(den[i]-1)){#number of days in-between
         if (j>start){ min=0 }
         if (i-smallestdiff-sum_perc>=maxperc){
           max=maxperc
           if ((i-smallestdiff-sum_perc)/(den[i]-j)>=maxperc){
              min=maxperc
           }else{
              if ((i-smallestdiff-sum_perc)/(den[i]-j-1)<maxperc){
                 min=maxperc-(i-smallestdiff-sum_perc)/(den[i]-j-1)
               }else{
                 min=maxperc
               }           
           }
         }else{     
           max=i-smallestdiff-sum_perc
         }  

         if ((r=runif(1,min,max))>=threshold){
            days[j]=r
            sum_perc=sum_perc+days[j]
         }else{
            days[j]=0.0
         }
      }
      start=den[i]
    }
  }
  #############################################################
  #############################################################
  #####################For the 99% ############################
  min=99-sum_perc
  for(j in start:den[length(den)]){
    if (j>start){
           min=0
    }
    max=100-sum_perc
    if (100-sum_perc>=maxperc){
        max=maxperc
        if ((100-sum_perc)/(364+1-j)>=maxperc){
            min=maxperc
        }else{
            if ((100-sum_perc)/(364-j)<maxperc){
               min=maxperc-(100-sum_perc)/(364-j)
            }else{
               min=maxperc
            }           
        }
    }else{
        max=100-sum_perc
    }
    if ((r=runif(1,min,max))>=threshold){
        days[j]=r
        sum_perc=sum_perc+days[j]
    }else{
        days[j]=0.0
    }
  }
  #############################################################
  #############################################################
  ##################### For the remaining 1%###################
  min=0
  for(j in seq_len(364-den[length(den)])+den[length(den)]){ # days after den's last entry, up to 364 (":" binds tighter than "+")
      max=100-sum_perc
      if (j==364){
        min=max
        days[j]=min      
      }else{
        if (100-sum_perc>maxperc){
           max=maxperc
           if ((100-sum_perc)/(364+1-j)>=maxperc){
              min=maxperc
           }else{
              if ((100-sum_perc)/(364-j)<maxperc){
                 min=maxperc-(100-sum_perc)/(364-j)
               }else{
                 min=maxperc
               }           
           }
        }else{
           max=100-sum_perc
        }
        if ((r=runif(1,min,max))>=threshold){
           days[j]=r
        }else{
           days[j]=0.0
        }
    }
    sum_perc=sum_perc+days[j]  
    if (sum_perc>=100.00){
       break
    }  
  }
  return(days*a/100) #return as minutes vector corresponding to each 364 days
}#function
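The roles of threshold and smallestdiff in FindMinutes can be looked at in isolation. The helper draw_day_perc below is hypothetical and deliberately leaves out the maxperc cap and the adjustment of the minimum; it only shows how the upper limit stays just below the next percent step and how tiny draws are set to zero.

```r
# Hypothetical helper isolating the two constants used in FindMinutes.
# It ignores the per-day cap (maxperc) that the full function also applies.
draw_day_perc <- function(min_perc, next_step, sum_perc,
                          threshold = 1e-7, smallestdiff = 1e-6) {
  max_perc <- next_step - smallestdiff - sum_perc  # stay just below the next step
  r <- runif(1, min_perc, max_perc)
  if (r >= threshold) r else 0                     # treat tiny draws as "no episode"
}

set.seed(1)
# Day 1 of Example 4: the 1% step belongs to day 2, so day 1 stays below 1%
p <- draw_day_perc(min_perc = 0, next_step = 1, sum_perc = 0)
p

# Almost nothing left below the next step: the draw falls under the threshold
z <- draw_day_perc(min_perc = 0, next_step = 1, sum_perc = 1 - 1e-6 - 5e-8)
z
```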

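In my code, I randomly generate the percentage of episodes for each day based on the minimum and maximum values. The conditions (the den vector) also still hold when you round the percentage values (the days vector) to integers, but if you need accuracy to several decimal places you may need additional tweaking (this depends on checking the den vector further and then readjusting the minimum percentage values). You can also check that sum(FindMinutes(a,den)) equals a. If you want to define den in steps of 0.1% you can do so, but you will need to change the corresponding equations (in min and max).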

As a worst-case example, take a to be the maximum value it can have, together with the corresponding den vector:

a=1440*364
den<-c(0)
cc=1
for(i in 1:363){
 if (trunc(i*1440*100/(1440*364))==cc){
  den[cc]=i
  cc=cc+1
 }
}

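You can run the example above by calling the function: maxexamplemin=FindMinutes(a,den). You can then check that every day has the maximum of 1440 minutes, which is the only possible scenario here.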
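As a side note, the worst-case den can also be written without a loop. The closed form below is a suggestion for this sketch rather than part of the answer; it is compared against the loop so the equivalence does not have to be taken on trust.

```r
# Loop from the worst-case example above
den_loop <- c(0)
cc <- 1
for (i in 1:363) {
  if (trunc(i * 1440 * 100 / (1440 * 364)) == cc) {
    den_loop[cc] <- i
    cc <- cc + 1
  }
}

# Vectorised alternative: with 1440 minutes every day, p% of the total is
# reached on the first day d with d / 364 >= p / 100
den_vec <- ceiling((1:99) * 364 / 100)

length(den_loop)
all(den_loop == den_vec)
```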

As an illustration, let me run your Example 3:

a<-33546
den<-c(1, 1, 1, 2, 4, 6, 8, 9, 12, 15, 17, 21, 25, 29, 31, 34, 37, 42, 45, 46, 51, 52, 56, 57, 58, 59, 63, 69, 69, 71, 76, 80, 81, 87, 93, 95, 102, 107, 108, 108, 112, 112, 118, 123, 124, 127, 132, 132, 132, 135, 136, 137, 150, 152, 162, 166, 169, 171, 174, 176, 178, 184, 189, 190, 193, 197, 198, 198, 201, 202, 203, 214, 218, 219, 223, 225, 227, 238, 240, 246, 248, 251, 254, 255, 257, 259, 260, 277, 282, 284, 285, 287, 288, 290, 294, 297, 321, 322, 342)
rmin=FindMinutes(a,den)
sum(rmin)
[1] 33546
rmin2=FindMinutes(a,den)
rmin3=FindMinutes(a,den)
plot(rmin,type="h")
par(new=TRUE)
plot(rmin2,col="red",type="h")
par(new=TRUE)
plot(rmin3,col="red",type="h")

The 3 overlaid plots look as follows: (Figure: overlaid plots of 3 simulations of Example 3)
