r - Sample with a max

Question

If I want to sample numbers to create a vector, I do this:

set.seed(123)
x <- sample(1:100,200, replace = TRUE)
sum(x)
# [1] 10228

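Now suppose I want to sample 20 random numbers that sum to 100, and then 30 numbers that still sum to 100. I imagine this is more challenging than it looks. Neither ?sample nor searching Google has given me a clue. A loop that samples and then rejects the draw if the sum is not close enough to the target (e.g. within 5) could take some time, I guess.

Is there a better way to achieve this?

An example would be: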

foo(10,100) # ten random numbers that sum to 100. (not including zeros)
# 10,10,20,7,8,9,4,10,2,20

score 4 · Accepted Answer

An attempt using R

# Config
n <- 20L
target <- 100L
vec <- seq(100)
set.seed(123)

# R repeat loop
sumto_repeat <- function(vec,n,target) {
  res <- integer()
  repeat {
    cat("begin:",sum(res),length(res),"\n")
    res <- c( res, sample(vec,1) )
    if( sum(res)<target & length(res)==(n-1) ) {
      res[length(res)+1] <- target - sum(res)
    }
    # cat("mid:",sum(res),length(res),"\n")
    if(sum(res)>target) res <- res[-length(res)]
    if( length(res)>n | length(res)<n & sum(res)==target ) {
      res <- res[-sample(seq(length(res)),1)]
    }
    # cat("end:",sum(res),length(res),"\n")
    # cat(dput(res),"\n")
    if( sum(res)==target & length(res)==n ) break
  }
  res
}

test <- sumto_repeat(vec=vec,n=n,target=target)
> sum(test)
[1] 100
> length(test)
[1] 20

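Also, I would give some thought to the distribution you want to draw from. I think there are several different ways to make n elements sum exactly to target (for example, you could make the last element always be target - sum(res)), and these may or may not have different distributional implications.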
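
To make the distributional point concrete, here is a minimal sketch of the "force the last element" idea. The helper name fill_last and the choice of pool = 1:5 are assumptions made for illustration only; they are not part of the answer's code. The pool is kept small so that the forced last value is always positive.

```r
# Hypothetical helper: draw n - 1 values freely, then force the last one
# so that the whole vector sums to `target`.
fill_last <- function(n, target, pool) {
  head_vals <- sample(pool, n - 1, replace = TRUE)
  c(head_vals, target - sum(head_vals))
}

set.seed(123)
# Each column of `draws` is one sampled vector of length 20.
draws <- replicate(1000, fill_last(n = 20, target = 100, pool = 1:5))

all(colSums(draws) == 100)   # every vector hits the target exactly
rowMeans(draws)[c(1, 20)]    # average of the first vs. the last position
```

Every vector sums to 100, but the positions are not exchangeable: the first 19 values average about 3 each, so the last one has to average about 100 - 19 * 3 = 43. Whether that matters depends on what the numbers are used for.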

A very similar algorithm in Rcpp, for speeeeed!

cpp_src <- '
Rcpp::IntegerVector xa = clone(x); // Vector to be sampled
Rcpp::IntegerVector na(n); // Number of elements in solution
Rcpp::IntegerVector sa(s); // Sum of solution

int nsampled;
int currentSum;
int dropRandomIndex;
int numZeroes;
Rcpp::IntegerVector remainingQuantity(1);
int maxAttempts = 100;

// Create container for our results
Rcpp::IntegerVector res(maxAttempts);
std::fill( res.begin(), res.end(), NA_INTEGER );

// Calculate min/max so that we can draw random integers from within range
Rcpp::IntegerVector::iterator mn = std::min_element(xa.begin(), xa.end()) ;
Rcpp::IntegerVector::iterator mx = std::max_element(xa.begin(), xa.end()) ;
std::cout << "mx = " << *mx << std::endl;

// Now draw repeatedly
nsampled = 0;
for( int i = 0; i < maxAttempts; i++ ) {
  std::cout << "\\n" << i;
  int r = *mn + (rand() % (int)(*mx - *mn + 1));
  res[i] = xa[r+1];
  // Calculate n and s for current loop iteration
  numZeroes = 0;
  for( int j = 0; j < maxAttempts; j++) 
    if(res[j]==0) numZeroes++;
  std::cout << " nz= " << numZeroes ;
  nsampled = maxAttempts - sum( is_na(res) ) - numZeroes - 1;
  currentSum = std::accumulate(res.begin(),res.begin()+i,0); // Cant just use Rcpp sugar sum() here because it freaks at the NAs
  std::cout << " nsamp= " << nsampled << " sum= " << currentSum;
  if(nsampled == na[0]-1) {  
    std::cout << " One element away. ";
    remainingQuantity[0] = sa[0] - currentSum;
    std::cout << "remainingQuantity = " << remainingQuantity[0];
    if( (remainingQuantity[0] > 0) && (remainingQuantity[0]) < *mx ) {
      std::cout << "Within range.  Prepare the secret (cheating) weapon!\\n";
      std::cout << sa[0] << " ";
      std::cout << currentSum << " ";
      std::cout << remainingQuantity[0] << std::endl;
      if( i != maxAttempts ) {
        std::cout << "Safe to add one last element on the end.  Doing so.\\n";
        res[i] = remainingQuantity[0];
      }
      currentSum = sa[0];
      nsampled++;
      if(nsampled == na[0] && currentSum == sa[0]) std::cout << "It should end after this...nsamp= " << nsampled << " and currentSum= " << currentSum << std::endl;
      break;
    } else {
      std::cout << "Out of striking distance.  Dropping random element\\n";
      dropRandomIndex = 0 + (rand() % (int)(i - 0 + 1));
      res[dropRandomIndex] = 0;
    }
  }
  if(nsampled == na[0] && currentSum == sa[0]) {
      std::cout << "Success!\\n";
      for(int l = 0; l <= i+1; l++) 
        std::cout << res[l] << " " ;
      break;
  }
  if(nsampled == na[0] && currentSum != sa[0]) {
    std::cout << "Reached number of elements but sum is ";
    if(currentSum > sa[0]) {
      std::cout << "Too high. Blitz everything and start over!\\n";
      for(int k = 0; k < res.size(); k++) {
        res[k] = NA_INTEGER;
      }
    } else {
      std::cout << "Too low.  \\n";

    }
  }
  if( nsampled < na[0] && currentSum >= sa[0] ) {
    std::cout << "Too few elements but at or above the sum cutoff.  Dropping a random element and trying again.\\n";
    dropRandomIndex = 0 + (rand() % (int)(i - 0 + 1));
    res[dropRandomIndex] = 0;
  }
}
return res;
'

library(inline)  # provides cxxfunction(); plugin="Rcpp" also needs the Rcpp package installed

sumto <- cxxfunction( signature(x="integer", n="integer", s="integer"), body=cpp_src, plugin="Rcpp", verbose=TRUE )

testresult <- sumto(x=x, n=20L, s=1000L)
testresult <- testresult[!is.na(testresult)]
testresult <- testresult[testresult!=0]
testresult
cumsum(testresult)
length(testresult)

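I tried a few different values, and it produces valid answers unless it runs away. One caveat: if it is one element short of the required count and within "striking distance", it cheats. That is, if the remaining amount is a valid value, it computes the last element instead of drawing it.

Benchmarks

For the comparison code, see the gist.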

score 3 · Accepted Answer

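Here is another attempt. It doesn't use sample; it uses runif instead. I've added an optional "message" to the output that shows the sum, which is triggered with the showSum argument. There is also a Tolerance argument that specifies how close to the target the sum needs to be.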

SampleToSum <- function(Target = 100, VecLen = 10, 
                        InRange = 1:100, Tolerance = 2, 
                        showSum = TRUE) {
  Res <- vector()
  while ( TRUE ) {
    Res <- round(diff(c(0, sort(runif(VecLen - 1)), 1)) * Target)
    if ( all(Res > 0)  & 
         all(Res >= min(InRange)) &
         all(Res <= max(InRange)) &
         abs((sum(Res) - Target)) <= Tolerance ) { break }
  }
  if (isTRUE(showSum)) cat("Total = ", sum(Res), "\n")
  Res
}

Here are some examples.

Note the difference between the default settings and setting Tolerance = 0:

set.seed(1)
SampleToSum()
# Total =  101 
#  [1] 20  6 11 20  6  3 24  1  4  6
SampleToSum(Tolerance=0)
# Total =  100 
#  [1] 19 15  4 10  1 11  7 16  4 13

You can verify this behavior with replicate. Here is the result of setting Tolerance = 0 and running the function 5 times.

system.time(output <- replicate(5, SampleToSum(
  Target = 1376,
  VecLen = 13,
  InRange = 10:200,
  Tolerance = 0)))
# Total =  1376 
# Total =  1376 
# Total =  1376 
# Total =  1376 
# Total =  1376 
#    user  system elapsed 
#   0.144   0.000   0.145
output
#       [,1] [,2] [,3] [,4] [,5]
#  [1,]   29   46   11   43  171
#  [2,]  103  161  113  195  197
#  [3,]  145  134   91  131  147
#  [4,]  154  173  138   19   17
#  [5,]  197   62  173   11   87
#  [6,]  101  142   87  173   99
#  [7,]  168   61   97   40  121
#  [8,]  140  121   99  135  117
#  [9,]   46   78   31  200   79
# [10,]  140  168  146   17   56
# [11,]   21  146  117  182   85
# [12,]   63   30  180  179   78
# [13,]   69   54   93   51  122

The same goes for setting Tolerance = 5 and running the function 5 times.

system.time(output <- replicate(5, SampleToSum(
  Target = 1376,
  VecLen = 13,
  InRange = 10:200,
  Tolerance = 5)))
# Total =  1375 
# Total =  1376 
# Total =  1374 
# Total =  1374 
# Total =  1376 
#    user  system elapsed 
#   0.060   0.000   0.058 
output
#       [,1] [,2] [,3] [,4] [,5]
#  [1,]   65  190  103   15   47
#  [2,]  160   95   98  196  183
#  [3,]  178  169  134   15   26
#  [4,]   49   53  186   48   41
#  [5,]  104   81  161  171  180
#  [6,]   54  126   67  130  182
#  [7,]   34  131   49  113   76
#  [8,]   17   21  107   62   95
#  [9,]  151  136  132  195  169
# [10,]  194  187   91  163   22
# [11,]   23   69   54   97   30
# [12,]  190   14  134   43  150
# [13,]  156  104   58  126  175

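Not surprisingly, setting the tolerance to 0 makes the function slower.

Speed (or lack thereof)

Note that since this is a "random" process, it is hard to guess how long it will take to find the right combination of numbers. For example, using set.seed(123), I ran the following test 3 times in a row: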

system.time(SampleToSum(Target = 1163,
                        VecLen = 15,
                        InRange = 50:150))

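The first run took just over 9 seconds. The second took just over 7.5 seconds. The third took... just under 381 seconds! That's a lot of variation!

Out of curiosity, I added a counter to the function, and the first run needed 55026 attempts to arrive at a vector that satisfied all our conditions! (I didn't bother checking the second and third runs.)

It would probably be good to add some error or sanity checks to the function to make sure the input is reasonable. For example, one should not be able to call SampleToSum(Target = 100, VecLen = 10, InRange = 15:50): with a range of 15 to 50 and 10 values in the vector, the smallest possible sum is 10 * 15 = 150, so there is no way to reach 100.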
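
As a rough sketch of what such checks could look like, the variant below validates feasibility up front and counts attempts, giving up after a fixed budget. The name SampleToSumChecked, the MaxTries argument and the "tries" attribute are assumptions introduced here for illustration; the sampling step itself is the same rejection loop as in SampleToSum.

```r
# Hypothetical variant of SampleToSum with input checks and an attempt counter.
SampleToSumChecked <- function(Target = 100, VecLen = 10,
                               InRange = 1:100, Tolerance = 2,
                               MaxTries = 1e5) {
  lo <- min(InRange)
  hi <- max(InRange)
  # VecLen values in [lo, hi] can only sum to something in
  # [VecLen * lo, VecLen * hi]; widen that by the tolerance.
  if (Target + Tolerance < VecLen * lo || Target - Tolerance > VecLen * hi) {
    stop("Target cannot be reached with VecLen values drawn from InRange")
  }
  for (tries in seq_len(MaxTries)) {
    # Same proposal as SampleToSum: scaled, rounded spacings of sorted uniforms.
    Res <- round(diff(c(0, sort(runif(VecLen - 1)), 1)) * Target)
    if (all(Res >= max(lo, 1)) && all(Res <= hi) &&
        abs(sum(Res) - Target) <= Tolerance) {
      attr(Res, "tries") <- tries   # record how many attempts were needed
      return(Res)
    }
  }
  stop("No valid vector found in ", MaxTries, " attempts")
}

set.seed(1)
ok <- SampleToSumChecked()
attr(ok, "tries")
```

With these checks, SampleToSumChecked(Target = 100, VecLen = 10, InRange = 15:50) stops immediately with an error instead of looping forever, and a run that is feasible but very unlikely fails after MaxTries attempts rather than running for minutes.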

score 3 · Accepted Answer

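Assuming you want integers (if not, look at the Dirichlet distribution), this can be thought of as a balls-and-urns problem (with no further restrictions on the relationship between the numbers).

If you want 20 numbers, they can be represented by 20 urns. You want the numbers to sum to 100, which means 100 balls. Since you want exactly 20 numbers (skip this step if you want up to 20 numbers, but possibly fewer), you first place 1 ball in each urn and then distribute the remaining balls randomly among the urns. Count the balls in each urn and you have 20 numbers that sum to 100.

As R code: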

as.vector(table( c( 1:20, sample(1:20, 80, replace=TRUE) ) ))

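The as.vector just strips the table class and the labels.

Quick, simple, exact, no looping, no recursion, etc.

For other totals or numbers of values, just change the appropriate parts above.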
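
For convenience, the same balls-and-urns idea can be wrapped in a small function. This is only a sketch: the name urn_sample is made up here, and it uses tabulate instead of table, so the "one ball per urn" step becomes adding 1 to every count.

```r
# Hypothetical wrapper around the balls-and-urns one-liner.
# Returns n positive integers that sum to `total`.
urn_sample <- function(n, total) {
  stopifnot(n >= 1, total >= n)
  # Throw the remaining (total - n) balls into the n urns at random,
  # count them per urn, then add the 1 ball each urn started with.
  extra <- tabulate(sample.int(n, total - n, replace = TRUE), nbins = n)
  1L + extra
}

set.seed(42)
urn_sample(20, 100)
urn_sample(30, 100)
```

Both calls return vectors of the requested length that sum to exactly 100 and contain no zeros, which covers the 20-number and 30-number cases from the question.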

score 3 · Accepted Answer

Another approach, but one that uses floating-point numbers, so it is not exactly what you are looking for, sorry:

randomsum <- function(nb, sum) {
  tmp <- sort(runif(nb-1))
  tmp <- c(min(tmp), diff(tmp), 1-max(tmp))
  as.vector(quantile(0:sum, probs=tmp))
}

For example:

R> result <- randomsum(10, 1000)
R> result
 [1]  35.282191  66.537308  17.263761 182.837409 120.064363 210.752735
 [7] 143.201079   6.164731  34.936359 182.960064
R> sum(result)
[1] 1000

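You can apply round to the result to get integers, but of course the sum may then differ slightly from the one you want. A quick and dirty workaround is to change one of the random values so that the vector sums to the number you want: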

randomsumint <- function(nb, sum) {
  tmp <- sort(runif(nb-1))
  tmp <- c(min(tmp), diff(tmp), 1-max(tmp))
  res <- as.vector(quantile(0:sum, probs=tmp))
  res <- round(res)
  res[length(res)] <- res[length(res)]+(sum-sum(res))
  res
}

This gives:

R> result <- randomsumint(10,1000)
R> result
 [1]  42 152   0  11  74 138   9 138 172 264
R> sum(result)
[1] 1000

Note that this is far from perfect, since in rare cases you can end up with negative values in the result.
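
One way to avoid the negative values is to round with a largest-remainder scheme instead of patching the last element. This is a different rounding technique swapped in for illustration, not the method used above, and the name randomsum_lr is hypothetical. Values can still be 0, so it does not meet the "no zeros" requirement from the question.

```r
# Hypothetical alternative: floor every part, then hand the missing units
# to the parts with the largest fractional remainders.
randomsum_lr <- function(nb, total) {
  cuts  <- sort(runif(nb - 1))
  parts <- diff(c(0, cuts, 1)) * total     # nb non-negative reals summing to total
  res   <- floor(parts)
  short <- round(total - sum(res))         # units lost to flooring
  if (short > 0) {
    top <- order(parts - res, decreasing = TRUE)[seq_len(short)]
    res[top] <- res[top] + 1
  }
  as.integer(res)
}

set.seed(42)
result <- randomsum_lr(10, 1000)
sum(result)
```

Because each value is only ever rounded down or bumped up by 1, no element can go below 0, and the correction is spread over several elements rather than loaded onto the last one.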

score 2 · Accepted Answer

Stars and bars and partitions from combinatorics come to mind:

foo <- function(n,total) {
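  # Redraw until no two cut points are adjacent, so every part is at least 1
  while(!exists("x",inherits=FALSE) || 1 %in% diff(x)) {
    # Interior cut points come from 1..(n+total-1) so the upper end is never drawn twice
    x <- sort(c(0,sample.int(n+total-1,n-1,replace=FALSE),n+total))
  }
  print(x)
  sort(diff(x)-1)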
}
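
The same stars-and-bars idea can also be written without a rejection loop. The sketch below is an illustrative variant, not the answer's code, and the name stars_bars is made up: it picks n - 1 distinct cut points among the total - 1 gaps between the "stars", so every part is at least 1 by construction.

```r
# Hypothetical loop-free stars-and-bars sampler.
# Returns n positive integers that sum to `total`, in random (unsorted) order.
stars_bars <- function(n, total) {
  stopifnot(n >= 1, total >= n)
  # Choose n - 1 distinct cut positions in 1..(total - 1).
  cuts <- sort(sample.int(total - 1, n - 1))
  # The gaps between consecutive cuts are the n parts.
  diff(c(0, cuts, total))
}

set.seed(1)
parts <- stars_bars(10, 100)
sum(parts)
```

Each set of cut points corresponds to exactly one ordered way of writing the total as a sum of n positive integers, so this draws uniformly over those ordered sums. Apply sort to the result if you want output comparable to the function above.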

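Another approach is to use the partitions package. It is better suited to enumerating all partitions, but it will do for now. It works as long as your total is small.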

require(partitions)
foo <- function(n,total) { 
  x <- restrictedparts(total,n,include.zero=FALSE)
  return(x[,sample.int(ncol(x),1)])
}
