c++ - Best way to seed mt19937_64 for Monte Carlo simulations

Question

I'm working on a program that runs Monte Carlo simulation; specifically, I'm using a Metropolis algorithm. The program needs to generate possibly billions of "random" numbers. I know that the Mersenne twister is very popular for Monte Carlo simulation, but I would like to make sure that I am seeding the generator in the best way possible.

Currently I'm computing a 32-bit seed using the following method:

mt19937_64 prng; //pseudo random number generator
unsigned long seed; //store seed so that every run can follow the same sequence
unsigned char seed_count; //to help keep seeds from repeating because of temporal proximity

unsigned long genSeed() {
    return (  static_cast<unsigned long>(time(NULL))      << 16 )
         | ( (static_cast<unsigned long>(clock()) & 0xFF) << 8  )
         | ( (static_cast<unsigned long>(seed_count++) & 0xFF) );
}

//...

seed = genSeed();
prng.seed(seed);

I have a feeling there are much better ways to assure non-repeating new seeds, and I'm quite sure mt19937_64 can be seeded with more then 32-bits. Does anyone have any suggestions?

score 26 · Accepted Answer

用于std::random_device生成种子。如果您的实现支持它，它将提供不确定的随机数。否则允许使用其他一些随机数引擎。

std::mt19937_64 prng;
seed = std::random_device{}();
prng.seed(seed);

operator()ofstd::random_device返回 an unsigned int，因此如果您的平台有 32 位ints，并且您想要 64 位种子，则需要调用它两次。

std::mt19937_64 prng;
std::random_device device;
seed = (static_cast<uint64_t>(device()) << 32) | device();
prng.seed(seed);

另一个可用选项是std::seed_seq用于播种 PRNG。这允许 PRNG 调用，这会在[0 ≤ i < 2 ³² )seed_seq::generate范围内产生一个无偏序列，其输出范围足够大以填充其整个状态。

std::mt19937_64 prng;
std::random_device device;
std::seed_seq seq{device(), device(), device(), device()};
prng.seed(seq);

我调用了random_device4 次来为seed_seq. 但是，就初始序列中元素的长度或来源而言，我不确定最佳实践是什么。

score 6 · Accepted Answer

让我们回顾一下（也有评论），我们想要生成不同的种子，以在以下每个事件中获得独立的随机数序列：

该程序稍后在同一台机器上重新启动，
两个线程同时在同一台机器上启动，
该程序同时在两台不同的机器上启动。

1 使用自纪元以来的时间求解，2 使用全局原子计数器求解，3 使用依赖于平台的 id 求解（请参阅如何以跨平台方式获取（几乎）唯一系统标识符？）

现在的重点是结合它们以获得uint_fast64_t（种子类型std::mt19937_64）的最佳方法是什么？我在这里假设我们不知道每个参数的先验范围或它们太大，因此我们不能仅仅使用位移来以微不足道的方式获得唯一的种子。

Astd::seed_seq是最简单的方法，但是它的返回类型uint_least32_t不是我们的最佳选择。

一个好的 64 位散列器是一个更好的选择。STLstd::hash在functional标题下提供了一种可能是将上面的三个数字连接成一个字符串，然后将其传递给哈希器。返回类型是size_t在 64 台机器上很可能符合我们要求的类型。

碰撞不太可能发生，但当然有可能，如果您想确保不建立包含多次序列的统计信息，您只能存储种子并丢弃重复的运行。

Astd::random_device也可以用于生成种子（碰撞仍然可能发生，很难说是否经常发生），但是由于实现依赖于库并且可能会下降到伪随机生成器，因此必须检查熵该设备并避免为此使用零熵设备，因为您可能会破坏上述几点（尤其是第 3 点）。不幸的是，只有当您将程序带到特定机器并使用已安装的库进行测试时，您才能发现熵。

score 4 · Accepted Answer

据我从您的评论中可以看出，您似乎感兴趣的是确保如果一个过程同时开始您的多个模拟，它们将获得不同的种子。

我可以看到您当前方法的唯一重要问题是竞争条件：如果您要同时启动多个模拟，则必须从单独的线程中完成。如果它是从单独的线程完成的，则需要以seed_count线程安全的方式进行更新，否则多个模拟可能会以相同的seed_count. 你可以简单地std::atomic<int>解决这个问题。

除此之外，它似乎比它必须的更复杂。使用两个独立的计时器有什么好处？你可以做这样简单的事情：

在程序启动时，获取当前系统时间（使用高分辨率计时器）一次，然后存储。
为每个模拟分配一个唯一的 ID（这可能只是一个初始化为 0 的整数，（如上所述，它应该在没有任何竞争条件的情况下生成），每次模拟开始时都会递增，实际上就像你的seed_count.
播种模拟时，只需使用最初生成的时间戳 + 唯一 ID。如果您这样做，则可以确保过程中的每个模拟都有一个独特的种子。

score 1 · Accepted Answer

怎么样...

有一些主代码启动线程，并且在这些线程中运行函数的副本，每个副本都有自己的 Marsenne Twister。我对么？如果是这样，为什么不在主代码中使用另一个随机生成器呢？它将以时间戳为种子，并将其连续的伪随机数发送给函数实例作为它们的种子。

score 1 · Accepted Answer

From the comments I understand you want to run several instances of the algorithm, one instance per thread. And given that the seed for each instance will be generated pretty much at the same time, you want to ensure that these seeds are different. If that is indeed what you are trying to solve, then your genSeed function will not necessarily guarantee that.

In my opinion, what you need is a parallelisable random number generator (RNG). What this means, is that you only need one RNG which you instantiate with only one seed (which you can generate with your genSeed) and then the sequence of random numbers that would normally be gerenated in a sequential environment is split in X non-overlapping sequences; where X is the number of threads. There is a very good library which provides these type of RNGs in C++, follows the C++ standard for RNGs, and is called TRNG(http://numbercrunch.de/trng).

Here is a little more information. There are two ways you can achieve non-overlapping sequences per thread. Let's assume that the sequence of random numbers from a single RNG is r = {r(1), r(2), r(3),...} and you have only two threads. If you know in advance how many random numbers you will need per thread, say M, you can give the first M of the r sequence to the first thread, ie {r(1), r(2),..., r(M)}, and the second M to the second thread, ie {r(M+1), r(M+2),...r(2M)}. This technique is called blocksplitting since you split the sequence in two consecutive blocks.

The second way is to create the sequence for the first thread as {r(1), r(3), r(5), ...} and for the second thread as {r(2), r(4), r(6),...}, which has the advantage that you do not need to know in advance how many random numbers you will need per thread. This is called leapfroging.

Note that both methods guarantee that the sequences per thread are indeed non-overlapping. The link I posted above has many examples and the library itself is extremely easy to use. I hope my post helps.

score 0 · Accepted Answer

POSIX 函数gettimeofday(2)以微秒精度给出时间。

POSIX 线程函数gettid(2)返回当前线程的 ID 号。

您应该能够将自纪元（您已经在使用）以来的时间（以秒为单位）、以微秒为单位的时间和线程 ID 结合起来，以获得在一台机器上始终唯一的种子。

如果您还需要它在多台机器上是唯一的，您可以考虑获取主机名、IP 地址或 MAC 地址。

我猜想 32 位可能就足够了，因为有超过 40 亿个独特的种子可用。除非您正在运行数十亿个进程，这似乎不太可能，否则您应该没问题，无需使用 64 位种子。

c++ - Best way to seed mt19937_64 for Monte Carlo simulations

6 回答 6

Related

Reference