c++ - Intel TBB and Microsoft PPL: how do I use next_permutation in a parallel loop?

Question

I have Visual Studio 2012 with Intel Parallel Studio 2013 installed, so I have Intel TBB available.

Say I have the following code:

const int cardsCount = 12; // will be READ by all threads
// the required number of cards of each colour to complete its set:
// NOTE that the required number of cards of each colour is not the same as the total number of cards of this colour available
int required[] = {2,3,4}; // will be READ by all threads
Card cards[cardsCount]; // will be READ by all threads
int cardsIndices[cardsCount];// this will be permuted, permutations need to be split among threads !

// set "cards" to 4 cards of each colour (3 colours total = 12 cards)
// set cardsIndices to {0,1,2,3...,11}

 // this variable will be written to by all threads, maybe have one for each thread and combine them later?? or can I use concurrent_vector<int> instead !?
int logColours[] = {0,0,0};

int permutationsCount = fact(cardsCount);

for (int pNum=0; pNum<permutationsCount; pNum++) // I want to make this loop parallel !!
{
    int countColours[3] = {0,0,0}; // local loop variable, no problem with multithreading
    for (int i=0; i<cardsCount; i++)
    {
        Card c = cards[cardsIndices[i]]; // accessed "cards"

        countColours[c.Colour]++; // local loop variable, np.
            // we got the required number of cards of this colour to complete it
        if (countColours[c.Colour] == required[c.Colour]) // read global variable "required" !
        {
                    // log that we completed this colour and go to next permutation
            logColours[c.Colour] ++; // should I use a concurrent_vector<int> for this shared variable?
            break;
        }
    }
    std::next_permutation(cardsIndices, cardsIndices+cardsCount); // !! this is my main issue
}

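What I am computing is how many times each colour gets completed if we pick cards at random from the available ones. This is done exhaustively: I go through every possible permutation and pick the cards in order, and as soon as one colour is "completed" I break and move on to the next permutation. Note that we have 4 cards of each colour, but the number of cards required to complete each colour is {2,3,4} (red, green, blue). 2 red cards are enough to complete red and 4 are available, so red is more likely to be completed than blue, which requires all 4 of its cards to be picked.

I want to make this for loop parallel, but my main problem is how to handle the permutations of "cards". There are roughly 500 million permutations (12!). If I have 4 threads, how can I split the permutations into 4 different parts and let each thread go through one of them?

And what if I don't know the number of cores on the machine and want the program to choose the right number of concurrent threads automatically? Surely there is a way to do this with the Intel or Microsoft tools?

Here is my Card struct, just in case: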

struct Card
{
public:
    int Colour;
    int Symbol;
}; // a struct definition must end with a semicolon

score 2 · Accepted Answer

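Let N = cardsNumber and M = required[0] * required[1] * ... * required[maxColor - 1]. Then your problem can actually be solved easily in O(N * M) time. In your case that is 12 * 2 * 3 * 4 = 288 operations. :)

One possible approach is to use a recurrence. Consider a function logColours f(n, required). Here n is the number of cards considered so far, and required is the vector from your example. The function returns the answer as a vector, logColours. You are interested in f(12, {2,3,4}). A simple recursive computation inside f can be written like this: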

std::vector<int> f(int n, std::vector<int> require) {
    if (cache[n].count(require)) {
        // already computed for these arguments, so reuse the cached result
        return cache[n][require];
    }

    std::vector<int> logColours(maxColor, 0); // maxColor = 3 in your example

    for (int putColor = 0; putColor < maxColor; ++putColor) {
        if (/* there is still at least one card with color 'putColor' */) {
            // put a card of color 'putColor' on place 'n'
            if (require[putColor] == 1) {
                // we've reached the needed number of cards of color 'putColor'
                ++logColours[putColor];
            } else {
                --require[putColor];
                std::vector<int> logColoursRec = f(n + 1, require);
                ++require[putColor];
                // merge the child's result into our own
                for (int i = 0; i < maxColor; ++i)
                    logColours[i] += logColoursRec[i];
            }
        }
    }

    // store logColours in the cache under this call's arguments
    cache[n][require] = std::move(logColours);
    return cache[n][require];
}

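The cache can be implemented as std::unordered_map<int, std::unordered_map<std::vector<int>, std::vector<int>>>. Note that the standard library has no std::hash specialization for std::vector<int>, so the inner map needs a custom hash functor, or you can use std::map instead.

Once you understand the main idea, you can implement it with more efficient code.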
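As a rough illustration of the recurrence described above, here is a minimal runnable sketch. It is an adaptation rather than the answer's exact code: the state is the number of cards already drawn per colour, each step is weighted by how many distinct cards of that colour are still left, and a completed colour is credited with every ordering of the remaining cards, so the tallies should line up with counts over full permutations. The names `Counts`, `Tally`, `factorial64` and `countCompletions` are hypothetical, and std::map is used for the cache because it needs no custom hash.

```cpp
#include <array>
#include <cstdint>
#include <map>

constexpr int kColours = 3;
using Counts = std::array<int, kColours>;          // cards per colour
using Tally  = std::array<std::int64_t, kColours>; // permutations per colour

// n! as a 64-bit integer (12! = 479001600 fits comfortably).
std::int64_t factorial64(int n) {
    std::int64_t r = 1;
    for (int i = 2; i <= n; ++i) r *= i;
    return r;
}

// drawn[c] is the number of cards of colour c picked so far (always less
// than required[c]). Returns, per colour, how many orderings of the cards
// not yet picked end with that colour being completed first.
Tally countCompletions(const Counts& drawn, const Counts& available,
                       const Counts& required,
                       std::map<Counts, Tally>& cache) {
    auto hit = cache.find(drawn);
    if (hit != cache.end()) return hit->second; // memoised state

    int total = 0, picked = 0;
    for (int c = 0; c < kColours; ++c) {
        total += available[c];
        picked += drawn[c];
    }

    Tally result{}; // zero-initialised
    for (int c = 0; c < kColours; ++c) {
        const int left = available[c] - drawn[c]; // distinct cards of colour c left
        if (left <= 0) continue;
        if (drawn[c] + 1 == required[c]) {
            // This pick completes colour c; the remaining cards may follow
            // in any order.
            result[c] += left * factorial64(total - picked - 1);
        } else {
            Counts next = drawn;
            ++next[c];
            const Tally sub = countCompletions(next, available, required, cache);
            for (int i = 0; i < kColours; ++i) result[i] += left * sub[i];
        }
    }
    cache[drawn] = result;
    return result;
}
```

Calling `countCompletions({0,0,0}, {4,4,4}, {2,3,4}, cache)` with an empty cache visits at most 2 * 3 * 4 = 24 distinct states, and the three tallies add up to 12!.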

score 1 · Accepted Answer

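I think this is an amateur-friendly version of @Ixanezis's answer.

Suppose red wins.

The final outcome is then: 2 red, 0-2 green, 0-3 blue.

Call the winning red card A and the other red card B. There are 12 ways to choose A and B.

The possible cases are: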

Case               Orderings after A   Orderings before A   Ways to pick greens   Ways to pick blues
0 green, 0 blue    10! = 3628800       1! = 1               1                     1
0 green, 1 blue    9!  = 362880        2! = 2               1                     4
0 green, 2 blue    8!  = 40320         3! = 6               1                     6
0 green, 3 blue    7!  = 5040          4! = 24              1                     4
1 green, 0 blue    9!  = 362880        2! = 2               4                     1
1 green, 1 blue    8!  = 40320         3! = 6               4                     4
1 green, 2 blue    7!  = 5040          4! = 24              4                     6
1 green, 3 blue    6!  = 720           5! = 120             4                     4
2 green, 0 blue    8!  = 40320         3! = 6               6                     1
2 green, 1 blue    7!  = 5040          4! = 24              6                     4
2 green, 2 blue    6!  = 720           5! = 120             6                     6
2 green, 3 blue    5!  = 120           6! = 720             6                     4

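Multiply the 4 columns together row by row and add up the products: the sum is 29064960. Multiplying by 12 gives 348779520.

In the same way, you can compute the counts for green winning and for blue winning.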
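The red-wins count above can be checked with a few lines of code. This is only a sketch of the same sum-of-products: for g greens and b blues drawn before the winning red card A, it multiplies the orderings after A, the orderings before A, and the ways to pick the greens and blues, then multiplies the total by the 12 ways to choose A and B. The helper names `factorialOf`, `choose` and `redWins` are hypothetical.

```cpp
#include <cstdint>

// n! as a 64-bit integer.
std::int64_t factorialOf(int n) {
    std::int64_t r = 1;
    for (int i = 2; i <= n; ++i) r *= i;
    return r;
}

// Binomial coefficient C(n, k) for small arguments.
std::int64_t choose(int n, int k) {
    std::int64_t r = 1;
    for (int i = 1; i <= k; ++i) r = r * (n - k + i) / i;
    return r;
}

// Permutations of the 12 cards in which red (needs 2 of 4) is completed
// before green (needs 3 of 4) or blue (needs 4 of 4).
std::int64_t redWins() {
    std::int64_t sum = 0;
    for (int g = 0; g <= 2; ++g) {       // greens drawn before A
        for (int b = 0; b <= 3; ++b) {   // blues drawn before A
            sum += factorialOf(10 - g - b)   // orderings of the cards after A
                 * factorialOf(1 + g + b)    // orderings of B, greens, blues before A
                 * choose(4, g)              // which greens
                 * choose(4, b);             // which blues
        }
    }
    return 12 * sum; // 4 * 3 ordered choices of (A, B)
}
```

`redWins()` reproduces the 348779520 figure from the table; changing the loop bounds and the fixed factors gives the green and blue cases.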

score 1 · Accepted Answer

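You can easily make your code run in parallel with 1, 2, ..., or cardsCount threads by fixing the first element of the permutation and calling std::next_permutation on the remaining elements independently in each thread. Consider the following code: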

// declarations

// #pragma omp parallel may be here
{ // start of a parallel section
    // threadIndex is this thread's id, threadNumber the total number of threads
    const int start = (cardsCount * threadIndex) / threadNumber;
    const int end = (cardsCount * (threadIndex + 1)) / threadNumber;

    int cardsIndices[cardsCount]; // a local array for each thread

    for (int firstElement = start; firstElement < end; ++firstElement) { // not const: it is incremented
        cardsIndices[0] = firstElement;
        // fill the rest with 0..cardsCount-1 in ascending order, skipping firstElement
        for (int i = 0, k = 1; i < cardsCount; ++i)
            if (i != firstElement) cardsIndices[k++] = i;
        do {
            // your calculations go here
        } while (std::next_permutation(cardsIndices + 1, cardsIndices + cardsCount)); // note the +1 here
    }
}

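If you want to use OpenMP as the parallelization tool, just add #pragma omp parallel before the parallel section, and use omp_get_thread_num() to get the thread index (omp_get_num_threads() gives the number of threads).
You also don't need concurrent_vector here, which would probably make your program very slow. Use a thread-specific accumulation array instead and add the rows up after the parallel section:
```
std::vector<std::array<int, 3>> logColours(threadNumber); // one zeroed row of counters per thread
++logColours[threadIndex][c.Colour];                      // each thread touches only its own row
```
If Card is a fairly heavy class, I'd suggest using const Card& c = ... instead of copying with Card c = ... every time.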
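Putting the pieces of this answer together, the following is a minimal sketch that uses plain std::thread instead of OpenMP (a substitution made here so the example needs no extra compiler flags beyond thread support). Each thread owns a range of first elements, permutes the rest with std::next_permutation, and writes only to its own tally, which is summed at the end. The function name `countInParallel`, the `PermTally` alias and the `colourOf` vector (the colour of card i, standing in for `cards[i].Colour`) are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <thread>
#include <vector>

using PermTally = std::array<std::int64_t, 3>;

// colourOf[i] is the colour (0..2) of card i; required[c] is the number of
// cards of colour c needed to complete it.
PermTally countInParallel(const std::vector<int>& colourOf,
                          const std::array<int, 3>& required,
                          unsigned threadCount) {
    const int n = static_cast<int>(colourOf.size());
    if (threadCount == 0) threadCount = 1;

    std::vector<PermTally> perThread(threadCount, PermTally{}); // one tally per thread
    std::vector<std::thread> workers;

    for (unsigned t = 0; t < threadCount; ++t) {
        workers.emplace_back([&, t] {
            // Thread t handles the permutations whose first element is in [start, end).
            const int start = static_cast<int>(n * t / threadCount);
            const int end   = static_cast<int>(n * (t + 1) / threadCount);
            std::vector<int> idx(n);
            for (int first = start; first < end; ++first) {
                idx[0] = first;
                // Remaining indices in ascending order, skipping `first`, so that
                // next_permutation enumerates every arrangement of the tail.
                for (int i = 0, k = 1; i < n; ++i)
                    if (i != first) idx[k++] = i;
                do {
                    std::array<int, 3> seen{};
                    for (int i = 0; i < n; ++i) {
                        const int c = colourOf[idx[i]];
                        if (++seen[c] == required[c]) {
                            ++perThread[t][c]; // only thread t writes this row
                            break;
                        }
                    }
                } while (std::next_permutation(idx.begin() + 1, idx.end()));
            }
        });
    }
    for (auto& w : workers) w.join();

    PermTally total{};
    for (const auto& part : perThread)
        for (int c = 0; c < 3; ++c) total[c] += part[c];
    return total;
}
```

Because the split is by first element, at most `colourOf.size()` threads get any work. Neighbouring rows of `perThread` may share a cache line, so accumulating into a local tally inside the lambda and storing it once at the end is a possible refinement.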

score 0 · Accepted Answer

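You can use std::thread::hardware_concurrency() from <thread>. Quoting A. Williams's "C++ Concurrency in Action":

A useful feature of the C++ Standard Library is std::thread::hardware_concurrency(). This function returns an indication of the number of threads that can truly run concurrently for a given execution of a program. On a multicore system, for example, it might be the number of CPU cores.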
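One detail worth handling is that the returned value is only a hint and may be 0 when it cannot be determined. A minimal sketch of picking a thread count with a fallback follows; the helper name `chooseThreadCount` and the default fallback of 2 are assumptions made for illustration.

```cpp
#include <thread>

// Returns the library's concurrency hint, or `fallback` when the hint is
// unavailable (hardware_concurrency() is allowed to return 0).
unsigned chooseThreadCount(unsigned fallback = 2) {
    const unsigned hint = std::thread::hardware_concurrency();
    return hint != 0 ? hint : fallback;
}
```

The result can then be used as the number of threads among which the permutations are split.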
