c++ - C++ Algorithm Loop - Parallel Processing?

Question

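I have been doing some research on multithreading for a prime number generator I wrote in C++, and I found out that what I want to do is called "parallel processing". I have been researching this for the past 45 minutes or so, and I just can't seem to figure it out.

The code I want to do this with is about 95 lines, which is too long to post here, but this is the basic concept: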

unsigned long long i, total;

for(i;true;i++){
    total = total + i;
    cout << "Your new total is " << total << endl;
}

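Is there any way I can split this across 2 processors so that they work together rather than compete? If so, how would I code it? I am somewhat familiar with C++, but there is still a lot I don't know, so an in-depth answer would be greatly appreciated.

Edit: I used the wrong algorithm the first time. I think this is the right one.

Edit 2: Since many of the answers say it depends on my algorithm, I will post my code after all, as it is only 95 lines.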

/*Generic GPL stuff, coded by me */

#include <iostream>
#include <list>
#include <fstream>
using namespace std;

int main(){
    //Declare some variables and what not.
    unsigned long long count = 0, misc = 0, length = 0, limit = 0;
    list <long long> primes;
    ifstream inFile;
    ofstream outFile;

    cout << "Initializing starting values based on your existing file of generated prime numbers.\n";

    //Now let's get our starting values;
    inFile.open("/home/user/Desktop/primes.txt");

    //First, we need to find the prime generator thus far
    for(unsigned long long x=0;inFile.good();x++){
        inFile >> count;

        if(!(bool)(x%100000000) && x!=0){
            misc = x/100000000;

            cout << misc << "00000000 primes read so far...\n";
        }
    }

    inFile.close();

    cout << "Highest generated prime found.\n";

    //Now, as much as I hate to say it, we need to parse part of the file again now that we have the largest prime.
    inFile.open("/media/ssd/primes_src.txt");

    for(length; limit < count; length++){
        inFile >> misc;
    }

    inFile.close();

    limit = misc * misc;

    cout << "Initialization complete. Now generating primes.\n";

    //Loop time
    l:

    //We're just going to flat-out skip even numbers
    count++;
    count++;

    //This checks to see if the number it's trying to test is beyond the current limit of accuracy.
    if(count >= limit){

        // Now if we are, we have 1 more possible prime factor
        length++;

        inFile.open("/media/ssd/primes_src.txt");

        for(unsigned long long x=0; x < length; x++){
            inFile >> misc;
        }

        inFile.close();

        limit = misc * misc;
    }

    inFile.open("/media/ssd/primes_src.txt");
    inFile >> misc; //We don't care about 2

    for(unsigned long long x=1; x < length; x++){
        inFile >> misc;

        if(!(bool)(count%misc)){
            inFile.close();

            goto l;
        }
    }

    inFile.close();

    outFile.open("/home/user/Desktop/primes.txt", ios::out | ios::app);

    //Now if we haven't been "goto"d, we add it to the file.
    outFile << count << endl;

    outFile.close();

    goto l;

    return 0;
}

/home/user/Desktop/primes.txt is the file where I save all of the generated primes.
/media/ssd/primes_src.txt is my file containing all primes up to 2^32, plus 1 extra prime for good measure.

score 1 · Accepted Answer

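Assuming i is the iterator, the value that each iteration of the code shown adds to total does not depend on the previous iterations of the for loop. Your algorithm looks parallelizable without much effort.

The easiest way is to enable OpenMP in your compiler options and then add the following line before the for loop: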

#pragma omp parallel for
for(...)

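Note that this answer assumes that each iteration of your algorithm does not depend on the previous one (otherwise you will have to add some code to prevent race conditions).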
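For the running total in the question, one way this could look is an OpenMP reduction. This is only a sketch and not part of the original answer: the function name `sum_range` and its bounds are assumptions introduced for illustration (an endless loop cannot be divided among threads, so an upper bound is needed), and the per-iteration printing is left out because printing every intermediate total is inherently ordered.

```cpp
#include <cassert>

// Sum the integers in [first, last). The bounds are hypothetical; the loop
// in the question has no upper limit.
unsigned long long sum_range(unsigned long long first, unsigned long long last)
{
    assert(first <= last);

    unsigned long long total = 0;

    // reduction(+:total) gives every thread a private copy of total and
    // adds the private copies together once the loop has finished, so no
    // two threads ever write to the same variable.
    #pragma omp parallel for reduction(+:total)
    for (unsigned long long i = first; i < last; i++)
    {
        total += i;
    }

    return total;
}
```

With GCC this needs the -fopenmp option; without it the pragma is ignored and the loop simply runs on one thread with the same result.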

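Edit: Your algorithm is now not easily parallelizable. Some notes:

If you can divide the computation into independent chunks, then the algorithm is easy to parallelize (one thread per chunk).
If the algorithm creates new data without modifying old data, and does not read the state of the new data, then it is also parallelizable.
If you must have the result of iteration n - 1 in order to do iteration n, then you are out of luck. Your best option here is to take a sheet of paper and a pencil and try, mathematically (or logically), to formulate your algorithm differently (i.e., change your algorithm!).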
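The first note, independent chunks with one thread per chunk, can be sketched roughly as follows. This is an illustration under assumptions rather than a rewrite of the code in the question: `is_prime` and `count_primes` are hypothetical names, plain trial division stands in for the file-based divisor list, and the range is assumed small enough that `span * (c + 1)` does not overflow.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Deliberately simple trial division; any independent test would do.
bool is_prime(unsigned long long n)
{
    if (n < 2) return false;
    for (unsigned long long d = 2; d * d <= n; d++)
    {
        if (n % d == 0) return false;
    }
    return true;
}

// Count the primes in [first, last) using one thread per chunk.
unsigned long long count_primes(unsigned long long first,
                                unsigned long long last,
                                unsigned num_chunks)
{
    assert(first <= last && num_chunks > 0);

    std::vector<unsigned long long> counts(num_chunks, 0);
    std::vector<std::thread> workers;
    const unsigned long long span = last - first;

    for (unsigned c = 0; c < num_chunks; c++)
    {
        // Chunk c covers [lo, hi); consecutive chunks share no numbers.
        const unsigned long long lo = first + span * c / num_chunks;
        const unsigned long long hi = first + span * (c + 1) / num_chunks;

        workers.emplace_back([lo, hi, c, &counts] {
            unsigned long long local = 0;
            for (unsigned long long n = lo; n < hi; n++)
            {
                if (is_prime(n)) local++;
            }
            counts[c] = local;   // each thread writes only its own slot
        });
    }

    // Combine the per-chunk results once every thread has finished.
    unsigned long long total = 0;
    for (unsigned c = 0; c < num_chunks; c++)
    {
        workers[c].join();
        total += counts[c];
    }
    return total;
}
```

Equal-width chunks are the simplest split, but the chunks holding larger numbers take longer to test, so the threads will not finish at the same time.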

score 1 · Accepted Answer

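I don't know whether your algorithm lends itself to this approach, but one way I have done parallel work is to create multiple threads that all run completely independently, except for one point where they update the "next candidate" (I was calculating weird numbers, so my update was i = __sync_fetch_and_add(&current, 2); where current is the "number processed so far"). __sync_fetch_and_add() is a built-in function in g++ rather than part of standard C++, but the Microsoft compiler has the same kind of thing in its Interlocked family, e.g. InterlockedAdd() (InterlockedExchangeAdd() is the variant that returns the previous value).

When I ran my "benchmark", I got just a little under 400% on my machine's 4 cores (100% = 1 core).

I used plain pthread_create(), and each thread ends when it reaches the "maximum" of the range given as input.

As promised: a simple prime finder: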

#include <iostream>
#include <cstring>
#include <cstdlib>
#include <pthread.h>

using namespace std;

static int current;
static int max_value = 7780;

static void *find_prime(void *)
{
    for(;;)
    {
        int i = __sync_fetch_and_add(&current, 2);
        bool prime = true;

        if (i > max_value)
        {
            pthread_exit(NULL);
        }
        for(int j = 2; j < i && prime; j++)
        {
            if (!(i % j))
            {
                prime = false;
            }
        }
        if (prime)
        {
            cout << i << " " << flush;
        }
    }
}


int main(int argc, char **argv)
{
    int    start = 3;
    int    threads = 1;
    pthread_t *thread_id;

    for(int i = 1; i < argc; i++)
    {
        if (strcmp(argv[i], "-t") == 0 && argc > i+1)
        {
            i++;
            threads = strtol(argv[i], NULL, 0);
        }
        if (strcmp(argv[i], "-e") == 0 && argc > i+1)
        {
            i++;
            max_value = strtol(argv[i], NULL, 0);
        }
    }

    current = start;

    cout << "2 " << flush;   // 2 is the only even prime; the threads test odd numbers

    thread_id = new pthread_t[threads];   // one handle per thread
    for(int i = 0; i < threads; i++)
    {
        int rc = pthread_create(&thread_id[i], NULL, find_prime, NULL);
        if (rc != 0)
        {
            cerr << "Huh? Pthread couldn't be created. rc=" << rc << endl;
        }
    }
    for(int i = 0; i < threads; i++)
    {
        pthread_join(thread_id[i], NULL);
    }
    cout << endl;
}

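Comments: main starts "threads" threads (set with -t num on the command line; there is also -e num, which defines the "max"). Each thread uses the __sync_fetch_and_add() function to "pick" a number. The thread checks whether the number is prime by iterating j over the candidate divisors. If the number is prime it is printed; otherwise the thread just picks the next number.

If, rather than printing the numbers [and given large enough numbers, you may well run into problems with cout << being called from within threads], you would like to collect them, you can use an array, with int my_index = __sync_fetch_and_add(&index, 1); and use that to store into the array.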
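A rough sketch of that array variant follows. It swaps in std::atomic and std::thread from C++11 as a portable counterpart to __sync_fetch_and_add() and pthread_create(), so it is not the answer's own code; `find_primes` is a hypothetical name, and max_value is assumed to be far enough below INT_MAX that the candidate counter cannot overflow.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Return the odd primes up to max_value, found by num_threads threads.
std::vector<int> find_primes(int max_value, int num_threads)
{
    assert(max_value >= 0 && num_threads > 0);

    std::atomic<int> current(3);           // next odd candidate to hand out
    std::atomic<std::size_t> index(0);     // next free slot in the array

    // Room for every odd candidate, so a reserved slot always exists.
    std::vector<int> found(static_cast<std::size_t>(max_value) / 2 + 1);

    auto worker = [&] {
        for (;;)
        {
            // fetch_add returns the old value, so each candidate is
            // handed to exactly one thread.
            int i = current.fetch_add(2);
            if (i > max_value) return;

            bool prime = true;
            for (long long j = 3; j * j <= i && prime; j += 2)
            {
                if (i % j == 0) prime = false;
            }

            if (prime)
            {
                // Reserve a slot first, then write to it without a lock.
                std::size_t my_index = index.fetch_add(1);
                found[my_index] = i;
            }
        }
    };

    std::vector<std::thread> pool;
    for (int t = 0; t < num_threads; t++) pool.emplace_back(worker);
    for (auto &th : pool) th.join();

    // Threads finish in no particular order, so sort before returning.
    found.resize(index.load());
    std::sort(found.begin(), found.end());
    return found;
}
```

Reserving the slot with an atomic increment means no two threads ever write to the same element, which is why the array itself needs no lock.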

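Naturally, this approach does not work if each loop iteration cannot run completely independently; in that case things get more complicated.

Edit: Note that this code lacks a lot of useful error checking. If you give it zero threads it won't do anything, if you give it a negative end value, who knows, and so on.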

$ time ./prime -t 1 -e 100000 > /dev/null

real    0m5.574s
user    0m5.553s
sys     0m0.009s

and:

$ time ./prime -t 4 -e 100000 > /dev/null

real    0m1.762s
user    0m5.572s
sys     0m0.010s

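As you can see, it is a little over 3 times faster in wall-clock time (5.574 s versus 1.762 s real) for the same amount of CPU time.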

score 0 · Accepted Answer


You can take a look at this code, which uses OpenMP to compute prime numbers.

answered 2012-12-25T22:19:14.117

score 0 · Accepted Answer

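The only way to parallelize this is to keep track of N totals and add them together after the loop. Or, if the addition stands in for some more complex function, try using a mutex to guard access to the shared variable. That would most likely be terrible in terms of performance, though...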
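Both options can be sketched side by side. This is an illustration under assumptions, not code from the question: `sum_partial` and `sum_mutex` are hypothetical names, the loop is given an upper bound `limit`, and std::thread and std::mutex from C++11 are used for the threading.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Option 1: N private totals, added together after the loop.
unsigned long long sum_partial(unsigned long long limit, unsigned num_threads)
{
    assert(num_threads > 0);

    std::vector<unsigned long long> totals(num_threads, 0);
    std::vector<std::thread> pool;

    for (unsigned t = 0; t < num_threads; t++)
    {
        pool.emplace_back([t, limit, num_threads, &totals] {
            unsigned long long local = 0;
            // Thread t handles i = t, t + N, t + 2N, ...
            for (unsigned long long i = t; i < limit; i += num_threads)
            {
                local += i;
            }
            totals[t] = local;   // a single write to this thread's own slot
        });
    }

    unsigned long long total = 0;
    for (unsigned t = 0; t < num_threads; t++)
    {
        pool[t].join();
        total += totals[t];
    }
    return total;
}

// Option 2: one shared total, with every update guarded by a mutex.
unsigned long long sum_mutex(unsigned long long limit, unsigned num_threads)
{
    assert(num_threads > 0);

    unsigned long long total = 0;
    std::mutex guard;
    std::vector<std::thread> pool;

    for (unsigned t = 0; t < num_threads; t++)
    {
        pool.emplace_back([t, limit, num_threads, &total, &guard] {
            for (unsigned long long i = t; i < limit; i += num_threads)
            {
                // Only one thread at a time may pass this point.
                std::lock_guard<std::mutex> lock(guard);
                total += i;
            }
        });
    }

    for (auto &th : pool) th.join();
    return total;
}
```

The second version takes the lock once per iteration, so the threads spend most of their time waiting on each other; that is the performance cost mentioned above.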
