c++ - Mergesort pThread implementation takes the same time as single-threaded

Question

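(I've tried to simplify this as much as possible to find out what I'm doing wrong.)

The idea of the code is that I have a global array *v (I hope using this array isn't slowing things down; the threads should never access the same values, since they each work on a different range), and I try to create 2 threads that sort the first half and the second half respectively by calling merge_sort() with the corresponding parameters.

In the threaded run I see the process use 80-100% CPU (on a dual-core CPU), while in the run without threads it stays at only 50%, yet the run times are very close.

Here is the (relevant) code: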

//These are the 2 sorting functions; every thread will call merge_sort(..). Is that a problem? Both threads calling the same (normal) function?

void merge (int *v, int start, int middle, int end) {
    //dynamically creates 2 new arrays for the v[start..middle] and v[middle+1..end]
    //copies the original values into the 2 halves
    //then sorts them back into the v array
}

void merge_sort (int *v, int start, int end) {
    //recursively calls merge_sort(start, (start+end)/2) and merge_sort((start+end)/2+1, end) to sort them
    //calls merge(start, middle, end) 
}

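//Here I expect each thread to be created and to call merge_sort on its own range (this is a simplified version of the original code, to make the bug easier to find)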

void* mergesort_t2(void * arg) {
    t_data* th_info = (t_data*)arg;
    merge_sort(v, th_info->a, th_info->b);
    return (void*)0;
}

//In main I simply create 2 threads that call the function above

int main (int argc, char* argv[])
{
    //some stuff

    //getting the clock to calculate run time
    clock_t t_inceput, t_sfarsit;
    t_inceput = clock();

    //ignore crt_depth for this example (in the full code i'm recursively creating new threads and i need this to know when to stop)
    //the a and b are the range of values the created thread will have to sort
    pthread_t thread[2];
    t_data next_info[2];
    next_info[0].crt_depth = 1;
    next_info[0].a = 0;
    next_info[0].b = n/2;
    next_info[1].crt_depth = 1;
    next_info[1].a = n/2+1;
    next_info[1].b = n-1;

    for (int i=0; i<2; i++) {
        if (pthread_create (&thread[i], NULL, &mergesort_t2, &next_info[i]) != 0) {
            cerr<<"error\n";
            return err;
        }
    }

    for (int i=0; i<2; i++) {
        if (pthread_join(thread[i], &status) != 0) {
            cerr<<"error\n";
            return err;
        }
    }

    //now i merge the 2 sorted halves
    merge(v, 0, n/2, n-1);

    //calculate end time
    t_sfarsit = clock();

    cout<<"Sort time (s): "<<double(t_sfarsit - t_inceput)/CLOCKS_PER_SEC<<endl;
    delete [] v;
}

Output (1 million values):

Sort time (s): 1.294

Output when calling merge_sort directly, without threads:

Sort time (s): 1.388

Output (10 million values):

Sort time (s): 12.75

Output when calling merge_sort directly, without threads:

Sort time (s): 13.838

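Solution:

I'd also like to thank WhozCraig and Adam, since they hinted at this from the start.

I've used the inplace_merge(..) function instead of my own merge, and the program's run times are now what they should be.

Here is my initial merge function (I'm not sure it really is the initial one; I've probably modified it a few times, and the array indices may be wrong by now, since I went back and forth between [a,b] and [a,b). This is just the last version I commented out):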

void merge (int *v, int a, int m, int c) { //sorts v[a,m] - v[m+1,c] in v[a,c]

    //create the 2 new arrays
    int *st = new int[m-a+1];
    int *dr = new int[c-m+1];
    //copy the values
    for (int i1 = 0; i1 <= m-a; i1++)
        st[i1] = v[a+i1];
    for (int i2 = 0; i2 <= c-(m+1); i2++)
        dr[i2] = v[m+1+i2];

    //merge them back together in sorted order
    int is=0, id=0;
    for (int i=0; i<=c-a; i++)  {
        if (id+m+1 > c || (a+is <= m && st[is] <= dr[id])) {
            v[a+i] = st[is];
            is++;
        }
        else {
            v[a+i] = dr[id];
            id++;
        }
    }
    //arrays allocated with new[] must each be released with delete[]
    delete[] st;
    delete[] dr;
}

All of that was replaced with:

inplace_merge(v+a, v+m, v+c);
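A note on the index convention: std::inplace_merge takes half-open iterator ranges, [first, middle) and [middle, last). If a, m and c are the inclusive bounds described in the comment on merge() above, the equivalent call needs a +1 on the last two arguments. The sketch below assumes that inclusive convention; merge_inclusive is a hypothetical wrapper name, not something from the original code.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical wrapper: merges the sorted runs v[a..m] and v[m+1..c]
// (all bounds inclusive) into sorted order in v[a..c].
void merge_inclusive(int* v, int a, int m, int c) {
    assert(a <= m && m <= c);
    // std::inplace_merge expects [first, middle) and [middle, last),
    // so the inclusive bounds m and c are shifted by one.
    std::inplace_merge(v + a, v + m + 1, v + c + 1);
}
```

With the half-open convention instead, where m is the first index of the second run and c is one past the end, the call would be inplace_merge(v+a, v+m, v+c) exactly as posted.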

Edit: some timings on my 3 GHz dual-core CPU:

1 million values: 1 thread: 7.236 s, 2 threads: 4.622 s, 4 threads: 4.692 s

10 million values: 1 thread: 82.034 s, 2 threads: 46.189 s, 4 threads: 47.36 s

score 0 · Accepted Answer

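Note: since the OP is using Windows, my answer below (which wrongly assumed Linux) may not apply. I'm leaving it up for those who might find the information useful.

clock() is the wrong interface for measuring time on Linux: it measures the CPU time used by the program (see http://linux.die.net/man/3/clock), which in a multithreaded program is the sum of the CPU time of all threads. You need to measure elapsed time, that is, wall-clock time. See this SO question for more details: "C: using clock() to measure time in multi-threaded programs", which also explains what API can be used instead of clock().

In the MPI-based implementation you are trying to compare against, two different processes are used (that is how MPI typically enables concurrency), and the CPU time of the second process is not included, so the CPU time is close to the wall-clock time. Even so, using CPU time (and therefore clock()) for performance measurement is still wrong, even in serial programs; for one thing, if a program waits for a network event or for a message from another MPI process, it still spends time, just not CPU time.

Update: in Microsoft's implementation of the C run-time library, clock() returns wall-clock time, so it is fine for your purpose. It is not clear, though, whether you are using Microsoft's toolchain or something else, such as Cygwin or MinGW.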
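To make the difference concrete, here is a minimal sketch of timing a piece of work both ways. It assumes a C++11 compiler, which the question does not confirm; wall_seconds and clock_seconds are hypothetical helper names introduced only for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>

// Hypothetical helper: runs `work` once and returns the elapsed
// wall-clock time in seconds, using a monotonic clock.
template <typename F>
double wall_seconds(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    const double s = std::chrono::duration<double>(stop - start).count();
    assert(s >= 0.0);  // steady_clock never goes backwards
    return s;
}

// Hypothetical helper: the same measurement taken with clock().
// On Linux this is CPU time summed over all threads of the process;
// with Microsoft's C run-time it is wall-clock time instead.
template <typename F>
double clock_seconds(F&& work) {
    const std::clock_t start = std::clock();
    work();
    const std::clock_t stop = std::clock();
    return static_cast<double>(stop - start) / CLOCKS_PER_SEC;
}
```

Wrapping the thread creation, the joins and the final merge in a lambda passed to wall_seconds would report the time a user actually waits. On Linux, a clock_seconds value well above the wall_seconds value for the same work is a sign that several threads were busy at once.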

score 0 · Accepted Answer

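One thing strikes me: "dynamically creates 2 new arrays [...]". Since both threads need memory from the system, they have to acquire a lock to get it, which could well be your bottleneck. In particular, the idea of allocating tiny arrays over and over sounds horribly inefficient. Someone suggested an in-place sort that needs no additional storage, which is much better for performance.

Another thing is the often-forgotten half-sentence in any big-O complexity statement: "there is an n0 such that for all n > n0 ...". In other words, maybe you haven't reached n0 yet? I recently saw a video (hopefully someone else will remember it) in which some people tried to determine this limit for certain algorithms, and their result was that these limits are very high.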
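If allocation really is the bottleneck, one way to test the idea without switching algorithms is to allocate a single scratch buffer per top-level call and reuse it for every merge. The following is only a sketch under that assumption, keeping the question's inclusive [a, c] index convention; merge_with_scratch, merge_sort_rec and merge_sort_one_alloc are hypothetical names, not functions from the original code.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Merges sorted v[a..m] and v[m+1..c] (inclusive) using caller-provided
// scratch space that must hold at least m-a+1 ints.
void merge_with_scratch(int* v, int a, int m, int c, int* scratch) {
    // Only the left run is copied out; the right run is read in place.
    std::copy(v + a, v + m + 1, scratch);
    const int left_len = m - a + 1;
    int i = 0, j = m + 1, out = a;
    while (i < left_len && j <= c)
        v[out++] = (scratch[i] <= v[j]) ? scratch[i++] : v[j++];
    // Leftover left elements; leftover right elements are already in place.
    while (i < left_len) v[out++] = scratch[i++];
}

// Recursive sort of v[a..c] that reuses one scratch buffer at every level.
void merge_sort_rec(int* v, int a, int c, int* scratch) {
    if (a >= c) return;
    const int m = a + (c - a) / 2;
    merge_sort_rec(v, a, m, scratch);
    merge_sort_rec(v, m + 1, c, scratch);
    merge_with_scratch(v, a, m, c, scratch);
}

// Entry point a thread could call for its own range: one allocation total.
void merge_sort_one_alloc(int* v, int a, int c) {
    assert(a <= c + 1);
    if (a >= c) return;
    std::vector<int> scratch((c - a) / 2 + 1);  // fits the largest left run
    merge_sort_rec(v, a, c, scratch.data());
}
```

Each thread would call merge_sort_one_alloc on its own range, so the allocator is hit once per thread rather than twice per merge. Whether this removes the slowdown depends on the allocator and platform, so it is worth measuring rather than assuming.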
