parallel-processing - 使用 pthread 会增加执行时间，改进建议

Question

我有一段代码，看起来像这样，

 for(i=0;i<NumberOfSteps;i++)
{
    for(k=0;k<NumOfNodes;k++)
    {

        mark[crawler[k]]++;
        r = rand() % node_info[crawler[k]].num_of_nodes;
        crawler[k] = (int)DataBlock[node_info[crawler[k]].index+r][0];
    }
}

我对其进行了更改，以便可以在多个线程之间分配负载。现在看起来像这样

for(i=0;i<NumberOfSteps;i++)
{
    for(k=0;k<NumOfNodes;k++)
    {            
        pthread_mutex_lock( &mutex1 );
        mark[crawler[k]]++;
        pthread_mutex_unlock( &mutex1 );

        pthread_mutex_lock( &mutex1 );
        r = rand() % node_info[crawler[k]].num_of_nodes;
        pthread_mutex_unlock( &mutex1 );

        pthread_mutex_lock( &mutex1 );
        crawler[k] = (int)DataBlock[node_info[crawler[k]].index+r][0];
        pthread_mutex_unlock( &mutex1 );
   }
}

我需要互斥锁来保护共享变量。事实证明我的并行代码比较慢。但为什么？是因为互斥体吗？

这可能与缓存线大小有关吗？

score 0 · Accepted Answer

You are not parallelizing anything but the loop heads. Everything between lock and unlock is forced to be executed sequentially. And since lock/unlock are (potentially) expensive operations, the code is getting slower.

To fix this, you should at least separate expensive computations (without mutex protection) from access to shared data areas (with mutexes). Then try to move the mutexes out of the inner loop.

You could use atomic increment instructions (depends on platform) instead of plain '++', which is generally cheaper than mutexes. But beware of doing this often on data of a single cache line from different threads in parallel (see 'false sharing').

AFAICS, you could rewrite the algorithm as indicated below with out needing mutexes and atomic increment at all. getFirstK() is NumOfNodes/NumOfThreads*t if NumOfNodes is an integral multiple of NumOfThreads.

for(t=0;t<NumberOfThreads;t++)
{
    kbegin = getFirstK(NumOfNodes, NumOfThreads, t);
    kend   = getFirstK(NumOfNodes, NumOfThreads, t+1);

    // start the following in a separate thread with kbegin and kend 
    // copied to thread local vars kbegin_ and kend_

    int k, i, r;
    unsigned state = kend_; // really bad seed

    for(k=kbegin_;k<kend_;k++)
    {
        for(i=0;i<NumberOfSteps;i++)
        {
            mark[crawler[k]]++;
            r = rand_r(&state) % node_info[crawler[k]].num_of_nodes;
            crawler[k] = (int)DataBlock[node_info[crawler[k]].index+r][0];
        }
    }
}
// wait for threads/jobs to complete

This way to generate random numbers may lead to bad random distributions, see this question for details.

parallel-processing - 使用 pthread 会增加执行时间，改进建议

1 回答 1

Related

Reference