
I've been writing a raytracer the past week, and have come to a point where it's doing enough that multi-threading would make sense. I have tried using OpenMP to parallelize it, but running it with more threads is actually slower than running it with one.

Reading over other similar questions, especially about OpenMP, one suggestion was that gcc optimizes serial code better. However, running the compiled code below with export OMP_NUM_THREADS=1 is twice as fast as with export OMP_NUM_THREADS=4; that is, the same compiled binary is used in both runs.

Running the program with time:

> export OMP_NUM_THREADS=1; time ./raytracer
real    0m34.344s
user    0m34.310s
sys     0m0.008s


> export OMP_NUM_THREADS=4; time ./raytracer
real    0m53.189s
user    0m20.677s
sys     0m0.096s

User time is a lot smaller than real, which is unusual when using multiple cores: user should be larger than real, since several cores are running at the same time.

Code that I have parallelized using OpenMP

void Raytracer::render( Camera& cam ) {

    // let the camera know to use this raytracer for probing the scene
    cam.setSamplingFunc(getSamplingFunction());

    int i, j;

    #pragma omp parallel private(i, j)
    {

        // Construct a ray for each pixel.
        #pragma omp for schedule(dynamic, 4)
        for (i = 0; i < cam.height(); ++i) {
            for (j = 0; j < cam.width(); ++j) {
                cam.computePixel(i, j);
            }
        }
    }
}

When reading this question I thought I had found my answer. It talks about the glibc implementation of rand() synchronizing calls to itself, to preserve the random-number state between threads. I use rand() heavily for Monte Carlo sampling, so I thought that was the problem. I got rid of the calls to rand(), replacing them with a single value, but using multiple threads was still slower. EDIT: oops, it turns out I didn't test this correctly; it was the random values!

Now that those are out of the way, I will discuss an overview of what's being done on each call to computePixel, so hopefully a solution can be found.

In my raytracer I essentially have a scene tree with all the objects in it. This tree is traversed heavily during computePixel when objects are tested for intersection; however, no writes are made to the tree or to any objects. computePixel essentially reads the scene many times, calling const methods on the objects, and at the very end writes a single value into its own slot in the pixel array. This is the only part I am aware of where more than one thread will try to write to the same member variable. There is no synchronization anywhere, since no two threads can write to the same cell in the pixel array.

Can anyone suggest places where there could be some kind of contention? Things to try?

Thank you in advance.

EDIT: Sorry, I should have provided more info on my system.

  • Compiler gcc 4.6 (with -O2 optimization)
  • Ubuntu Linux 11.10
  • OpenMP 3
  • Intel i3-2310M Quad core 2.1 GHz (on my laptop at the moment)

Code for compute pixel:

class Camera {

    // constructors destructors
    private:
        // this is the array that is being written to, but not read from.
        Colour* _sensor; // allocated using new at construction.
};

void Camera::computePixel(int i, int j) const {

    Colour col;

    // simple code to construct appropriate ray for the pixel
    Ray3D ray(/* params */);
    col += _sceneSamplingFunc(ray); // calls a const method that traverses scene. 

    _sensor[i*_scrWidth+j] += col;
}

From the suggestions, it might be the tree traversal that causes the slow-down. One other aspect: there is quite a lot of recursion involved once the sampling function is called (recursive bouncing of rays); could this cause these problems?

4 Answers

Thanks everyone for the suggestions, but after profiling further and eliminating other contributing factors, random-number generation did turn out to be the culprit.

As outlined in the question above, rand() needs to keep track of its state from one call to the next. If several threads try to modify this state, it would cause a race condition, so the default implementation in glibc locks every call to make the function thread-safe. This is terrible for performance.

Unfortunately, the solutions to this problem that I have seen on stackoverflow are all local, i.e. they deal with the problem in the scope where rand() is called. Instead, I propose a "quick and dirty" solution that anyone can use in their program to implement independent random-number generation for each thread, requiring no synchronization.

I have tested the code, and it works: there is no locking, and no noticeable slowdown as a result of the calls to threadrand. Feel free to point out any blatant mistakes.

threadrand.h

#ifndef _THREAD_RAND_H_
#define _THREAD_RAND_H_

// max number of thread states to store
const int maxThreadNum = 100;

void init_threadrand();

// requires openmp, for thread number
int threadrand();

#endif // _THREAD_RAND_H_

threadrand.cpp

#include "threadrand.h"
#include <cstdlib>
#include <boost/scoped_ptr.hpp>
#include <omp.h>

// can be replaced with array of ordinary pointers, but need to
// explicitly delete previous pointer allocations, and do null checks.
//
// Importantly, the double indirection tries to avoid putting all the
// thread states on the same cache line, which would cause cache invalidations
// to occur on other cores every time rand_r would modify the state.
// (i.e. false sharing)
// A better implementation would be to store each state in a structure
// that is the size of a cache line
static boost::scoped_ptr<unsigned int> randThreadStates[maxThreadNum];

// reinitialize the array of thread state pointers, with random
// seed values.
void init_threadrand() {
    for (int i = 0; i < maxThreadNum; ++i) {
        randThreadStates[i].reset(new unsigned int(std::rand()));
    }
}

// requires openmp, for thread number, to index into array of states.
int threadrand() {
    int i = omp_get_thread_num();
    return rand_r(randThreadStates[i].get());
}

Now you can initialize the random states for the threads from main using init_threadrand(), and subsequently obtain a random number using threadrand() when using multiple threads in OpenMP.

Answered 2012-02-02T05:58:55.990

The answer is: without knowing what machine you're running this on, and without really seeing the code of your computePixel function, it depends.

There are quite a few factors that could affect the performance of your code. One thing that comes to mind is cache alignment. Perhaps your data structures, and you did mention a tree, are not really ideal for caching, and the CPU ends up waiting for data from RAM because it cannot fit things into the cache. Wrong cache-line alignments could cause something like that. If the CPU has to wait for things to come from RAM, it is likely that the thread will be context-switched out and another thread will be run.

Your OS thread scheduler is non-deterministic, so when a thread runs is not a predictable thing; if your threads are not running a lot, or are contending for CPU cores, this could also slow things down.

Thread affinity also plays a role. A thread will be scheduled on a particular core, and normally the scheduler will try to keep that thread on the same core. If several of your threads are running on a single core, they will have to share that core. Another reason things could slow down. For performance reasons, once a particular thread has run on a core, it is normally kept there, unless there's a good reason to swap it to another core.

There are some other factors that I can't remember off the top of my head; however, I suggest doing some reading on threading. It's a complicated and extensive topic, and there's plenty of material out there.

Is the data being written at the end data that other threads need to be able to read in order to run computePixel?

Answered 2012-01-23T23:45:28.140

I just looked it up: the Intel i3-2310M doesn't actually have 4 cores, it has 2 cores and hyper-threading. Try running your code with just 2 threads and see if that helps. I generally find hyper-threading to be totally useless when you do a lot of computation; on my laptop I turned it off and got much better compile times for my projects.

In fact, just go into your BIOS and turn off HT; it's not useful for a development/computation machine.

Answered 2012-01-24T15:10:01.813

One big possibility is false sharing. It looks like you are computing the pixels in sequence, so each thread may be working on interleaved pixels. This is usually a very bad thing to do.

What could be happening is that each thread is trying to write a pixel value beside one being written in another thread (they all write to the sensor array). If these two output values share the same CPU cache line, this forces the CPU to flush the cache between the processors. That results in an excessive amount of flushing between CPUs, which is a relatively slow operation.

To fix this, you need to ensure that each thread truly works on an independent region. Right now it appears you divide on rows (I'm not positive since I don't know OMP). Whether this works depends on how big your rows are, but still the end of each row will overlap with the beginning of the next (in terms of cache lines). You might want to try breaking the image into four blocks and having each thread work on a series of sequential rows (e.g. 1..10, 11..20, 21..30, 31..40). This would greatly reduce the sharing.

Don't worry about reading constant data. As long as the data block is not being modified, every thread can read that information efficiently. However, be wary of any mutable data inside your constant data.

Answered 2012-01-24T08:24:56.170