c++ - C++ Intel TBB inner loop optimisation

Question

I am trying to use Intel TBB to parallelise an inner loop (the 2nd of 3); however, I only get a decent payoff when the inner 2 loops are significant in size.

Is TBB spawning new threads for every iteration of the major loop? Is there any way to reduce the overhead?

tbb::task_scheduler_init tbb_init(4); //I have 4 cores
tbb::blocked_range<size_t> blk_rng(0, crs_.y_sz, crs_.y_sz/4);
boost::chrono::system_clock::time_point start = boost::chrono::system_clock::now();
for(unsigned i=0; i!=5000; ++i)
{   
    tbb::parallel_for(blk_rng, 
    [&](const tbb::blocked_range<size_t>& br)->void
    {   
    :::

It might be interesting to note that OpenMP (which I am trying to remove!!!) doesn't have this problem.

I am compiling with:

Intel ICC 12.1 at -O3 -xHost -mavx

On an Intel 2500K (4 cores)

EDIT: I can't really change the order of the loops, because the outer loop's test needs to be replaced with a predicate based on the loop's result.

score 1 · Accepted Answer

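No, TBB does not spawn new threads for every invocation of parallel_for. In fact, unlike OpenMP parallel regions, each of which may start a new thread team, TBB works with the same thread team until all task_scheduler_init objects are destroyed; and in the case of implicit initialization (with task_scheduler_init omitted), the same worker threads are used until the end of the program.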

So the performance problem is caused by something else. In my experience, the most probable reasons are:

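Lack of compiler optimizations, auto-vectorization first of all (this can be checked by comparing the single-threaded performance of OpenMP and TBB; if TBB is much slower, then this is the most probable reason).
Cache misses; if you run through the same data 5000 times, cache locality is very important, and OpenMP's default schedule(static) works very well, deterministically repeating exactly the same partitioning each time, while TBB's work-stealing scheduler has significant randomness. Setting the blocked_range grain size to problem_size/num_threads ensures a single piece of work per thread, but does not guarantee the same distribution of pieces; affinity_partitioner is supposed to help with that.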
