I am trying to use Intel TBB to parallelise an inner loop (the 2nd of 3); however, I only get a decent payoff when the inner 2 loops are significant in size.

Is TBB spawning new threads for every iteration of the outermost loop? Is there any way to reduce the overhead?

tbb::task_scheduler_init tbb_init(4); //I have 4 cores
tbb::blocked_range<size_t> blk_rng(0, crs_.y_sz, crs_.y_sz/4);
boost::chrono::system_clock::time_point start   =boost::chrono::system_clock::now();
for(unsigned i=0; i!=5000; ++i)
{   
    tbb::parallel_for(blk_rng, 
    [&](const tbb::blocked_range<size_t>& br)->void
    {   
    :::

It might be interesting to note that OpenMP (which I am trying to remove!) doesn't have this problem.

I am compiling with:

Intel ICC 12.1 at -O3 -xHost -mavx

on an Intel 2500K (4 cores).

EDIT: I can't really change the order of the loops, because the outer loop's test needs to be replaced with a predicate based on the loop's result.

1 Answer

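No, TBB does not spawn new threads for every call to parallel_for. In fact, unlike OpenMP parallel regions, each of which may start a new thread team, TBB works with the same thread team until all task_scheduler_init objects are destroyed; and with implicit initialization (task_scheduler_init omitted), the same worker threads are used until the end of the program.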

So the performance problem is caused by something else. In my experience, the most likely reasons are:

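  • Lack of compiler optimizations, auto-vectorization first of all. This can be checked by comparing the single-threaded performance of the OpenMP and TBB versions; if TBB is much slower, this is the most likely reason.
  • Cache misses. If you run over the same data 5000 times, cache locality is very important. OpenMP's default schedule(static) works well here because it deterministically repeats exactly the same partitioning every time, whereas TBB's work-stealing scheduler has a large degree of randomness. Setting the blocked_range grain size to problem_size/num_threads ensures one piece of work per thread, but does not guarantee the same distribution of pieces across threads; affinity_partitioner should help with that.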
Answered 2012-02-16T12:13:31.663