I am trying to use Intel TBB to parallelise an inner loop (the 2nd of 3) however, i only get decent pay off when the inner 2 loops are significant in size.
Is TBB spawning new threads for every iteration of the major loop? Is there anyway to reduce the overhead?
tbb::task_scheduler_init tbb_init(4); //I have 4 cores
tbb::blocked_range<size_t> blk_rng(0, crs_.y_sz, crs_.y_sz/4);
boost::chrono::system_clock::time_point start =boost::chrono::system_clock::now();
for(unsigned i=0; i!=5000; ++i)
{
tbb::parallel_for(blk_rng,
[&](const tbb::blocked_range<size_t>& br)->void
{
:::
It might be interesting to note that openMP (which I am trying to remove!!!) doesn't have this problem.
I am compiling with:
intel ICC 12.1 at -03 -xHost -mavx
On a intel 2500k (4 cores)
EDIT: I can really change the order of loops, because the out loops test need to be replace with a predicate based on the loops result.