
I am currently working on a C++ sparse matrix/math/iterative solver library for a simulation tool I manage. I would have preferred to use an existing package, but after extensive investigation none were found that were appropriate for our simulator (we looked at FLENS, IT++, PETSc, Eigen, and several others). The good news is that my solvers and sparse matrix structures are now very efficient and robust. The bad news is that I am now looking into parallelization using OpenMP, and the learning curve is a bit steep.

The domain we solve can be broken into sub-domains, which come together in a block-diagonal format. So our storage scheme ends up looking like an array of smaller square matrices (blocks[]), each with a format appropriate to its sub-domain (e.g. Compressed Row Storage: CRS, Compressed Diagonal Storage: CDS, dense, etc.), plus a background matrix (currently using CRS) that accounts for the connectivity between sub-domains.

The "hot spot" in most (all?) iterative solvers is the matrix-vector multiplication operation, and this is true of my library as well. Thus, I've been focusing my efforts on optimizing my MxV routines. For the block-diagonal structure, the pseudocode for M*x=b is as follows:

b = background_matrix*x
start_index = 1;
end_index = 0;
for(i=1:number of blocks) {
    end_index = start_index + blocks[i].numRows() - 1;  // inclusive range
    b.range(start_index, end_index) += blocks[i] * x.range(start_index, end_index);
    start_index = end_index + 1;
}

where background_matrix is the background (CRS) matrix, blocks is the array of sub-domain matrices, and .range returns the portion of the vector from a starting index to an ending index.

Obviously this loop can be (and has been) parallelized, since the iterations are independent of one another (the ranges are non-overlapping). With 10-15 blocks in a typical system, 4+ threads actually make a significant difference.
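For concreteness, here is a minimal sketch of that parallelized block loop, with dense row-major blocks standing in for the real per-sub-domain formats; the `Block` struct and `blockDiagMxV` name are illustrative, not my library's actual API, and the background-matrix product is omitted:

```cpp
#include <cstddef>
#include <vector>

// Illustrative stand-in for one sub-domain block (dense, row-major).
struct Block {
    std::size_t n;          // block is n x n
    std::vector<double> a;  // row-major entries
};

// b += block-diagonal part of M times x.
void blockDiagMxV(const std::vector<Block>& blocks,
                  const std::vector<double>& x,
                  std::vector<double>& b)
{
    // Precompute each block's starting offset so iterations are independent.
    std::vector<std::size_t> offset(blocks.size());
    std::size_t pos = 0;
    for (std::size_t i = 0; i < blocks.size(); ++i) {
        offset[i] = pos;
        pos += blocks[i].n;
    }

    // Each iteration touches a disjoint range of b, so no synchronization
    // is needed beyond the loop's implicit barrier. dynamic scheduling
    // helps when blocks differ in size.
    #pragma omp parallel for schedule(dynamic)
    for (long i = 0; i < static_cast<long>(blocks.size()); ++i) {
        const Block& blk = blocks[i];
        const std::size_t off = offset[i];
        for (std::size_t r = 0; r < blk.n; ++r) {
            double sum = 0.0;
            for (std::size_t c = 0; c < blk.n; ++c)
                sum += blk.a[r * blk.n + c] * x[off + c];
            b[off + r] += sum;
        }
    }
}
```

Without OpenMP the pragma is simply ignored and the loop runs serially, which makes the routine easy to test in both configurations.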

The other place where parallelization has proven to be a good option is in the MxV operation for each sub-domain storage scheme (the calls in lines 1 and 6 of the code above). There is plenty out there on parallelizing CRS, CDS, and dense matrix MxV operations. Typically, a nice boost is seen with 2 threads, with greatly diminishing returns as more threads are added.

I am envisioning a scheme where 4 threads would be used in the block loop of the above code, and each of those threads would use 2 threads for its sub-domain solve. However, I am not sure how one would manage the pool of threads using OpenMP: is it possible to limit the number of threads in an OpenMP for loop? Is this multi-level parallelism something that makes sense in practice? Any other thoughts on what I've proposed here would be appreciated (and thanks for reading all the way to the end!)


2 Answers


Please note that everything I describe here is implementation dependent.

Is it possible to limit the number of threads in an OpenMP for loop?

Yes. There are different ways to do this. Call omp_set_nested(1); then use a directive like #pragma omp parallel for num_threads(4) (or similar) on your outer loop and #pragma omp parallel for num_threads(2) directives on your inner loop. This should give you 8 threads (depending on the implementation, you may also have to set OMP_THREAD_LIMIT if you have fewer than 8 cores).
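A hedged sketch of that nested scheme, with a per-row counter standing in for the real block MxV work (only the OpenMP structure matters here; the function name is illustrative). Note that newer OpenMP versions deprecate omp_set_nested in favor of omp_set_max_active_levels:

```cpp
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// 4 threads across the blocks, 2 threads inside each block's "MxV".
// The inner work is a stand-in: it just counts rows processed.
std::vector<int> nestedBlockLoop(int numBlocks, int rowsPerBlock)
{
    std::vector<int> rowsDone(numBlocks, 0);
#ifdef _OPENMP
    omp_set_nested(1);  // allow parallel regions inside parallel regions
#endif
    #pragma omp parallel for num_threads(4)
    for (int i = 0; i < numBlocks; ++i) {
        // Inner region: 2 threads per sub-domain solve.
        int count = 0;
        #pragma omp parallel for num_threads(2) reduction(+:count)
        for (int r = 0; r < rowsPerBlock; ++r)
            count += 1;  // stand-in for processing one row of the block
        rowsDone[i] = count;
    }
    return rowsDone;
}
```

Each outer iteration writes only its own rowsDone[i], and the inner loop accumulates through a reduction, so the result is the same regardless of how many threads the implementation actually grants.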

Alternatively, you can unroll your loops manually, e.g. using something like

#pragma omp parallel sections
{
     #pragma omp section
     // do your stuff for the first part, nest parallel region again
     #pragma omp section
     // and so on for the other parts
}

Sometimes you can instead use #pragma omp task in OpenMP 3.0.
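A sketch of the task variant: one task per block, spawned from a single thread inside the parallel region. The per-block "work" here is just a flag; the function name is illustrative:

```cpp
#include <vector>

// Spawn one OpenMP task per block; the runtime's scheduler, rather than a
// static loop partition, decides which thread runs which block.
std::vector<int> taskPerBlock(int numBlocks)
{
    std::vector<int> done(numBlocks, 0);
    #pragma omp parallel
    {
        // Only one thread creates the tasks; all threads may execute them.
        #pragma omp single
        {
            for (int i = 0; i < numBlocks; ++i) {
                #pragma omp task firstprivate(i) shared(done)
                done[i] = 1;  // stand-in for this block's MxV
            }
        }
    }
    return done;
}
```

Because each task writes a distinct element of done, no synchronization beyond the implicit barriers is needed; without OpenMP the loop simply runs serially.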

Or you start 8 threads, query the current thread number inside the parallel section, and schedule manually based on the thread number.
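A minimal sketch of such hand-rolled scheduling, assuming each block splits into two tasks (names and the task-to-block mapping are purely illustrative):

```cpp
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Start 8 threads and map thread ids to tasks by hand: thread t takes
// tasks t, t + nthreads, t + 2*nthreads, ...
std::vector<int> manualSchedule(int numBlocks)
{
    std::vector<int> taskOwner(numBlocks * 2, -1);  // two tasks per block
    #pragma omp parallel num_threads(8)
    {
#ifdef _OPENMP
        const int tid      = omp_get_thread_num();
        const int nthreads = omp_get_num_threads();
#else
        const int tid = 0, nthreads = 1;  // serial fallback
#endif
        // Hand-rolled static schedule over all tasks; each thread's set of
        // tasks is disjoint, so the writes never race.
        for (int task = tid; task < numBlocks * 2; task += nthreads)
            taskOwner[task] = task / 2;  // which block this task serves
    }
    return taskOwner;
}
```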

Finally, if you have a perfectly nested loop (a loop is perfectly nested if the actual assignment happens only in the innermost loop), you can rewrite everything into a single loop. Basically, pack your two iterators i and j into one big iterator (i, j). Note that this may reduce locality and hence performance.
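A small sketch of that packing on an elementwise example (names illustrative): a single k loop over i*N + j replaces the nested i/j loops, so one parallel for covers all the work.

```cpp
#include <vector>

// c = a + b over an M x N grid, with the two iterators packed into k.
std::vector<double> packedAdd(int M, int N,
                              const std::vector<double>& a,
                              const std::vector<double>& b)
{
    std::vector<double> c(M * N);
    #pragma omp parallel for
    for (int k = 0; k < M * N; ++k) {
        const int i = k / N;  // recover the original iterators
        const int j = k % N;
        c[i * N + j] = a[i * N + j] + b[i * N + j];
    }
    return c;
}
```

For perfectly nested loops, OpenMP 3.0's collapse(2) clause performs this packing automatically without the manual divide/modulo.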

Is this multi-level parallelism something that in practice makes sense?

It depends, and you have to find out yourself. In general, multi-level parallelism makes your problem more scalable; scheduling, however, may become more complex. This paper might be interesting.

Regarding setting the number of threads manually: the main advantage is that you can use specific knowledge about your problem when scheduling. This lets you reduce overhead and achieve better locality in the executing code, and hence more cache hits and less main-memory I/O.

The main disadvantage of setting thread counts manually in nested parallelism is that threads in the innermost loop may sit idle at its implicit barrier while additional work could be done (example). Besides, coarse-grained parallelism does not scale well. So if your outer loop's iterations have very different run times, you will want to schedule more flexibly than simply splitting into 4 threads.

Any other thoughts

Have you thought about doing the MxV with SIMD? Depending on the architecture, this can give a speedup of 2-4x. I quickly googled this presentation for you.
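The kernel inside a dense or CRS row times x is a dot product, which is exactly what SIMD targets. As a hedged sketch, OpenMP 4.0 (which postdates this answer) added #pragma omp simd for this; on older toolchains, SSE intrinsics or compiler auto-vectorization play the same role:

```cpp
// Vectorizable dot product: the simd pragma asks the compiler to generate
// SIMD code for the reduction; without OpenMP 4.0 the pragma is ignored
// and the loop still computes the same result scalar-wise.
double dotSimd(const double* a, const double* b, int n)
{
    double sum = 0.0;
    #pragma omp simd reduction(+:sum)
    for (int i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```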

For MxV, loop tiling, register and cache blocking, and related techniques can increase data locality and reduce other problems such as false sharing. Chapter 11 of this book (which you can preview) may give you some additional ideas on how to restructure data access.
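As one hedged illustration of cache blocking for a dense MxV (names and the tile width are illustrative, not tuned): process the matrix in column tiles so the matching slice of x stays in cache while sweeping the rows.

```cpp
#include <algorithm>
#include <vector>

// y = A * x for a dense row-major n x n matrix, computed tile by tile.
std::vector<double> tiledMxV(int n, const std::vector<double>& A,
                             const std::vector<double>& x)
{
    const int T = 64;  // tile width; tune to the cache size
    std::vector<double> y(n, 0.0);
    for (int jj = 0; jj < n; jj += T) {        // loop over column tiles
        const int jend = std::min(jj + T, n);
        for (int i = 0; i < n; ++i) {
            double sum = 0.0;
            for (int j = jj; j < jend; ++j)    // x[jj..jend) stays cached
                sum += A[i * n + j] * x[j];
            y[i] += sum;                       // accumulate this tile's part
        }
    }
    return y;
}
```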

Answered 2010-07-01T20:06:40.863

Why not consult the experts over at OpenMP.org?

Register and log in: http://openmp.org/forum/viewforum.php?f=3

Answered 2010-07-01T19:41:12.023