parallel-processing - Are gfortran whole-array expressions multithreaded?

Question

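I am new to Fortran and gfortran. I learned that whole-array expressions are computed in parallel, but I see that the computation only takes place on one core of my computer.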

I use the following code:

program prueba_matrices

 implicit none

 integer, parameter                             :: num = 5000
 double precision, dimension(1:num,1:num)       :: A, B, C
 double precision, dimension (num*num)          :: temp
 integer                               :: i

 temp = (/ (i/2.0, i=1,num*num) /)
 A = reshape(temp, (/ num, num/) )
 B = reshape(temp, (/ num, num/) )
 C = matmul(A , B)

end program prueba_matrices

I compile it like this:

gfortran prueba_matrices.f03 -o prueba_gfortran

Watching the graph that gnome-system-monitor draws in real time, I can see that only one core is working. If I replace the line

  C = matmul(A , B)

with

  C = A * B

the behavior is the same.

What am I doing wrong?

score 2 · Accepted Answer

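If you want your matmul calls from gfortran to be multithreaded, the easiest approach is to link against an external BLAS package that has been compiled with multithreading support. Candidates include OpenBLAS (née GotoBLAS), ATLAS, or commercial packages such as Intel's MKL, AMD's ACML, or Apple's Accelerate framework.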

For instance, take this simple example:

program timematmult

  real, allocatable, dimension(:,:) :: A, B, C
  integer, parameter :: N = 2048

  allocate( A(N,N) )
  allocate( B(N,N) )
  allocate( C(N,N) )

  call random_seed
  call random_number(A)
  call random_number(B)

  C = matmul(A,B)

  print *, C(1,1)

  deallocate(C)
  deallocate(B)
  deallocate(A)

end program timematmult

Using the built-in matmul:

$ gfortran -o matmult matmult.f90
$ time ./matmult
   514.38751

real    0m6.518s
user    0m6.374s
sys     0m0.021s

and using the multithreaded GotoBLAS library:

$ gfortran -o matmult matmult.f90 -fexternal-blas -lgoto2
$ time ./matmult
   514.38696

real    0m0.564s
user    0m2.202s
sys     0m0.964s

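Note in particular that the real (wall-clock) time is less than the user time, which indicates that multiple cores are being used.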
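The same wall-clock versus CPU-time comparison can be made from inside the program. The following is a minimal sketch, not part of the original answer; the program name and the matrix size are illustrative. It assumes that cpu_time reports CPU time summed over all threads of the process, which is what gfortran typically does, although the Fortran standard leaves this processor-dependent.

```fortran
program time_matmul_sketch
  implicit none
  integer, parameter :: i8 = selected_int_kind(18)  ! wide integer for clock ticks
  integer, parameter :: n = 1024                    ! illustrative size
  real, allocatable  :: a(:,:), b(:,:), c(:,:)
  integer(i8)        :: t0, t1, rate
  real               :: cpu0, cpu1

  allocate(a(n,n), b(n,n), c(n,n))
  call random_number(a)
  call random_number(b)

  ! Sample wall-clock ticks and CPU time around the matmul call.
  call system_clock(t0, rate)
  call cpu_time(cpu0)
  c = matmul(a, b)
  call cpu_time(cpu1)
  call system_clock(t1)

  print *, 'wall seconds:', real(t1 - t0) / real(rate)
  print *, 'cpu seconds: ', cpu1 - cpu0
  print *, c(1,1)   ! use the result so the multiplication is not optimized away
end program time_matmul_sketch
```

If the program is linked against a multithreaded BLAS as described above, the CPU time should come out noticeably larger than the wall time; with the built-in single-threaded matmul the two should be roughly equal.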

score 2 · Accepted Answer

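GFortran/GCC does have some automatic parallelization capability, see http://gcc.gnu.org/wiki/AutoParInGCC. It is generally not that good, so it is not enabled at any of the -O<N> optimization levels; you have to select it explicitly with -ftree-parallelize-loops=N, where N is the number of threads you want to use. Note, however, that in your example above a loop like "A*B" is likely to be limited by memory bandwidth (for sufficiently large arrays), so adding cores might not help much. Furthermore, the MATMUL intrinsic resolves to an implementation in the gfortran runtime library, which is not compiled with the autopar option (unless you build it that way yourself).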

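What would help your example code more is actually enabling optimization in the first place. With -O3, gfortran automatically enables vectorization, which can also be seen as a way of parallelizing loops, although not across multiple CPU cores.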
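To try the flags mentioned in this answer, a small test case along the following lines could be used. This is only a sketch: the program name, the array size, and the choice of four threads are illustrative assumptions, and whether GCC actually parallelizes the loop depends on its own cost analysis.

```fortran
! Possible compile lines (thread count 4 is an arbitrary choice):
!   gfortran -O3 autopar_sketch.f90 -o autopar_sketch
!   gfortran -O2 -ftree-parallelize-loops=4 autopar_sketch.f90 -o autopar_sketch
program autopar_sketch
  implicit none
  integer, parameter :: n = 2000   ! illustrative size
  double precision, allocatable :: a(:,:), b(:,:), c(:,:)
  integer :: i, j

  allocate(a(n,n), b(n,n), c(n,n))
  call random_number(a)
  call random_number(b)

  ! Elementwise product written as explicit loops. The inner loop runs
  ! over the first index because Fortran arrays are stored column-major.
  do j = 1, n
    do i = 1, n
      c(i,j) = a(i,j) * b(i,j)
    end do
  end do

  print *, c(1,1)   ! use the result so the loops are not optimized away
end program autopar_sketch
```

Comparing the real and user times reported by time for the two builds shows whether more than one core was used. Because this loop is memory-bound, any speedup may be small.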

score 1 · Accepted Answer

I think that a key sentence in the course that you cited is "With array assignment there is no implied order of the individual assignments, they are performed, conceptually, in parallel." The key word is "conceptually". It isn't saying that whole array expressions are actually executed in parallel; you shouldn't expect more than one core to be used. For that, you need to use OpenMP or MPI (outside of Fortran itself) or the coarrays of Fortran 2008.
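As an illustration of the OpenMP route, a whole-array expression can be wrapped in a workshare construct. This is a minimal sketch, not taken from the question; the program name and array size are made up, and how well the work is actually divided among threads is up to the compiler's OpenMP implementation.

```fortran
! Compile with OpenMP enabled, for example:
!   gfortran -O2 -fopenmp workshare_sketch.f90 -o workshare_sketch
! Without -fopenmp the directives are plain comments and the code runs serially.
program workshare_sketch
  implicit none
  integer, parameter :: n = 2000   ! illustrative size
  double precision, allocatable :: a(:,:), b(:,:), c(:,:)

  allocate(a(n,n), b(n,n), c(n,n))
  call random_number(a)
  call random_number(b)

  ! Ask OpenMP to divide the elementwise array assignment among threads.
  !$omp parallel workshare
  c = a * b
  !$omp end parallel workshare

  ! Sanity check against the same expression evaluated outside the construct.
  print *, 'max abs difference from serial result:', maxval(abs(c - a*b))
end program workshare_sketch
```

The number of threads can be set at run time through the OMP_NUM_THREADS environment variable.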

EDIT: Fortran didn't have, as part of the language, actual parallel execution until the coarrays of Fortran 2008. Some compilers might provide parallelization otherwise and some language features make it easier for compilers to implement parallel execution (optionally). The sentence that I cited from the web article better states reality than the portion you cite. Whole-array expressions were not intended to require parallel execution; they are a syntactical convenience to the programmer, making the language higher level, so that array operations can be expressed in single statements, without writing do loops. In any case, no article on the web is definitive. Your observation of the lack of parallel executions shows which statement is correct. It does not contradict the Fortran language.
