I want to compute the row-sums of an m x n matrix A, or equivalently the column-sums of its transpose A' (I already have both in memory, so using A' costs me nothing extra). I plan to launch m threads, each of which can either loop over the n columns of A, or over the n rows of A'. Which approach will be faster if we assume the matrices are stored in column-major format (i.e. as in cuBLAS)?
My thinking so far (on coalesced memory access):
If I row-sum, then threads in the same block will read from adjacent memory locations at each iteration. On the other hand, if I column-sum, then each thread will iterate over its own contiguous block of memory. So if I have threads 1, 2 and 3 of the same block, their memory accesses will look like this (assuming column-major storage):
1 2 3 ... 1 2 3 ... 1 2 3 ... for row-sums
1 1 1 ... 2 2 2 ... 3 3 3 ... for column-sums
- But this doesn't tell me which pattern will be faster.
- It also doesn't take into account the behavior at block level (e.g. if the first block launched sums over rows 1-32, is the 2nd block launched guaranteed to sum over rows 33-64?)
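To make the two access patterns concrete, here is a sketch of both kernels (kernel names and the launch configuration are my own, not from the question). The first variant produces the `1 2 3 ... 1 2 3` pattern: at each iteration, consecutive threads of a warp read consecutive addresses, so the loads coalesce. The second produces `1 1 1 ... 2 2 2`: each thread walks contiguous memory, but threads within a warp are far apart at every iteration, so the loads do not coalesce.

```cuda
// Variant 1: row-sums of the m x n column-major matrix A.
// Thread `row` sums row `row`; at iteration `col`, threads row, row+1, ...
// read A[row + col*m], A[row+1 + col*m], ... -- adjacent addresses,
// so each warp's loads combine into few memory transactions.
__global__ void row_sums(const float *A, float *out, int m, int n)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= m) return;
    float sum = 0.0f;
    for (int col = 0; col < n; ++col)
        sum += A[row + (size_t)col * m];  // per-thread stride of m
    out[row] = sum;
}

// Variant 2: column-sums of the n x m column-major matrix A' (= At).
// Thread `i` sums column i of At, which is the contiguous block
// At[i*n .. i*n + n-1]. Contiguous per thread, but at each iteration
// threads in a warp are n elements apart, so loads do NOT coalesce.
__global__ void col_sums_transpose(const float *At, float *out, int m, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= m) return;
    float sum = 0.0f;
    for (int row = 0; row < n; ++row)
        sum += At[row + (size_t)i * n];  // per-thread stride of 1
    out[i] = sum;
}
```

Both would be launched with something like `row_sums<<<(m + 255) / 256, 256>>>(dA, dOut, m, n);` so that one thread handles one output element.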