I want to compute a row-sum of an m x n matrix A, or equivalently the column-sum of its transpose A' (I have both in memory, so A' costs me nothing extra to compute). I plan to launch m threads, each of which can either loop over the n columns of A or over the n rows of A'. Which approach will be faster if we assume the matrices are stored in column-major format (i.e. as with CUBLAS)?
My thinking so far (on coalesced memory access):
If I row-sum, then threads in the same block will read from adjacent memory locations at each iteration. Yet equally, if I column-sum instead, then each thread will iterate over a contiguous block of memory. So if I have threads 1, 2 and 3 of the same block, their memory access will look like so (assuming column-major storage):
1 2 3 ... 1 2 3 ... 1 2 3 ... for row-sums
1 1 1 ... 2 2 2 ... 3 3 3 ... for column-sums
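To spell the same two patterns out as actual column-major offsets (element (i, j) of an m x n matrix sits at offset i + j*m), here is a small host-only snippet with toy sizes m = n = 4, purely for illustration:

```
#include <stdio.h>

// Print which offsets threads 0..2 touch at each iteration, for both schemes.
// Toy sizes m = n = 4; the offset formulas are the only point of interest.
int main(void)
{
    const int m = 4, n = 4;

    printf("row-sum over A      (thread i, iteration j reads offset i + j*m):\n");
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < 3; ++i)
            printf("t%d->%2d  ", i, i + j * m);
        printf("\n");
    }

    printf("column-sum over A'  (thread i, iteration j reads offset j + i*n):\n");
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < 3; ++i)
            printf("t%d->%2d  ", i, j + i * n);
        printf("\n");
    }
    return 0;
}
```

What this makes explicit is that in the row-sum case the offsets touched within one iteration are adjacent across threads, whereas in the column-sum case they are n apart across threads, even though each individual thread walks contiguous memory.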
- But this doesn't tell me which will be faster.
- It also doesn't take into account the behavior at the block level (i.e. if the first block launched sums over rows 1-32, will the 2nd block launched be guaranteed to sum over rows 33-64?)