I want to compute the row-sums of an m x n matrix A, or equivalently the column-sums of its transpose A' (I already have both in memory, so using A' costs me nothing extra). I plan to launch m threads, each of which can either loop over the n columns of A, or over the n rows of A'. Which approach will be faster if we assume the matrices are stored in column-major format (i.e. as in cuBLAS)?
My thinking so far (on coalesced memory access):
If I row-sum, then threads in the same block will read from adjacent memory locations at each iteration. On the other hand, if I column-sum, then each thread will iterate over its own contiguous block of memory. So if I have threads 1, 2 and 3 of the same block, their memory accesses will look like this (assuming column-major storage):
1 2 3 ... 1 2 3 ... 1 2 3 ... for row-sums
1 1 1 ... 2 2 2 ... 3 3 3 ... for column-sums
- But this doesn't tell me which pattern will be faster.
- It also doesn't take into account the behavior at block level (e.g. if the first block launched sums over rows 1-32, is the 2nd block launched guaranteed to sum over rows 33-64?)
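To make the two access patterns concrete, here is a sketch of both kernels (kernel names and the launch configuration are my own, not from the question). The first variant produces the `1 2 3 ... 1 2 3` pattern: at each iteration, consecutive threads of a warp read consecutive addresses, so the loads coalesce. The second produces `1 1 1 ... 2 2 2`: each thread walks contiguous memory, but threads within a warp are far apart at every iteration, so the loads do not coalesce.

```cuda
// Variant 1: row-sums of the m x n column-major matrix A.
// Thread `row` sums row `row`; at iteration `col`, threads row, row+1, ...
// read A[row + col*m], A[row+1 + col*m], ... -- adjacent addresses,
// so each warp's loads combine into few memory transactions.
__global__ void row_sums(const float *A, float *out, int m, int n)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= m) return;
    float sum = 0.0f;
    for (int col = 0; col < n; ++col)
        sum += A[row + (size_t)col * m];  // per-thread stride of m
    out[row] = sum;
}

// Variant 2: column-sums of the n x m column-major matrix A' (= At).
// Thread `i` sums column i of At, which is the contiguous block
// At[i*n .. i*n + n-1]. Contiguous per thread, but at each iteration
// threads in a warp are n elements apart, so loads do NOT coalesce.
__global__ void col_sums_transpose(const float *At, float *out, int m, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= m) return;
    float sum = 0.0f;
    for (int row = 0; row < n; ++row)
        sum += At[row + (size_t)i * n];  // per-thread stride of 1
    out[i] = sum;
}
```

Both would be launched with something like `row_sums<<<(m + 255) / 256, 256>>>(dA, dOut, m, n);` so that one thread handles one output element.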