I've created a simple kernel to test the coalesced memory access by observing the transaction counts, in nvidia gtx980 card. The kernel is,
__global__
void copy_coalesced(float * d_in, float * d_out)
{
int tid = threadIdx.x + blockIdx.x*blockDim.x;
d_out[tid] = d_in[tid];
}
When I run this with the following kernel configurations
#define BLOCKSIZE 32
int data_size = 10240; //always a multiply of the BLOCKSIZE
int gridSize = data_size / BLOCKSIZE;
copy_coalesced<<<gridSize, BLOCKSIZE>>>(d_in, d_out);
Since the the data access in the kernel is fully coalasced, and since the data type is float (4 bytes), The number of Load/Store Transactions expected can be found as following,
Load Transaction Size = 32 bytes
Number of floats that can be loaded per transaction = 32 bytes / 4 bytes = 8
Number of transactions needed to load 10240 of data = 10240/8 = 1280 transactions
The same amount of transactions are expected for writing the data as well.
But when observing the nvprof metrics, following was the results
gld_transactions 2560
gst_transactions 1280
gld_transactions_per_request 8.0
gst_transactions_per_request 4.0
I cannot figure out why it takes twice the transactions that it needs for loading the data. But when it comes to load/store efficiency both the metrics gives out 100%
What am I missing out here?