I am trying to get 128-bit global store instructions (gst_inst_128bit). For the same program, nvvp reports a large number of gst_inst_128bit instructions executed, while the Nsight profiler reports gst_inst_32bit instructions instead, four times as many. Both tools are profiling the same program. How can this happen?
The experiment was run on Linux with CUDA 5.0 on a GTX 580. The program only copies data from one array to another inside the kernel function. In main:
cudaMalloc((void**)&dev_a, NUM * sizeof(float));
cudaMalloc((void**)&dev_b, NUM * sizeof(float));
kernel<<<grid,block>>>((uint4 *)dev_a, (uint4 *)dev_b);
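For reference, the casts above reinterpret each buffer of NUM floats as 16-byte uint4 elements, so there are NUM/4 of them. Below is a minimal host-side sketch of that size arithmetic. `Uint4Like`, `uint4_count`, and `divides_into_uint4` are hypothetical names used only for illustration, and the struct is a stand-in for CUDA's built-in `uint4`, not code from my program.

```cpp
#include <cstddef>

// Host-side stand-in for CUDA's uint4: four 32-bit words, 16-byte aligned.
// Hypothetical name; a .cu file would use the built-in uint4 instead.
struct alignas(16) Uint4Like { unsigned int x, y, z, w; };

static_assert(sizeof(Uint4Like) == 16, "a 128-bit element is 16 bytes");
static_assert(sizeof(float) == 4, "assumes 32-bit float");

// Number of whole uint4 elements in a buffer of num_floats floats.
constexpr std::size_t uint4_count(std::size_t num_floats) {
    return num_floats * sizeof(float) / sizeof(Uint4Like);
}

// True when the float buffer splits into uint4 elements with no leftover bytes.
constexpr bool divides_into_uint4(std::size_t num_floats) {
    return (num_floats * sizeof(float)) % sizeof(Uint4Like) == 0;
}
```

With NUM a multiple of 4, the kernel has to touch exactly `uint4_count(NUM)` elements of each array to copy everything.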
The kernel:
__global__ void kernel(uint4 *a, uint4 *b){
    unsigned int id = blockIdx.x * THREAD_NUM + threadIdx.x;
    for(unsigned int i = 0; i < LOOP/4; i++){
        b[id + i * GRID_NUM * THREAD_NUM] = a[id + i * GRID_NUM * THREAD_NUM];