The problem
During a CUDA C project, I came across unexpected behaviour regarding single-precision and double-precision floating-point operations. In the project, one kernel fills an array with numbers and a second kernel does some computation on those numbers. All variables and arrays are double precision, so I would not expect any single-precision floating-point operations to happen. However, when I profile the executable with nvprof, it reports that single-precision operations are executed. How is this possible?
Minimal, Complete, and Verifiable example
Here is the smallest program that reproduces this behaviour on my hardware (asserts and error checking have been left out). I use an NVIDIA Tesla K40 graphics card.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define Nx 10
#define Ny 10
#define RANDOM double(0.236954587566)

__global__ void test(double *array, size_t pitch){
    double rho, u;
    int x = threadIdx.x + blockDim.x*blockIdx.x;
    int y = threadIdx.y + blockDim.y*blockIdx.y;
    int idx = y*(pitch/sizeof(double)) + 2*x;
    if(x < Nx && y < Ny){
        rho = array[idx];
        u = array[idx+1]/rho;   // the double-precision division in question
        array[idx] = rho*u;
    }
}

__global__ void fill(double *array, size_t pitch){
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int idx = y*(pitch/sizeof(double)) + 2*x;
    if(x < Nx && y < Ny){       // && (not ||), otherwise out-of-domain threads write out of bounds
        array[idx] = RANDOM*idx;
        array[idx + 1] = idx*idx*RANDOM;
    }
}

int main(int argc, char* argv[]) {
    double *d_array;
    size_t pitch;
    // pitched allocation: each row holds 2*Nx doubles plus padding
    cudaMallocPitch((void **) &d_array, &pitch, 2*Nx*sizeof(double), Ny);
    dim3 threadDistribution = dim3(8, 8);
    dim3 blockDistribution = dim3((Nx + threadDistribution.x - 1) / threadDistribution.x,
                                  (Ny + threadDistribution.y - 1) / threadDistribution.y);
    fill <<< blockDistribution, threadDistribution >>> (d_array, pitch);
    cudaDeviceSynchronize();
    test <<< blockDistribution, threadDistribution >>> (d_array, pitch);
    cudaDeviceSynchronize();
    cudaFree(d_array);
    return 0;
}
The output of nvprof (edited for readability; if you need the full output, just ask in the comments):
....
Device "Tesla K40c (0)"
Kernel: test(double*, unsigned long)
Metric Name Min Max Avg
flop_count_sp 198 198 198
flop_count_sp_add 0 0 0
flop_count_sp_mul 0 0 0
flop_count_sp_fma 99 99 99
flop_count_sp_special 102 102 102
flop_count_dp 1214 1214 1214
flop_count_dp_add 0 0 0
flop_count_dp_mul 204 204 204
flop_count_dp_fma 505 505 505
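
For reference, I collect these metrics with a command along these lines (the executable name ./main is just what I called it locally):

    nvprof --metrics flop_count_sp,flop_count_sp_fma,flop_count_sp_special,flop_count_dp,flop_count_dp_mul,flop_count_dp_fma ./main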
What I've found so far
I found that if I remove the division in the test kernel:
u = array[idx+1]/rho;
==>
u = array[idx+1];
the output is as expected: zero single-precision operations and exactly 100 double-precision operations. Does anyone know why the division causes the program to use single-precision FLOPs and roughly ten times as many double-precision floating-point operations? I've also tried the __ddiv_rn intrinsic, but this didn't solve the problem.
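The intrinsic variant I tried was essentially the following one-line change (__ddiv_rn is CUDA's round-to-nearest double-precision division intrinsic):

    u = __ddiv_rn(array[idx+1], rho);   // explicit correctly rounded double division

and the single-precision counts did not go away.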
Many thanks in advance!
Edit - Working solution
Although I still haven't figured out why the single precision is used, I have found a 'solution' to this problem, thanks to @EOF. Replacing the division by a multiplication with the reciprocal of rho did the job:
u = array[idx+1]/rho;
==>
u = array[idx+1]*__drcp_rn(rho);
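
In context, the modified test kernel reads as follows (a minimal sketch; __drcp_rn is CUDA's round-to-nearest double-precision reciprocal intrinsic):

    __global__ void test(double *array, size_t pitch){
        double rho, u;
        int x = threadIdx.x + blockDim.x*blockIdx.x;
        int y = threadIdx.y + blockDim.y*blockIdx.y;
        int idx = y*(pitch/sizeof(double)) + 2*x;
        if(x < Nx && y < Ny){
            rho = array[idx];
            // reciprocal followed by multiply instead of the '/' operator
            u = array[idx+1]*__drcp_rn(rho);
            array[idx] = rho*u;
        }
    }

Note that multiplying by a rounded reciprocal is not guaranteed to give the same bits as a single correctly rounded division (the result can be off by one ulp), so this is a workaround rather than an exact replacement.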