I am writing a massively parallel GPU application using CUDA and have been optimizing it by hand. I got a 20% performance increase from using __fdividef(x, y), and according to the CUDA C Programming Guide (section C.2.1), using similar functions for multiplication and addition is also beneficial.
The multiplication function is listed as __fmul_[rn,rz,ru,rd](x,y). By contrast, __fdividef(x,y) is listed without anything in square brackets. What do those brackets mean?
If I run the simple code:
int t = __fmul_(5,4);
I get a compiler error saying that __fmul_ is undefined. I have the CUDA runtime included, so I don't think it is a setup problem; rather, it seems to have something to do with those square brackets. How do I use this function correctly? Thank you.
EDIT: To clarify, the compiler is NVCC, the CUDA compiler.
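For reference, here is a minimal sketch of the reading I would like confirmed. It assumes the bracketed list is shorthand for alternative rounding-mode suffixes, so that there are four separate intrinsics (__fmul_rn, __fmul_rz, __fmul_ru, __fmul_rd) rather than one function named __fmul_. It also assumes these intrinsics are device-only and operate on float, so they are called from a kernel with float arguments instead of from host code with ints. The kernel name intrinsicDemo and the single-thread launch are placeholders of my own, not something taken from the guide.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, for illustration only: calls each suffixed variant once.
__global__ void intrinsicDemo(float x, float y, float *out)
{
    out[0] = __fmul_rn(x, y);   // multiply, round to nearest even
    out[1] = __fmul_rz(x, y);   // multiply, round toward zero
    out[2] = __fmul_ru(x, y);   // multiply, round toward +infinity
    out[3] = __fmul_rd(x, y);   // multiply, round toward -infinity
    out[4] = __fdividef(x, y);  // fast approximate division (no suffix)
}

int main()
{
    float h_out[5] = {0.0f};
    float *d_out = nullptr;

    cudaMalloc(&d_out, sizeof(h_out));

    // One block, one thread: enough to check that the names resolve.
    intrinsicDemo<<<1, 1>>>(5.0f, 4.0f, d_out);

    // cudaMemcpy waits for the kernel to finish before copying back.
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    for (int i = 0; i < 5; ++i)
        printf("out[%d] = %f\n", i, h_out[i]);

    return 0;
}
```

If that reading is right, this should build with nvcc as a .cu file, whereas the one-line version above fails because no function is literally named __fmul_.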