c - How do I use compiler intrinsic __fmul_?

Question

I am writing a massively parallel GPU application using CUDA. I have been optimizing it by hand. I got a 20% performance increase with __fdividef(x, y), and according to the CUDA C Programming Guide (section C.2.1), using similar functions for multiplication and addition is also beneficial.

The function is documented like this: __fmul_[rn,rz,ru,rd](x,y).

__fdividef(x,y) was not documented with anything in brackets. What do those brackets mean?

If I run the simple code:

int t = __fmul_(5,4);

I get a compiler error saying that __fmul_ is undefined. I have the CUDA runtime included, so I don't think it is a setup problem; rather it is something to do with those square brackets. How do I correctly use this function? Thank you.

EDIT: I should clarify, the compiler is NVCC, the CUDA compiler.

score 3 · Accepted Answer

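The brackets in the guide are shorthand: you pick one of the listed rounding-mode suffixes, rn (round to nearest even), rz (round toward zero), ru (round up) or rd (round down). There is no function named __fmul_; the available functions are __fmul_rn, __fmul_rz, __fmul_ru and __fmul_rd. They take float arguments and can only be called from device code.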
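A minimal sketch of calling one of these intrinsics, assuming a single-thread kernel launch and with error checking left out for brevity. The names mulKernel and multiplyOnDevice are made up for illustration. As far as I know from the programming guide, __fmul_rn is also never merged with a neighbouring add into a fused multiply-add, which is the usual reason to pick it over a plain `*`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Multiply two floats with an explicit rounding mode.
// __fmul_rn is a device-only intrinsic, so the call has to live in
// __global__ or __device__ code, not in host code.
__global__ void mulKernel(float a, float b, float *out)
{
    *out = __fmul_rn(a, b);  // round to nearest even
}

// Hypothetical host helper (illustrative name): runs the kernel once
// and copies the single result back. Error checking omitted.
float multiplyOnDevice(float a, float b)
{
    float *dOut = nullptr;
    float hOut = 0.0f;
    cudaMalloc(&dOut, sizeof(float));
    mulKernel<<<1, 1>>>(a, b, dOut);
    cudaMemcpy(&hOut, dOut, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    return hOut;
}

int main()
{
    // Note the float literals: the intrinsic takes float, not int.
    printf("%f\n", multiplyOnDevice(5.0f, 4.0f));
    return 0;
}
```

Compile with nvcc as usual. The line from the question fails because __fmul_ has no suffix; called from host code, even a correctly suffixed version would not compile.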

score 0 · Accepted Answer

The CUDA Programming Guide explains the suffixes:

_rd: round down (toward negative infinity).
_rn: round to nearest even.
_ru: round up (toward positive infinity).
_rz: round toward zero.

See CUDA's Single Precision Intrinsics documentation for details on these functions.
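To see the suffixes in action, here is a small sketch that computes the same product under all four rounding modes. The names roundingKernel and compareRoundingModes are hypothetical, and error checking is omitted. The inputs are chosen so that the exact product of 1.1f and 1.1f does not fit in a float, which means the round-up and round-down results should differ by one unit in the last place.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Store the product of a and b under each of the four rounding modes.
__global__ void roundingKernel(float a, float b, float *out)
{
    out[0] = __fmul_rn(a, b);  // round to nearest even
    out[1] = __fmul_rz(a, b);  // round toward zero
    out[2] = __fmul_ru(a, b);  // round up (toward +infinity)
    out[3] = __fmul_rd(a, b);  // round down (toward -infinity)
}

// Hypothetical host helper: copies the four results into hOut[0..3].
void compareRoundingModes(float a, float b, float hOut[4])
{
    float *dOut = nullptr;
    cudaMalloc(&dOut, 4 * sizeof(float));
    roundingKernel<<<1, 1>>>(a, b, dOut);
    cudaMemcpy(hOut, dOut, 4 * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dOut);
}

int main()
{
    float r[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    compareRoundingModes(1.1f, 1.1f, r);
    // %.9g prints enough digits to tell neighbouring floats apart.
    printf("rn=%.9g rz=%.9g ru=%.9g rd=%.9g\n", r[0], r[1], r[2], r[3]);
    return 0;
}
```

For a positive product like this one, rz and rd give the same value, and rn matches whichever of ru or rd is closer to the exact result.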
