Two types of runtime math operations are supported. They can be distinguished by their names: some have names with prepended underscores, whereas others do not (e.g., __functionName() versus functionName()). Functions following the __functionName() naming convention map directly to the hardware level. They are faster but provide somewhat lower accuracy (e.g., __sinf(x) and __expf(x)). Functions following functionName() naming convention are slower but have higher accuracy (e.g., sinf(x) and expf(x)). The throughput of __sinf(x), __cosf(x), and __expf(x) is much greater than that of sinf(x), cosf(x), tanf(x). The latter become even more expensive (about an order of magnitude slower) if the magnitude of the argument x needs to be reduced. Moreover, in such cases, the argument-reduction code uses local memory, which can affect performance even more because of the high latency of local memory. More details are available in the CUDA C Programming Guide.
Note also that whenever sine and cosine of the same argument are computed, the sincos… family of instructions should be used to optimize performance:
The –use_fast_math compiler option of nvcc coerces every functionName() call to the equivalent __functionName() call. This switch should be used whenever accuracy is a lesser priority than the performance. This is frequently the case with transcendental functions. Note this switch is effective only on single-precision floating point.
For small integer powers (e.g., x2 or x3), explicit multiplication is almost certainly faster than the use of general exponentiation routines such as pow(). While compiler optimization improvements continually seek to narrow this gap, explicit multiplication (or the use of an equivalent purpose-built inline function or macro) can have a significant advantage. This advantage is increased when several powers of the same base are needed (e.g., where both x2 and x5 are calculated in close proximity), as this aids the compiler in its common sub-expression elimination (CSE) optimization.
For exponentiation using base 2 or 10, use the functions exp2() or expf2() and exp10() or expf10() rather than the functions pow() or powf(). Both pow() and powf() are heavy-weight functions in terms of register pressure and instruction count due to the numerous special cases arising in general exponentiation and the difficulty of achieving good accuracy across the entire ranges of the base and the exponent. The functions exp2(), exp2f(), exp10(), and exp10f(), on the other hand, are similar to exp() and expf() in terms of performance, and can be as much as ten times faster than their pow()/powf() equivalents.
For exponentiation with an exponent of 1/3, use the cbrt() or cbrtf() function rather than the generic exponentiation functions pow() or powf(), as the former are significantly faster than the latter. Likewise, for exponentation with an exponent of -1/3, use rcbrt() or rcbrtf().
Replace sin(π*<expr>) with sinpi(<expr>) and cos(π*<expr>) with cospi(<expr>). This is advantageous with regard to both accuracy and performance. As a particular example, to evaluate the sine function in degrees instead of radians, use sinpi(x/180.0). Similarly, the single-precision functions sinpif() and cospif() should replace calls to sinf() and cosf() when the function argument is of the form π*<expr>. (The performance advantage sinpi() has over sin() is due to simplified argument reduction; the accuracy advantage is because sinpi() multiplies by π only implicitly, effectively using an infinitely precise mathematical π rather than a single- or double-precision approximation thereof.)