To hide latency arising from register dependencies, maintain sufficient numbers of active threads per multiprocessor (i.e., sufficient occupancy). (Registers and Hiding Register Dependencies)
The number of threads per block should be a multiple of 32 (the warp size), because this provides optimal computing efficiency and facilitates coalescing. (Thread and Block Heuristics)
Use the fast math library whenever speed trumps precision. (Math Libraries)
Prefer faster, more specialized math functions over slower, more general ones when possible. (Math Libraries)
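
As a rough illustration of the recommendations above, the following sketch picks a block size that is a multiple of 32, reports the resulting occupancy, and uses specialized and fast math functions in a kernel. The names mathDemo and roundUpToWarp are hypothetical and appear only for this example. The occupancy calls and the math functions are standard CUDA, but whether the fast intrinsics are accurate enough depends on the application.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: round a thread count up to the next multiple of the
// warp size (assumed to be 32 here).
__host__ __device__ inline int roundUpToWarp(int threads)
{
    const int warp = 32;
    return ((threads + warp - 1) / warp) * warp;
}

// Hypothetical kernel showing specialized and fast math functions.
__global__ void mathDemo(const float* x, float* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float a = exp2f(x[i]);   // specialized: preferred over powf(2.0f, x[i])
        float b = __sinf(x[i]);  // fast intrinsic: quicker but less accurate than sinf()
        y[i] = a + b;
    }
}

int main()
{
    const int n = 1 << 20;
    float *x = nullptr, *y = nullptr;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));

    // Ask the runtime for a block size that maximizes occupancy for this kernel.
    int minGridSize = 0, blockSize = 0;
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, mathDemo, 0, 0);
    // Defensive: keep the block size a multiple of the warp size.
    blockSize = roundUpToWarp(blockSize);
    int gridSize = (n + blockSize - 1) / blockSize;

    // Estimate occupancy: active threads per SM over the SM's maximum.
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, mathDemo, blockSize, 0);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    float occupancy = (float)(blocksPerSM * blockSize) / prop.maxThreadsPerMultiProcessor;
    printf("block size %d, grid size %d, occupancy %.2f\n", blockSize, gridSize, occupancy);

    mathDemo<<<gridSize, blockSize>>>(x, y, n);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        printf("CUDA error: %s\n", cudaGetErrorString(err));

    cudaFree(x);
    cudaFree(y);
    return err == cudaSuccess ? 0 : 1;
}
```

Compiling with nvcc's -use_fast_math option is another way to apply the fast math recommendation: it maps functions such as sinf() to their intrinsic counterparts across the whole compilation unit, so it trades precision everywhere rather than at selected call sites.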