Note: High Priority: Ensure global memory accesses are coalesced whenever possible.
Perhaps the single most important performance consideration in programming for the CUDA architecture is coalescing global memory accesses. Global memory loads and stores by threads of a half warp (for devices of compute capability 1.x) or of a warp (for devices of compute capability 2.x) are coalesced by the device into as few as one transaction when certain access requirements are met.
To understand these access requirements, global memory should be viewed in terms of aligned segments of 16 and 32 words. Figure 1 helps explain the coalescing of 32-bit words (such as floats) by a half warp. It shows global memory as rows of 64-byte aligned segments (16 floats). Two rows of the same color represent a 128-byte aligned segment. A half warp of threads that accesses the global memory is indicated at the bottom of the figure. Note that this figure assumes a device of compute capability 1.x.
Figure 1. Linear Memory Segments and Threads in a Half Warp
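The segment view described above can be made concrete with a little address arithmetic. The following is a minimal host-side sketch, not device code; the helper names `segmentBase` and `wordAddress` are hypothetical and introduced only for illustration, and the array is assumed to consist of 32-bit words.

```cpp
#include <cstdint>

// Hypothetical helper: start address of the aligned segment of
// segmentBytes bytes that contains byteAddr.
// segmentBytes must be a power of two (64 or 128 in this section).
constexpr std::uint64_t segmentBase(std::uint64_t byteAddr,
                                    std::uint64_t segmentBytes) {
    return byteAddr & ~(segmentBytes - 1);
}

// Hypothetical helper: byte address of the i-th 32-bit word (e.g. a float)
// in an array that starts at byte address arrayBase.
constexpr std::uint64_t wordAddress(std::uint64_t arrayBase, std::uint64_t i) {
    return arrayBase + 4 * i;
}
```

For an array that starts at address 0, float 20 sits at byte 80, so `segmentBase(wordAddress(0, 20), 64)` is 64 (the second 64-byte row), while `segmentBase(wordAddress(0, 20), 128)` is 0 (the first 128-byte segment, which spans two rows of the figure).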
The access requirements for coalescing depend on the compute capability of the device:
- On devices of compute capability 1.0 or 1.1, the k-th thread in a half warp must access the k-th word in a segment aligned to 16 times the size of the elements being accessed; however, not all threads need to participate.
- On devices of compute capability 1.2 or 1.3, coalescing is achieved for any pattern of accesses that fits into a segment size of 32 bytes for 8-bit words, 64 bytes for 16-bit words, or 128 bytes for 32- and 64-bit words. Smaller transactions may be issued to avoid wasting bandwidth. More precisely, the following protocol is used to issue a memory transaction for a half warp:
  - Find the memory segment that contains the address requested by the lowest numbered active thread. Segment size is 32 bytes for 8-bit data, 64 bytes for 16-bit data, and 128 bytes for 32-, 64-, and 128-bit data.
  - Find all other active threads whose requested address lies in the same segment, and reduce the transaction size if possible:
    - If the transaction is 128 bytes and only the lower or upper half is used, reduce the transaction size to 64 bytes.
    - If the transaction is 64 bytes and only the lower or upper half is used, reduce the transaction size to 32 bytes.
  - Carry out the transaction and mark the serviced threads as inactive.
  - Repeat until all threads in the half warp are serviced.
- On devices of compute capability 2.x, memory accesses by the threads of a warp are coalesced into the minimum number of L1-cache-line-sized aligned transactions necessary to satisfy all threads; see Section F.4.2 of the CUDA C Programming Guide.
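The protocol listed for compute capability 1.2 and 1.3 can be traced on the host to see which transactions a given access pattern would produce. The following is a rough sketch of those steps, not a description of the actual hardware implementation. The names `Transaction`, `initialSegmentSize`, and `simulateHalfWarp` are hypothetical, and the model assumes every address is aligned to the word size so that no word straddles a segment boundary.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One modeled memory transaction: aligned start address and size in bytes.
struct Transaction {
    std::uint64_t base;
    std::uint32_t size;
};

// Initial segment size for a given word size, following the protocol text.
inline std::uint32_t initialSegmentSize(std::uint32_t wordBytes) {
    if (wordBytes == 1) return 32;
    if (wordBytes == 2) return 64;
    return 128;  // 32-, 64-, and 128-bit data
}

// Host-side model of the half-warp protocol (compute capability 1.2/1.3).
// addrs[k] is the byte address requested by the k-th active thread.
// Assumption: each address is a multiple of wordBytes.
inline std::vector<Transaction> simulateHalfWarp(
        const std::vector<std::uint64_t>& addrs, std::uint32_t wordBytes) {
    std::vector<Transaction> out;
    std::vector<bool> served(addrs.size(), false);

    for (std::size_t i = 0; i < addrs.size(); ++i) {
        if (served[i]) continue;  // i is the lowest numbered active thread

        // Step 1: segment containing that thread's address.
        std::uint32_t size = initialSegmentSize(wordBytes);
        std::uint64_t base = addrs[i] / size * size;

        // Step 2: gather all active threads in the same segment and track
        // the byte range [lo, hi) that they actually touch.
        std::uint64_t lo = addrs[i];
        std::uint64_t hi = addrs[i] + wordBytes;
        for (std::size_t j = i; j < addrs.size(); ++j) {
            if (served[j]) continue;
            if (addrs[j] >= base && addrs[j] < base + size) {
                served[j] = true;  // Step 3: mark as serviced
                if (addrs[j] < lo) lo = addrs[j];
                if (addrs[j] + wordBytes > hi) hi = addrs[j] + wordBytes;
            }
        }

        // Step 2 (continued): halve 128 -> 64 -> 32 bytes while only the
        // lower or only the upper half of the transaction is used.
        while (size > 32) {
            std::uint32_t half = size / 2;
            if (hi <= base + half) {
                size = half;               // only the lower half is used
            } else if (lo >= base + half) {
                base += half;              // only the upper half is used
                size = half;
            } else {
                break;                     // both halves are used
            }
        }
        out.push_back({base, size});
        // Step 4: the outer loop repeats until every thread is serviced.
    }
    return out;
}
```

Under this model, 16 threads reading consecutive floats from address 0 yield a single 64-byte transaction; the same pattern shifted by one float (starting at byte 4) yields a single 128-byte transaction, because both halves of the segment are touched; and a start address of 96 splits the half warp across two 128-byte segments, giving two 32-byte transactions.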
These concepts are illustrated in the following simple examples.