Coalesced Access to Global Memory

Note: High Priority: Ensure global memory accesses are coalesced whenever possible.

Perhaps the single most important performance consideration in programming for the CUDA architecture is coalescing global memory accesses. Global memory loads and stores by threads of a half warp (for devices of compute capability 1.x) or of a warp (for devices of compute capability 2.x) are coalesced by the device into as few as one transaction when certain access requirements are met.

To understand these access requirements, global memory should be viewed in terms of aligned segments of 16 and 32 words (64 and 128 bytes, respectively, for 32-bit words). Figure 1 helps explain the coalescing of 32-bit words (such as floats) by a half warp. It shows global memory as rows of 64-byte aligned segments (16 floats each). Two rows of the same color represent a 128-byte aligned segment. The half warp of threads that accesses global memory is indicated at the bottom of the figure. Note that this figure assumes a device of compute capability 1.x.

Figure 1. Linear Memory Segments and Threads in a Half Warp
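
The segment view described above can be made concrete with a small host-side model. The sketch below is illustrative only: count_segments is a hypothetical helper, not part of the CUDA API, and it assumes that every thread reads one naturally aligned word whose size divides the segment size. It counts how many distinct aligned segments a group of threads touches. It does not reproduce the exact transaction rules of any particular compute capability.

```cpp
#include <cstddef>
#include <set>

// Hypothetical helper (not a CUDA API): counts the distinct aligned segments
// of `segment_bytes` touched when each of `num_threads` threads reads one
// word of `word_bytes`. Thread k reads the word at index
// `offset_words + k * stride_words`, measured from a segment-aligned base.
// Assumes words are naturally aligned and word_bytes divides segment_bytes,
// so no single word straddles two segments.
std::size_t count_segments(std::size_t num_threads, std::size_t word_bytes,
                           std::size_t offset_words, std::size_t stride_words,
                           std::size_t segment_bytes) {
    std::set<std::size_t> segments;
    for (std::size_t k = 0; k < num_threads; ++k) {
        // Byte address of the word read by thread k, relative to the base.
        std::size_t byte_addr = (offset_words + k * stride_words) * word_bytes;
        // Index of the aligned segment that contains this address.
        segments.insert(byte_addr / segment_bytes);
    }
    return segments.size();
}
```

For a half warp of 16 threads reading consecutive floats from a 64-byte aligned base, count_segments(16, 4, 0, 1, 64) yields 1: all accesses fall in one row of the figure. Shifting the start by one float, as in count_segments(16, 4, 1, 1, 64), yields 2, because the last thread spills into the next 64-byte row, while the same shifted pattern still fits in a single 128-byte segment.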

The access requirements for coalescing depend on the compute capability of the device:

These concepts are illustrated in the following simple examples.