Device memory allocation and deallocation via cudaMalloc() and cudaFree() (or their Driver API equivalents) are expensive operations. Applications should therefore reuse or sub-allocate device memory wherever possible, rather than allocating and freeing it repeatedly, to minimize the impact of allocations on overall performance.
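One possible way to sub-allocate is sketched below: a minimal bump (arena) allocator that hands out byte offsets into a single large buffer. The class name DeviceArena, its methods, and the constant kNoSpace are hypothetical and not part of the CUDA API. The sketch assumes the application obtains the backing buffer once with cudaMalloc() and releases it once with cudaFree(); the 256-byte default alignment is likewise an assumption, and should be adjusted to the application's needs. The code tracks offsets only, so it compiles as plain host C++.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sentinel returned when the arena cannot satisfy a request.
static const std::size_t kNoSpace = static_cast<std::size_t>(-1);

// Hypothetical bump (arena) sub-allocator. It hands out byte offsets into one
// large buffer that the application is assumed to allocate once (for example
// with a single cudaMalloc() at startup) and free once at shutdown.
class DeviceArena {
public:
    explicit DeviceArena(std::size_t capacity_bytes)
        : capacity_(capacity_bytes), used_(0) {}

    // Reserve `bytes` starting at an offset aligned to `alignment`, which must
    // be a power of two. Returns the offset from the base pointer, or kNoSpace
    // if the request does not fit in the remaining capacity.
    std::size_t allocate(std::size_t bytes, std::size_t alignment = 256) {
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
        const std::size_t start = (used_ + alignment - 1) & ~(alignment - 1);
        if (start > capacity_ || bytes > capacity_ - start) {
            return kNoSpace;
        }
        used_ = start + bytes;
        return start;
    }

    // Make the whole buffer available again without calling into the driver.
    void reset() { used_ = 0; }

    // Bytes consumed so far, including alignment padding.
    std::size_t used() const { return used_; }

private:
    std::size_t capacity_;
    std::size_t used_;
};
```

In use, the application would add a returned offset to the base pointer it got from its one-time allocation, for example static_cast<char*>(base) + offset, and call reset() between iterations or frames instead of freeing and reallocating. A bump allocator cannot free individual blocks; workloads that need that would require a free-list or pooled design instead.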