-------------------------------------------------------------------------------- NVIDIA CUDA Programming Guide Revision History -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- Version 2.1 -------------------------------------------------------------------------------- - Updated Chapter 1 and 2 - Section 4.2.2.4 - Clarified that dynamically allocated __shared__ variables is the only case where a __shared__ variable can be defined as external using the extern keyword - Section 4.5.2.3 - Added cudaMalloc3D() and cudaMalloc3DArray() - Section 4.5.3.6 - Mentioned cuArray3DCreate() - Section 4.2.5 - Added quick description of -arch compiler option - Appendix B - Mentioned that double-precision floating-point functions are overloaded to fall back to the single-precision version when taking float arguments - Section B.1 - Updated error for some functions of Table B.1 - Section B.2 - Updated error for some functions of Table B.2 -------------------------------------------------------------------------------- Version 2.1 Beta -------------------------------------------------------------------------------- - Section 4.2.3 - Dg.z must be equal to 1 - Sections 4.2.5, 4.5.3.4 - PTX code can now be compiled through the driver API - Sections 4.5.1.4, 4.5.2.8, 4.5.3.11 - Updated with Direc3D 10 interoperability - Section 4.5.2.2 - Any subsequent explicit call to cudaSetDevice() will now fail - Section 4.5.2.7 - cudaGLSetGLDevice() must be called for proper OpenGL interoperability - Section 4.5.3.10 - Context must be created with cuGLCtxCreate() for OpenGL interoperability - Section 4.6 - Mode switches cause runtime calls to fail - Section A.1 - Updated with latest GPUs -------------------------------------------------------------------------------- Version 2.0 -------------------------------------------------------------------------------- - Rewrote Chapter 1, 2, and 3 - Section 4.3.3 - Clarified that the counter returned by clock() is per multiprocessor - Section 4.5.2.9 - Specified that warp size is equal to one in device emulation mode - Section 4.5.1.5 - Modified paragraph on operations that break concurrent execution of streams - Section 4.5.3.6 - Added a code sample that copies host memory to constant memory - Section 5.1.1.1 - Mentioned that double variables are 64-bit even for devices that without native double-precision support - Section 5.1.2.5 - Mentioned that arrays of double in shared memory cause two-way bank conflicts - Section 5.2 rewritten - Section A.1 - Added more GPUs -------------------------------------------------------------------------------- Version 2.0 Beta -------------------------------------------------------------------------------- - Moved content of Section 4.4.3 and 4.4.4 to appendix B - New Section 4.2.4.5 to document the new warpSize built-in variable - Sections 4.3.4, Section 4.4.3.2, Section 4.5.2.6, Section A.1.1, and Appendix D updated with 3D texture support - Section 4.3.1.1 - Added double2 - Section 4.3.4.1 - Textures only support single-precision floating-point - Section 4.4.4 - Modified to take in account the new atomic functions for 64-bit integers - New Section 4.4.5 on vote functions - Section 4.5.1.5 - Mentioned how to check for concurrent memcpy/execution using the runtime API - Section 4.5.1.4, 4.5.2.8, and 4.5.3.11 - Updated to the new Direct3D interoperability API - Section 4.5.1.5 - Added information on interaction between streamed and non-streamed tasks/operations - Added information on event relationship with stream 0 - Added description of cudaStreamDestroy behavior - Section 4.5.2.4 - Added a sentence on cudaStreamSynchronize() use - Section 4.5.2.5 - Added explicit event destruction - Section 4.5.3.3 - Contexts can be attached to and detached from any host threads - Section 4.5.3.7 - Added a description of cuStreamSynchronize() - Section 4.5.3.8 - Added explicit event destruction - Included a reminder about events in stream 0 - Section 5.1.1.1 - Mentioned that trigonometric functions use local memory in some cases - Section 5.1.1.3 - Mentioned local memory - Section 5.1.2.1 - Updated with coalescing rules for devices of compute capability 1.2 and higher - New Section 5.1.2.2 on local memory - Section 5.2 - Mentioned double and long long - Section A.1 - Added specifications for each compute capability - Section A.2 - Added the deviation from IEEE for double-precision floating-point numbers - Section B.1.1 - Mentioned new functions __fadd_rn() and __fmul_rn() - Smaller error for powf() - New Sections B.1.2 and B.2.2 on double-precision floating-point functions - Section B.2.1 - Added __fadd_rn() and __fmul_rn() - Section B.2.3 - New functions __popc() and __popcll() - Appendix C - 32-bit shared memory atomic operations, as well as 64-bit global memory atomic operations are supported for compute capabilities 1.2 and higher - Removed Appendices D and E since this information is now part of the reference manual -------------------------------------------------------------------------------- Version 1.1 -------------------------------------------------------------------------------- - Section 3.2 - Partially rewritten for more clarity - Section 4.2.2.3 - Volatile can be used to prevent shared memory reads/writes compiler optimization - Section 4.2.2.4 - Mentioned that the extern keyword cannot be used with __device__, __shared__ and __constant__ variables - Clarified that __constant__ variables can be assigned from host code through host runtime functions - Clarified use of address of __device__, __shared__ and __constant__ variables - Mentioned that ptxas verbose mode shows local memory use - Section 4.2.3 - Mentioned that execution configuration is passed via shared memory - Mentioned that execution configuration is subject to device limitations - Section 4.2.5 - New compiler directives: __noinline__ and #pragma unroll - Section 4.3.4.1 - Fixed description of cudaReadModeNormalizedFloat (signed integers map to [-1,1] - Section 4.5.1.2 - Moved page-locked memory allocation to this section - Clarified when higher bandwidth is achieved with page-locked memory - Section 4.5.1.4 - Mentioned new functions cudaD3D9GetDevice() and cuD3D9GetDevice() - Section 4.5.1.5 - In 1.0, only memset, intra-device memcpy, and kernel launches were asynchronous; in 1.1, the family of *Async memcpy functions enables asynchronous host <-> device memory copies too - Some devices can also perform host <-> device memory copies concurrently with kernel execution; this is enabled through the "stream" abstraction - The "event" abstraction enables applications to closely monitor the device’s progress and perform accurate timing - Section 4.5.2.3 - Fixed: cudaGetSymbolAddress() does not work on constant variables - Section 5.1.1.1 - Clarified why square root is implemented as it is - Section 5.1.2.1 - Mentioned performance differences between 32-bit, 64-bit, 128-bit, coalesced or non-coalesced global memory accesses - Illustrated memory coalescing with figures - Section 5.1.2.3 - Clarified that texture cache is designed for streaming fetches - Section 5.2 - Mentioned that ptxas verbose mode reports number of registers - Section 5.3 - Mentioned page-locked memory to increase bandwidth - Section 5.4 - Clarified texture cache coherency with respect to global memory writes - New Section 5.5 about overall performance optimization strategies - Appendix A - Added more devices supporting CUDA - Section A.1 - Mentioned the cache working set for texture memory - Section B.1 - Updated maximum ulp error for expm1f, sinf, cosf, tanf, sincosf, powf - Added llrintf, llroundf - Section B.1 - Mentioned integer min() and max() - Added __[u]sad, __clz[ll], _ffs[ll], __[u]mul64hi - Section C.1.8 - Removed "float atomicCAS(float*, float, float)" - New Section D.3 about stream management - New Section D.4 about event management - Section D.5 - Added asynchronous versions of memory copy functions - Section D.5.5 - cudaFreeArray(0) is a no-op - Sections D.5.18, D.5.19, D.5.20 - Fixed missing reference to constant memory in function descriptions - Sections D.9.3, D.9.4, D.9.6 - Fixed typos in function descriptions - Section E.2 - New cuDeviceGetAttribute() function - New Section E.5 about stream management - New Section E.6 about event management - Section E.7.9 - New cuLaunchGridAsync() function - Section E.8 - Added asynchronous versions of memory copy functions - Section E.8.4 - Fixed typos (cuMalloc() and cuMallocPitch()) -------------------------------------------------------------------------------- Version 1.0 -------------------------------------------------------------------------------- - Added references to Tesla - Section 4.2.2.4 - __device__ only allowed at file scope - Section 4.3.4.1 - Clarified that 3-component texture format are unsupported - Section 4.5 - Fixed typos in code samples - Section 4.5.1.5 - Functions that free memory are synchronous - Section 4.5.2.3 - Mentioned cudaMallocHost() and cudaFreeHost() - Section 4.5.2.7 - Mentioned that all code must be compiled either in device emulation mode or in device execution mode - Section 4.5.3.7 - Clarified code sample - Section 5.1.1.1 - Warned about [u]mul24 being slower that 32-bit integer multiply on future devices - Mentioned that double type gets demoted to float type on devices that do not support double precision - Section B.1 - Modified to clarify that error bounds only apply to the single-precision version of each function - Section B.2 - 8 most significant bits are ignored for [u]mul24 - Added some description of __saturate() - Section C - Fixed typos in function prototypes - Section D - Fixed typos in function prototypes - Section D.1.2 - clockRate in kHz - Section D.2.1 - cudaThreadSynchronize() returns an error if one of the preceding tasks failed - Section D.3.4 - Removed one-dimensional array if height is zero - New Sections D.3.6 and D.3.7 about cudaMallocHost() and cudaFreeHost() - Sections D.3.18 and D.3.19 - Added missing optional copy direction - Section D.3.21 - cudaGetSymbolSize also allows symbol to be in constant memory - New Section D.4.1.6 about cudaGetTextureAlignmentOffset() - New Section D.4.2.1 about cudaCreateChannelDesc() - Section E - Fixed typos in function prototypes - Section E.2.6 - clockRate in kHz - Section E.3.5 - cuCtxSynchronize() returns an error if one of the preceding tasks failed - Section E.4.3 - Documented cuModuleLoadFatBinary() - Section E.5.7 - Clarified that texunit must be CU_PARAM_TR_DEFAULT - New Section E.6.1 about cuMemGetInfo() - New Section E.6.12 about cuMemset2D() - New Appendix F on texture fetching -------------------------------------------------------------------------------- Version 0.9 -------------------------------------------------------------------------------- - Chapter 5 moved to Appendix A - New Section 4.4.6 and new Appendix B on atomic functions - Former Chapter 6 is now Chapter 5 - Former Appendices A, B and C are now appendices C, D and E, respectively - Updated Section 4.5.3 and Appendix E to reflect the new driver API naming conventions: - cudaMemAlloc2D renamed to cudaMemAllocPitch - cuMemAlloc2D renamed to cuMemAllocPitch - cuMemAllocSystem renamed to cuMemAllocHost - cuMemFreeSystem renamed to cuMemFreeHost - cuMemcpyStoD renamed to cuMemcpyHtoD - cuMemcpyDtoS renamed to cuMemcpyDtoH - cuMemcpyStoA renamed to cuMemcpyHtoA - cuMemcpyAtoS renamed to cuMemcpyAtoH - CU_MEMORYTYPE_SYSTEM renamed to CU_MEMORYTYPE_HOST - NumPackedComponents renamed to NumChannels (in CUDA_ARRAY_DESCRIPTOR) - New CUcontext handle in the driver API (cuCtxCreate(), cuCtxAttach(), and cuCtxDetach() modified accordingly) - New argument to cuInit() - Section 3.2 - Atomic writes to same address are serialized - New Section 3.3 about compute capability - New Section 3.4 about multiple devices - New Section 3.5 about mode switches - Section 4.2.1.4 - Kernel invocations are now asynchronous - Section 4.2.2.3 - Clarified shared memory semantics - Section 4.3.4 - Restructured for more clarity - Section 4.4.3 - New type conversion functions - Section 4.4.5 - Texture fetch functions have changed names - Section 4.5.1.1 - Clarified that several host threads can execute device code on the same device - Clarified that host threads cannot share CUDA resources - Section 4.5.1.4 - D3DCREATE_HARDWARE_VERTEXPROCESSING required - Mapping of more than one vertex buffers simultaneously is now supported - New Section 4.5.1.5 about API asynchronicity - Section 4.5.2 - Fixed typos in sample codes - Section 4.5.2.4 - Clarified that texture format mush match texture reference declaration - Section 4.5.3.7 - Clarified that texture format mush match texture reference declaration - Section 5.1.2.1 - Added the common case of accessing an array of structures - Section 5.1.2.5 and Section 5.2 - A multiple of 64 for the number of threads per block is better for optimal register usage - Appendix A - Removed the table since these characteristics can now be queried with the runtime - Added limits on block dimensions - Added maximum width for 1D CUDA arrays - Added maximum kernel size - Appendix B: - Updated error bounds for some math functions - Additional math functions - Section D.1.2 - Additional properties in cudaDeviceProp - New Section D.2 about thread management - New cudaThreadSynchronize() function - New cudaThreadExit() function - Section D.3.4 - 1D CUDA arrays are created by specifying height=0 - Section D.4.1.4 - New definition for cudaBindTexture() - Section D.4.2.1 - New definition for cudaBindTexture() - cudaBindTexture() for CUDA arrays renamed cudaBindTextureToArray() - Section E.2.6 - New cuDeviceGetProperties() function - Section E.3 - cuCtxGetDevice() is now documented - New cuCtxSynchronize() function - Section E.5.7 - Fixed name and description of function (cuParamSetTexRef()) - Section E.6.7 - 1D CUDA arrays are created by specifying height=0 - Section E.7.5 - New definition for cuTexRefSetAddress() -------------------------------------------------------------------------------- Version 0.8.2 -------------------------------------------------------------------------------- - Section 3.2 - Clarified the sentence about writes to the same location in memory - Section 4.3.4: - Specified that reading and writing from same texture produces undefined results - Section 5.1: - Relaxed restriction on multi-device support - Section 5.2: - More details on treatment of denormalized numbers - Section C.7.7 and C.7.8: - cuTexRefSetAddressMode() and cuTexRefSetFilterMode() have no effect for texture references bound to linear memory -------------------------------------------------------------------------------- Version 0.8.1 -------------------------------------------------------------------------------- - Fixed typos and formatting throughout the document - Mention Quadro FX 5600/4600 - Refer to cudaArray as "CUDA array" to avoid confusion with regular C arrays - Clocks now refer to processor's clock - Section 3.2 - Clarified how a block is split into warps - Clafified register allocation - Moved number of registers per thread to Section 6.2 - Section 4.2.1.4: - Specified that device functions are inlined - Section 4.2.2.3: - Rewrote code sample - Section 4.2.2.4: - Updated pointer support - Section 4.2.3: - Clarified that dynamic shared memory is per block - Section 4.3.2: - Clarified when C runtime implementation is used - Section 4.3.4: - Moved part of Section 4.5.1.3 to this section - Clarified that cudaReadModeNormalizedFloat is only supported for 16-bit and 8-bit integer format - Clarified that addressing modes are not supported for textures in linear memory - Section 4.4.4: - Clarified usage of texfetch() - Section 4.5.1.1: - Moved execution on multiple devices to this section - Section 4.5.1.3: - Removed: API considerations went to Section 4.3.4 and performance considerations went to Section 6.1.2.3 - Section 4.5.2: - Moved function descriptions to reference appendix B - Added more code samples - Section 4.5.3: - Moved function descriptions to reference appendix C - Added more code samples - Section 4.5.3.5: - Added description of page-locked host memory allocation - Section 5.1: - Added maximum texture dimensions - Added number of registers - Added various limits related to concurrent threads - Removed bus bandwidth since there is a test application in SDK 0.8.1 - Specified that multi-device is for same-type GPUs only - Based explanation on processor clock - Section 6.1.1.2: - Mentioned that threads reconverge after a control flow instruction - Section 6.1.2.1: - Coalescing constraints explained in terms of half-warps with recommendation to fulfill them for the whole warp - Section 6.1.2.2: - Read cost expressed in terms of half-warps with ecommendation to fulfill condition for the whole warp - Section 6.1.2.3: - Moved performance consideration of Section 4.5.1.3 to this section - Section 6.1.2.4: - Moved some figures around - Section 6.2: - Moved number of registers per thread from Section 3.2 to this section - Added definition of multiprocessor occupancy - Section 7.1: - Mentioned that the example has not been written for performance -------------------------------------------------------------------------------- Version 0.8 -------------------------------------------------------------------------------- - Section 3.2 - Specified that write order is undefined for threads writing to same location - Chapter 4: - Documented texture support - Documented OpenGL and Direct3D interoperability - Documented driver API - Section 4.2.2 - Removed __local__ - Section 4.2.2.3 - Added an example of managing layout of dynamic shared memory - Section 4.3.3 - Fixed function name - More detailed description - Section 4.4.5 - Removed __trap() - Section 4.5.2.7 - Fixed code sequence related to _controlfp - Section 5.1 - Specified size of constant and 1D-texture working sets - Section 5.2 - Precision on NaN propagation - Section 6.1.1.1 - Fixed throughput of sin, cos, and exponentiation - Specified througput of 32-bit integer multiplication - Mentioned __fdividef - Section 6.1.1.2 - Reworded the bit on branch predication - Section 6.1.2.1 - More details on alignment requirement - Mentioned __align() - Reworded memory coalescing - Mentioned the relevant 2D allocation routines - Section 6.1.2.5 - Mentioned number of threads require to ignore register RAW dependencies - Appendix A: - Updated error bounds for some math functions - Additional math functions -------------------------------------------------------------------------------- Alpha2 Version of November 21, 2006 -------------------------------------------------------------------------------- - Chapter 1: - Mostly re-written with additional diagrams - Chapter 2: - Moved memory model from Chapter 3 to Chapter 2 - Section 3.2: - Rephrased after moving memory model to Chapter 2 - Section 4.2.2: - Clarified the __device__ type qualifier - Section 4.4.2: - Removed the memory mapping functions since they have been removed from Beta - Section 4.5: - Mentioned the type casting functions - Section 4.5.2: - Elaborated more on when using __syncthreads - Section 4.6.2: - Mentioned the __DEVICE_EMULATION__ preprocessor macro defined in device emulation mode - Added denormalized numbers as an important potential difference between device emulation and device execution modes - Chapter 5: - More technical specifications - Section 6.1.1.1: - Mentioned performance of bitwise operations, integer multiplication, modulo, and divide - Mentioned __mul24 - Advised on the handling of single vs double precision functions and constants - Section 6.1.1.2: - More details on when the compiler chooses to predicate - Section 6.1.2.1: - Clarified memory coalescing for local memory - More details on alignment constrainst for global memory coalescing - Section 6.1.2.4: - More graphics to illustrate bank conflicts - Fixed wrong statement about bank conflicts when accessing structure members - Chapter 7: - Now using 2D arrays for As and Bs - Fixed wrong statement about bank conflict - Appendix A: - Updated error bounds for some math functions - Additional math functions - Recommendation of using rintf instead of roundf -------------------------------------------------------------------------------- Alpha2 Version of September 28, 2006 -------------------------------------------------------------------------------- First version