Because allocations made with cudaMalloc may be sub-allocated from a larger block of memory, CUDA IPC APIs will share the entire underlying memory block, which may cause other sub-allocations to be shared and can potentially lead to information disclosure between processes. To prevent this behavior, it is recommended to only share allocations with a 2MiB-aligned size. An example of using the IPC API is where a single primary process generates a batch of input data, making the data available to multiple secondary processes without requiring regeneration or copying.
All runtime functions return an error code, but for an asynchronous function (see Asynchronous Concurrent Execution) this error code cannot possibly report any of the asynchronous errors that could occur on the device, since the function returns before the device has completed the task; the error code only reports errors that occur on the host prior to executing the task, typically related to parameter validation. If an asynchronous error occurs, it will be reported by some subsequent unrelated runtime function call.
The only way to check for asynchronous errors just after some asynchronous function call is therefore to synchronize just after the call, by calling cudaDeviceSynchronize (or by using any other synchronization mechanism described in Asynchronous Concurrent Execution) and checking the error code returned by cudaDeviceSynchronize. The runtime maintains an error variable for each host thread that is initialized to cudaSuccess and is overwritten by the error code every time an error occurs (be it a parameter validation error or an asynchronous error).
Kernel launches do not return any error code, so cudaPeekAtLastError or cudaGetLastError must be called just after the kernel launch to retrieve any pre-launch errors. To ensure that any error returned by cudaPeekAtLastError or cudaGetLastError does not originate from calls prior to the kernel launch, one has to make sure that the runtime error variable is set to cudaSuccess just before the kernel launch, for example, by calling cudaGetLastError just before the kernel launch.
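A minimal sketch of this checking pattern (MyKernel and its launch configuration are illustrative only): the error variable is cleared before the launch, pre-launch errors are read back right after it, and a synchronization surfaces any asynchronous error.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void MyKernel(float* data) { data[threadIdx.x] *= 2.0f; }  // illustrative kernel

void launchAndCheck(float* d_data)
{
    cudaGetLastError();                       // reset the per-thread error variable to cudaSuccess
    MyKernel<<<1, 256>>>(d_data);             // the launch itself returns no error code
    cudaError_t err = cudaGetLastError();     // pre-launch errors (e.g., invalid configuration)
    if (err != cudaSuccess)
        printf("launch error: %s\n", cudaGetErrorString(err));
    err = cudaDeviceSynchronize();            // asynchronous errors reported once the device finishes
    if (err != cudaSuccess)
        printf("async error: %s\n", cudaGetErrorString(err));
}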
Kernel launches are asynchronous, so to check for asynchronous errors, the application must synchronize between the kernel launch and the call to cudaPeekAtLastError or cudaGetLastError, as done with cudaDeviceSynchronize in the sketch above.

Reading data from texture or surface memory instead of global memory can have several performance benefits as described in Device Memory Accesses.
Texture memory is read from kernels using the device functions described in Texture Functions. The process of reading a texture by calling one of these functions is called a texture fetch. Each texture fetch specifies a parameter called a texture object for the texture object API or a texture reference for the texture reference API. Textures can also be layered as described in Layered Textures. Cubemap Textures and Cubemap Layered Textures describe a special type of texture, the cubemap texture.
Texture Gather describes a special texture fetch, texture gather. A texture object is created using cudaCreateTextureObject from a resource description of type struct cudaResourceDesc, which specifies the texture, and from a texture description of type struct cudaTextureDesc. Some of the attributes of a texture reference are immutable and must be known at compile time; they are specified when declaring the texture reference.
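As a concrete illustration of the texture object API just described, here is a minimal sketch that fills in cudaResourceDesc and cudaTextureDesc for an existing 2D float CUDA array (cuArray, assumed to be created and populated elsewhere); the texture reference API is described next.

#include <cstring>
#include <cuda_runtime.h>

cudaTextureObject_t createTexObject(cudaArray_t cuArray)
{
    cudaResourceDesc resDesc;
    memset(&resDesc, 0, sizeof(resDesc));
    resDesc.resType = cudaResourceTypeArray;      // texture backed by a CUDA array
    resDesc.res.array.array = cuArray;

    cudaTextureDesc texDesc;
    memset(&texDesc, 0, sizeof(texDesc));
    texDesc.addressMode[0]   = cudaAddressModeClamp;
    texDesc.addressMode[1]   = cudaAddressModeClamp;
    texDesc.filterMode       = cudaFilterModeLinear;
    texDesc.readMode         = cudaReadModeElementType;
    texDesc.normalizedCoords = 1;                 // coordinates in [0, 1)

    cudaTextureObject_t texObj = 0;
    cudaCreateTextureObject(&texObj, &resDesc, &texDesc, NULL);
    return texObj;                                // later released with cudaDestroyTextureObject
}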
A texture reference is declared at file scope as a variable of the texture<DataType, Type, ReadMode> template type. A texture reference can only be declared as a static global variable and cannot be passed as an argument to a function.
The other attributes of a texture reference are mutable and can be changed at runtime through the host runtime. The texture type is defined in the high-level API as a structure publicly derived from the textureReference type defined in the low-level API. Once a texture reference has been unbound, it can be safely rebound to another array, even if kernels that use the previously bound texture have not completed.
It is recommended to allocate two-dimensional textures in linear memory using cudaMallocPitch and to use the pitch returned by cudaMallocPitch as the input parameter to cudaBindTexture2D. The sketch below binds a 2D texture reference to linear memory pointed to by devPtr. The format specified when binding a texture to a texture reference must match the parameters specified when declaring the texture reference; otherwise, the results of texture fetches are undefined.
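A sketch of that binding, assuming float elements and hypothetical width, height, and pitch values obtained from cudaMallocPitch; texRef and the kernel are illustrative.

#include <cuda_runtime.h>

texture<float, cudaTextureType2D, cudaReadModeElementType> texRef;   // file-scope texture reference

__global__ void readTexels(float* out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = tex2D(texRef, x, y);                    // fetch through the reference
}

void bind2DTexture(float* devPtr, size_t pitch, int width, int height)
{
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();     // must match the declaration
    size_t offset = 0;
    cudaBindTexture2D(&offset, texRef, devPtr, desc, width, height, pitch);
}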
There is a limit to the number of textures that can be bound to a kernel, as specified in the table of technical specifications. The 16-bit floating-point conversion functions are only supported in device code; equivalent functions for the host code can be found in the OpenEXR library, for example.
A one-dimensional or two-dimensional layered texture also known as texture array in Direct3D and array texture in OpenGL is a texture made up of a sequence of layers, all of which are regular textures of same dimensionality, size, and data type. A one-dimensional layered texture is addressed using an integer index and a floating-point texture coordinate; the index denotes a layer within the sequence and the coordinate addresses a texel within that layer.
A two-dimensional layered texture is addressed using an integer index and two floating-point texture coordinates; the index denotes a layer within the sequence and the coordinates address a texel within that layer.
Texture filtering (see Texture Fetching) is done only within a layer, not across layers. Layered textures are only supported on devices of compute capability 2.0 and higher. A cubemap texture is a special type of two-dimensional layered texture that has six layers representing the faces of a cube. Cubemap textures are fetched using the texCubemap device functions (texture object and texture reference variants). Cubemap textures are only supported on devices of compute capability 2.0 and higher.
A cubemap layered texture is a layered texture whose layers are cubemaps of the same dimension. A cubemap layered texture is addressed using an integer index and three floating-point texture coordinates; the index denotes a cubemap within the sequence and the coordinates address a texel within that cubemap. Cubemap layered textures are fetched using the texCubemapLayered device functions (texture object and texture reference variants).
Cubemap layered textures are only supported on devices of compute capability 2.0 and higher. Texture gather is a special texture fetch that is available for two-dimensional textures only. It is performed by the tex2Dgather function, which has the same parameters as tex2D, plus an additional comp parameter equal to 0, 1, 2, or 3 (see the tex2Dgather texture object and texture reference variants).
It returns four 32-bit numbers that correspond to the value of the component comp of each of the four texels that would have been used for bilinear filtering during a regular texture fetch.
For example, if these texels have values (…, 20, 31, …), (…, 25, 29, …), (…, 16, 37, …), (…, 22, 30, …) and comp is 2, tex2Dgather returns (31, 29, 37, …).

Note that texture coordinates are computed with only 8 bits of fractional precision. tex2Dgather can therefore return unexpected results when a texture coordinate falls close to a rounding boundary of this fixed-point representation: the value used for texel selection rounds up to the next texel, so a tex2Dgather in that case would return indices 2 and 3 in x instead of indices 1 and 2. Texture gather is only supported for CUDA arrays created with the cudaArrayTextureGather flag and of width and height less than the maximum specified in Table 15 for texture gather, which is smaller than for regular texture fetch.
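A small sketch of a gather fetch, under the assumption of the templated texture object overload of tex2Dgather and a 2D float texture created from an array with the cudaArrayTextureGather flag; names and coordinates are illustrative.

__global__ void gatherComponent2(cudaTextureObject_t texObj, float4* out, float x, float y)
{
    // Returns component 2 of the four texels bilinear filtering would have used at (x, y).
    *out = tex2Dgather<float4>(texObj, x, y, 2);
}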
For devices of compute capability 2.0 and higher, a CUDA array created with the cudaArraySurfaceLoadStore flag can be read and written via a surface object or surface reference using the surface functions. Table 15 lists the maximum surface width, height, and depth depending on the compute capability of the device. A surface object is created using cudaCreateSurfaceObject from a resource description of type struct cudaResourceDesc.
A surface reference can only be declared as a static global variable and cannot be passed as an argument to a function. A CUDA array must be read and written using surface functions of matching dimensionality and type and via a surface reference of matching dimensionality; otherwise, the results of reading and writing the CUDA array are undefined.
Unlike texture memory, surface memory uses byte addressing. This means that the x-coordinate used to access a texture element via texture functions needs to be multiplied by the byte size of the element to access the same element via a surface function. Cubemap surfaces are accessed using surfCubemapread and surfCubemapwrite as a two-dimensional layered surface, i.e., using an integer index denoting a face and two coordinates addressing a texel within the layer corresponding to this face.
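A brief sketch of the byte addressing rule with the surface object API, assuming two 2D surface objects backed by float CUDA arrays created with cudaArraySurfaceLoadStore; the kernel simply copies one surface into the other.

__global__ void copySurface(cudaSurfaceObject_t inSurf, cudaSurfaceObject_t outSurf,
                            int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        // Surface memory is byte addressed: scale the x coordinate by the element size.
        float data = surf2Dread<float>(inSurf, x * sizeof(float), y);
        surf2Dwrite(data, outSurf, x * sizeof(float), y);
    }
}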
Faces are ordered as indicated in Table 2. Cubemap layered surfaces are accessed using surfCubemapLayeredread and surfCubemapLayeredwrite as a two-dimensional layered surface, i.e., using an integer index denoting a face of one of the cubemaps and two coordinates addressing a texel within the layer corresponding to this face.
CUDA arrays are opaque memory layouts optimized for texture fetching. They are one-dimensional, two-dimensional, or three-dimensional and composed of elements, each of which has 1, 2, or 4 components that may be signed or unsigned 8-, 16-, or 32-bit integers, 16-bit floats, or 32-bit floats. CUDA arrays are only accessible by kernels through texture fetching as described in Texture Memory or surface reading and writing as described in Surface Memory.

The texture and surface memory is cached (see Device Memory Accesses), and within the same kernel call, the cache is not kept coherent with respect to global memory writes and surface memory writes, so any texture fetch or surface read to an address that has been written to via a global write or a surface write in the same kernel call returns undefined data.
In other words, a thread can safely read some texture or surface memory location only if this memory location has been updated by a previous kernel call or memory copy, but not if it has been previously updated by the same thread or another thread from the same kernel call. Registering a resource is potentially high-overhead and therefore typically called only once per resource.
Each CUDA context which intends to use the resource is required to register it separately. A mapped buffer resource appears in CUDA as a device pointer and can therefore be read and written by kernels or via cudaMemcpy calls.

A mapped texture or renderbuffer resource appears in CUDA as a CUDA array. Kernels can read from the array by binding it to a texture or surface reference. They can also write to it via the surface write functions if the resource has been registered with the cudaGraphicsRegisterFlagsSurfaceLoadStore flag. The array can also be read and written via cudaMemcpy2D calls.
The application needs to register the texture for interop before requesting an image or texture handle. The kernel sketched below dynamically modifies a 2D width x height grid of vertices stored in a vertex buffer object.
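A hedged sketch of such a kernel; the sinusoidal displacement and the float4 vertex layout are assumptions for illustration, not requirements of the interop API. positions is the device pointer obtained by mapping the registered vertex buffer object.

__global__ void createVertices(float4* positions, float time,
                               unsigned int width, unsigned int height)
{
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // Normalized grid coordinates in [-1, 1].
    float u = x / (float)width  * 2.0f - 1.0f;
    float v = y / (float)height * 2.0f - 1.0f;

    // Simple time-varying wave for the third coordinate.
    float w = sinf(u * 4.0f + time) * cosf(v * 4.0f + time) * 0.5f;

    positions[y * width + x] = make_float4(u, w, v, 1.0f);
}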
There are, however, special considerations, described below, when the system is in SLI mode. First, an allocation on one CUDA device consumes memory on the other GPUs that are part of the SLI configuration of the Direct3D or OpenGL device; because of this, allocations may fail earlier than otherwise expected.
While this is not a strict requirement, it avoids unnecessary data transfers between devices. Therefore, on SLI configurations, when data for different frames is computed on different CUDA devices, it is necessary to register the resources for each device separately.

There are two types of resources that can be imported: memory objects and synchronization objects. Depending on the type of memory object, it may be possible for more than one mapping to be set up on a single memory object.
The mappings must match the mappings set up in the exporting API; any mismatched mappings result in undefined behavior. Imported memory objects must be freed using cudaDestroyExternalMemory. Freeing a memory object does not free any mappings to that object. Therefore, any device pointers mapped onto that object must be explicitly freed using cudaFree, and any CUDA mipmapped arrays mapped onto that object must be explicitly freed using cudaFreeMipmappedArray.
It is illegal to access mappings to an object after it has been destroyed. Similarly, for imported synchronization objects, it is illegal to issue a wait before the corresponding signal has been issued.
Also, depending on the type of the imported synchronization object, there may be additional constraints imposed on how they can be signaled and waited on, as described in subsequent sections. Imported semaphore objects must be freed using cudaDestroyExternalSemaphore. All outstanding signals and waits must have completed before the semaphore object is destroyed. When importing memory and synchronization objects exported by Vulkan, they must be imported and mapped on the same device as they were created on.
Note that the Vulkan physical device should not be part of a device group that contains more than one Vulkan physical device. The device group as returned by vkEnumeratePhysicalDeviceGroups that contains the given Vulkan physical device must have a physical device count of 1.
On Windows 7, only dedicated memory objects can be imported. When importing a Vulkan dedicated memory object, the flag cudaExternalMemoryDedicated must be set. Note that CUDA assumes ownership of the file descriptor once it is imported.
Using the file descriptor after a successful import results in undefined behavior. The NT handle holds a reference to the resource, so it must be explicitly freed before the underlying memory can be freed. Since a globally shared D3DKMT handle does not hold a reference to the underlying memory it is automatically destroyed when all other references to the resource are destroyed.
A device pointer can be mapped onto an imported memory object as shown below. The offset and size of the mapping must match that specified when creating the mapping using the corresponding Vulkan API. All mapped device pointers must be freed using cudaFree.
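A minimal sketch of this mapping, assuming an already imported cudaExternalMemory_t handle and offset/size values taken from the Vulkan-side allocation.

#include <cuda_runtime.h>

void* mapBufferOntoExternalMemory(cudaExternalMemory_t extMem,
                                  unsigned long long offset, unsigned long long size)
{
    void* ptr = NULL;
    cudaExternalMemoryBufferDesc desc = {};
    desc.offset = offset;   // must match the offset used when the memory was exported
    desc.size   = size;     // must match the size of the exported range
    desc.flags  = 0;

    cudaExternalMemoryGetMappedBuffer(&ptr, extMem, &desc);
    return ptr;             // must eventually be freed with cudaFree
}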
A CUDA mipmapped array can be mapped onto an imported memory object as shown below. The offset, dimensions, format and number of mip levels must match that specified when creating the mapping using the corresponding Vulkan API.
Additionally, if the mipmapped array is bound as a color target in Vulkan, the flag cudaArrayColorAttachment must be set. All mapped mipmapped arrays must be freed using cudaFreeMipmappedArray. The following code sample shows how to convert Vulkan parameters into the corresponding CUDA parameters when mapping mipmapped arrays onto imported memory objects.
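A sketch of mapping a mipmapped array onto an already imported memory object; the channel format, extent, flags, and level count passed in are assumed to have been derived from the corresponding Vulkan image parameters (that conversion is not shown here).

#include <cuda_runtime.h>

cudaMipmappedArray_t mapMipmappedArrayOntoExternalMemory(
        cudaExternalMemory_t extMem, unsigned long long offset,
        const cudaChannelFormatDesc* formatDesc, const cudaExtent* extent,
        unsigned int flags, unsigned int numLevels)
{
    cudaMipmappedArray_t mipmap = NULL;
    cudaExternalMemoryMipmappedArrayDesc desc = {};
    desc.offset     = offset;       // must match the Vulkan-side offset
    desc.formatDesc = *formatDesc;  // derived from the VkFormat of the image
    desc.extent     = *extent;      // derived from the VkExtent3D of the image
    desc.flags      = flags;        // e.g., cudaArrayColorAttachment for color targets
    desc.numLevels  = numLevels;    // must match the Vulkan mip level count

    cudaExternalMemoryGetMappedMipmappedArray(&mipmap, extMem, &desc);
    return mipmap;                  // must eventually be freed with cudaFreeMipmappedArray
}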
The NT handle holds a reference to the resource, so it must be explicitly freed before the underlying semaphore can be freed. Since a globally shared D3DKMT handle does not hold a reference to the underlying semaphore it is automatically destroyed when all other references to the resource are destroyed. An imported Vulkan semaphore object can be signaled as shown below.
Signaling such a semaphore object sets it to the signaled state. The corresponding wait that waits on this signal must be issued in Vulkan. Additionally, the wait that waits on this signal must be issued after this signal has been issued.
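A minimal sketch of the signal operation (together with the matching wait described in the next paragraph), assuming an already imported cudaExternalSemaphore_t backed by a binary Vulkan semaphore.

#include <cuda_runtime.h>

void signalExternalSemaphore(cudaExternalSemaphore_t extSem, cudaStream_t stream)
{
    cudaExternalSemaphoreSignalParams params = {};   // no extra parameters for a binary semaphore
    cudaSignalExternalSemaphoresAsync(&extSem, &params, 1, stream);
}

void waitExternalSemaphore(cudaExternalSemaphore_t extSem, cudaStream_t stream)
{
    cudaExternalSemaphoreWaitParams params = {};
    cudaWaitExternalSemaphoresAsync(&extSem, &params, 1, stream);
}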
An imported Vulkan semaphore object can be waited on as shown below. Waiting on such a semaphore object waits until it reaches the signaled state and then resets it back to the unsignaled state. The corresponding signal that this wait is waiting on must be issued in Vulkan. Additionally, the signal must be issued before this wait can be issued. Please refer to the OpenGL external memory object and external semaphore extensions for further details on how to import memory and synchronization objects exported by Vulkan.
When importing memory and synchronization objects exported by Direct3D 12, they must be imported and mapped on the same device as they were created on. Note that the Direct3D 12 device must not be created on a linked node adapter.
A shareable Direct3D 12 heap memory object can also be imported using a named handle if one exists as shown below. A shareable Direct3D 12 committed resource can also be imported using a named handle if one exists as shown below.
The offset and size of the mapping must match that specified when creating the mapping using the corresponding Direct3D 12 API. The offset, dimensions, format and number of mip levels must match that specified when creating the mapping using the corresponding Direct3D 12 API. Additionally, if the mipmapped array can be bound as a render target in Direct3D 12, the flag cudaArrayColorAttachment must be set. A shareable Direct3D 12 fence object can also be imported using a named handle if one exists as shown below.
An imported Direct3D 12 fence object can be signaled as shown below. Signaling such a fence object sets its value to the one specified. The corresponding wait that waits on this signal must be issued in Direct3D 12. An imported Direct3D 12 fence object can be waited on as shown below.
Waiting on such a fence object waits until its value becomes greater than or equal to the specified value. The corresponding signal that this wait is waiting on must be issued in Direct3D 12.

When importing memory and synchronization objects exported by Direct3D 11, they must be imported and mapped on the same device as they were created on. A shareable Direct3D 11 resource can also be imported using a named handle if one exists as shown below. The offset and size of the mapping must match that specified when creating the mapping using the corresponding Direct3D 11 API.
The offset, dimensions, format and number of mip levels must match that specified when creating the mapping using the corresponding Direct3D 11 API.
The following code sample shows how to convert Direct3D 11 parameters into the corresponding CUDA parameters when mapping mipmapped arrays onto imported memory objects. A shareable Direct3D 11 fence object can also be imported using a named handle if one exists as shown below. A shareable Direct3D 11 keyed mutex object can also be imported using a named handle if one exists as shown below.
An imported Direct3D 11 fence object can be signaled as shown below. An imported Direct3D 11 fence object can be waited on as shown below. An imported Direct3D 11 keyed mutex object can be signaled as shown below. Signaling such a keyed mutex object by specifying a key value releases the keyed mutex for that value. The corresponding wait that waits on this signal must be issued in Direct3D 11 with the same key value.
Additionally, the Direct3D 11 wait must be issued after this signal has been issued. An imported Direct3D 11 keyed mutex object can be waited on as shown below. A timeout value in milliseconds is needed when waiting on such a keyed mutex.
The wait operation waits until the keyed mutex value is equal to the specified key value or until the timeout has elapsed. The timeout interval can also be an infinite value. In case an infinite value is specified the timeout never elapses. Additionally, the Direct3D 11 signal must be issued before this wait can be issued.
Optionally, applications can specify additional NvSciBuf allocation attributes. Note that the attribute list and NvSciBuf objects should be maintained by the application. The offset and size of the mapping can be filled in as per the attributes of the allocated NvSciBufObj. Note that ownership of the NvSciSyncObj handle continues to lie with the application even after it is imported.
An imported NvSciSyncObj object can be signaled as outlined below. Signaling an NvSciSync-backed semaphore object initializes the fence parameter passed as input; this fence parameter is waited upon by a wait operation that corresponds to the aforementioned signal. If the flags are set to cudaExternalSemaphoreSignalSkipNvSciBufMemSync, then the memory synchronization operations over all the imported NvSciBuf objects in this process, which are executed by default as part of the signal operation, are skipped.
An imported NvSciSyncObj object can be waited upon as outlined below. Waiting on an NvSciSync-backed semaphore object waits until the input fence parameter is signaled by the corresponding signaler. Additionally, the signal must be issued before the wait can be issued. If the flags are set to cudaExternalSemaphoreWaitSkipNvSciBufMemSync, then the memory synchronization operations over all the imported NvSciBuf objects in this process, which are executed by default as part of the wait operation, are skipped.
Such dynamic resource-management schemes are difficult with CUDA graphs because of the non-fixed pointer or handle for the resource, which requires indirection or graph update, and because of the synchronous CPU code needed each time the work is submitted. They also do not work with stream capture if these considerations are hidden from the caller of the library, and because of the use of disallowed APIs during capture.
Various solutions exist such as exposing the resource to the caller. CUDA user objects present another approach. A typical use case would be to immediately move the sole user-owned reference to a CUDA graph after the user object is created. References owned by graphs in child graph nodes are associated to the child graphs, not the parents. If a child graph is updated or deleted, the references change accordingly. If an executable graph or child graph is updated with cudaGraphExecUpdate or cudaGraphExecChildGraphNodeSetParams , the references in the new source graph are cloned and replace the references in the target graph.
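A sketch of that use case, assuming the cudaUserObjectCreate and cudaGraphRetainUserObject interfaces of recent CUDA releases; the Resource type and destructor policy are illustrative.

#include <cuda_runtime.h>

struct Resource { /* state owned by the graph's workloads */ };

static void CUDART_CB destroyResource(void* data)
{
    delete static_cast<Resource*>(data);   // destructor code must not block CUDA's internal threads
}

void attachResourceToGraph(cudaGraph_t graph)
{
    Resource* res = new Resource;
    cudaUserObject_t obj;
    // Create the user object with a single user-owned reference.
    cudaUserObjectCreate(&obj, res, destroyResource, 1, cudaUserObjectNoDestructorSync);
    // Move that sole reference into the graph, which now owns the resource's lifetime.
    cudaGraphRetainUserObject(graph, obj, 1, cudaGraphUserObjectMove);
}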
In either case, if previous launches are not synchronized, any references which would be released are held until the launches have finished executing.
Users may signal a synchronization object manually from the destructor code. This is to avoid blocking a CUDA internal shared thread and preventing forward progress. It is legal to signal another thread to perform an API call, if the dependency is one way and the thread doing the call cannot block forward progress of CUDA work.
There are two version numbers that developers should care about when developing a CUDA application: the compute capability, which describes the general specifications and features of the compute device (see Compute Capability), and the version of the CUDA driver API, which describes the features supported by the driver API and runtime.
The driver API version is defined in the driver header file as CUDA_VERSION; it allows developers to check whether their application requires a newer device driver than the one currently installed. This is important because the driver API is backward compatible, meaning that applications, plug-ins, and libraries (including the CUDA runtime) compiled against a particular version of the driver API will continue to work on subsequent device driver releases. The driver API is not forward compatible, which means that applications, plug-ins, and libraries (including the CUDA runtime) compiled against a particular version of the driver API will not work on previous versions of the device driver.
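A small sketch of such a check using the runtime's version queries; how an application reacts to a mismatch is left open.

#include <cstdio>
#include <cuda.h>            // defines CUDA_VERSION for the headers compiled against
#include <cuda_runtime.h>

void printVersions(void)
{
    int driverVersion = 0, runtimeVersion = 0;
    cudaDriverGetVersion(&driverVersion);    // latest CUDA version supported by the installed driver
    cudaRuntimeGetVersion(&runtimeVersion);  // version of the CUDA runtime in use
    printf("driver supports CUDA %d, runtime is CUDA %d, compiled against CUDA_VERSION %d\n",
           driverVersion, runtimeVersion, CUDA_VERSION);
}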
It is important to note that there are limitations on the mixing and matching of versions that is supported; the requirements on the CUDA Driver version described here apply to the version of the user-mode components. On Tesla solutions running Windows Server and later or Linux, one can set any device in a system in one of three compute modes (default, exclusive-process, or prohibited) using NVIDIA's System Management Interface (nvidia-smi), which is a tool distributed as part of the driver.
This means, in particular, that a host thread using the runtime API without explicitly calling cudaSetDevice might be associated with a device other than device 0 if device 0 turns out to be in prohibited mode or in exclusive-process mode and used by another process. Note also that, for devices featuring the Pascal architecture onwards (compute capability with major revision number 6 and higher), there exists support for Compute Preemption.
This allows compute tasks to be preempted at instruction-level granularity, rather than at thread block granularity as in the prior Maxwell and Kepler GPU architectures, with the benefit that applications with long-running kernels can be prevented from either monopolizing the system or timing out. However, there will be context switch overheads associated with Compute Preemption, which is automatically enabled on those devices for which support exists.
The individual attribute query function cudaDeviceGetAttribute with the attribute cudaDevAttrComputePreemptionSupported can be used to determine if the device in use supports Compute Preemption.
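A one-line sketch of that query (the device index is assumed to have been chosen already).

int supportsComputePreemption(int device)
{
    int supported = 0;
    cudaDeviceGetAttribute(&supported, cudaDevAttrComputePreemptionSupported, device);
    return supported;   // 1 if the device can preempt at instruction-level granularity
}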
Users wishing to avoid context switch overheads associated with different processes can ensure that only one process is active on the GPU by selecting exclusive-process mode.
Applications may query the compute mode of a device by checking the computeMode device property see Device Enumeration. GPUs that have a display output dedicate some DRAM memory to the so-called primary surface , which is used to refresh the display device whose output is viewed by the user. When users initiate a mode switch of the display by changing the resolution or bit depth of the display using NVIDIA control panel or the Display control panel on Windows , the amount of memory needed for the primary surface changes.
For example, if the user switches the display to a higher resolution or bit depth, the system must dedicate more memory to the primary surface than before. Full-screen graphics applications running with anti-aliasing enabled may require much more display memory for the primary surface. If a mode switch increases the amount of memory needed for the primary surface, the system may have to cannibalize memory allocations dedicated to CUDA applications. Therefore, a mode switch causes any call to the CUDA runtime to fail and return an invalid context error.
When a CUDA program on the host CPU invokes a kernel grid, the blocks of the grid are enumerated and distributed to multiprocessors with available execution capacity. The threads of a thread block execute concurrently on one multiprocessor, and multiple thread blocks can execute concurrently on one multiprocessor. As thread blocks terminate, new blocks are launched on the vacated multiprocessors.
A multiprocessor is designed to execute hundreds of threads concurrently. The instructions are pipelined, leveraging instruction-level parallelism within a single thread, as well as extensive thread-level parallelism through simultaneous hardware multithreading as detailed in Hardware Multithreading.
Unlike on CPU cores, however, instructions are issued in order, and there is no branch prediction or speculative execution. SIMT Architecture and Hardware Multithreading describe the architecture features of the streaming multiprocessor that are common to all devices.
The Compute Capability sections of the appendix provide the specifics for each device generation. The multiprocessor creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps. Individual threads composing a warp start together at the same program address, but they have their own instruction address counter and register state and are therefore free to branch and execute independently. The term warp originates from weaving, the first parallel thread technology. A half-warp is either the first or second half of a warp.
A quarter-warp is either the first, second, third, or fourth quarter of a warp. When a multiprocessor is given one or more thread blocks to execute, it partitions them into warps and each warp gets scheduled by a warp scheduler for execution. The way a block is partitioned into warps is always the same; each warp contains threads of consecutive, increasing thread IDs with the first warp containing thread 0. Thread Hierarchy describes how thread IDs relate to thread indices in the block.
A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path. If threads of a warp diverge via a data-dependent conditional branch, the warp executes each branch path taken, disabling threads that are not on that path. Branch divergence occurs only within a warp; different warps execute independently regardless of whether they are executing common or disjoint code paths.
In contrast with SIMD vector machines, SIMT enables programmers to write thread-level parallel code for independent, scalar threads, as well as data-parallel code for coordinated threads. For the purposes of correctness, the programmer can essentially ignore the SIMT behavior; however, substantial performance improvements can be realized by taking care that the code seldom requires threads in a warp to diverge. In practice, this is analogous to the role of cache lines in traditional code: Cache line size can be safely ignored when designing for correctness but must be considered in the code structure when designing for peak performance.
Vector architectures, on the other hand, require the software to coalesce loads into vectors and manage divergence manually. Prior to Volta, warps used a single program counter shared amongst all 32 threads in the warp together with an active mask specifying the active threads of the warp.
As a result, threads from the same warp in divergent regions or different states of execution cannot signal each other or exchange data, and algorithms requiring fine-grained sharing of data guarded by locks or mutexes can easily lead to deadlock, depending on which warp the contending threads come from.
Starting with the Volta architecture, Independent Thread Scheduling allows full concurrency between threads, regardless of warp. With Independent Thread Scheduling, the GPU maintains execution state per thread, including a program counter and call stack, and can yield execution at a per-thread granularity, either to make better use of execution resources or to allow one thread to wait for data to be produced by another. A schedule optimizer determines how to group active threads from the same warp together into SIMT units.
Independent Thread Scheduling can lead to a rather different set of threads participating in the executed code than intended if the developer made assumptions about warp-synchronicity 2 of previous hardware architectures.
In particular, any warp-synchronous code (such as synchronization-free, intra-warp reductions) should be revisited to ensure compatibility with Volta and beyond. See Compute Capability 7.x for further details. The threads of a warp that are participating in the current instruction are called the active threads, whereas threads not on the current instruction are inactive (disabled).
Threads can be inactive for a variety of reasons including having exited earlier than other threads of their warp, having taken a different branch path than the branch path currently executed by the warp, or being the last threads of a block whose number of threads is not a multiple of the warp size.
If a non-atomic instruction executed by a warp writes to the same location in global or shared memory for more than one of the threads of the warp, the number of serialized writes that occur to that location varies depending on the compute capability of the device (see the corresponding Compute Capability sections).
The execution context (program counters, registers, and so on) of each warp processed by a multiprocessor is maintained on-chip during the entire lifetime of the warp. Therefore, switching from one execution context to another has no cost, and at every instruction issue time, a warp scheduler selects a warp that has threads ready to execute its next instruction (the active threads of the warp) and issues the instruction to those threads.
In particular, each multiprocessor has a set of 32-bit registers that are partitioned among the warps, and a parallel data cache or shared memory that is partitioned among the thread blocks. The number of blocks and warps that can reside and be processed together on the multiprocessor for a given kernel depends on the amount of registers and shared memory used by the kernel and the amount of registers and shared memory available on the multiprocessor.
There is also a maximum number of resident blocks and a maximum number of resident warps per multiprocessor. These limits, as well as the amount of registers and shared memory available on the multiprocessor, are a function of the compute capability of the device and are given in Appendix Compute Capabilities.
If there are not enough registers or shared memory available per multiprocessor to process at least one block, the kernel will fail to launch. Which strategies will yield the best performance gain for a particular portion of an application depends on the performance limiters for that portion; optimizing instruction usage of a kernel that is mostly limited by memory accesses will not yield any significant performance gain, for example. Optimization efforts should therefore be constantly directed by measuring and monitoring the performance limiters, for example using the CUDA profiler.
Also, comparing the floating-point operation throughput or memory throughput—whichever makes more sense—of a particular kernel to the corresponding peak theoretical throughput of the device indicates how much room for improvement there is for the kernel. To maximize utilization the application should be structured in a way that it exposes as much parallelism as possible and efficiently maps this parallelism to the various components of the system to keep them busy most of the time.
At a high level, the application should maximize parallel execution between the host, the devices, and the bus connecting the host to the devices, by using asynchronous function calls and streams as described in Asynchronous Concurrent Execution.
It should assign to each processor the type of work it does best: serial workloads to the host; parallel workloads to the devices. When threads need to share data, there are two cases: threads within the same block can synchronize and exchange data through shared memory within a single kernel invocation, whereas threads in different blocks must exchange data through global memory using two separate kernel invocations, one writing to and one reading from global memory. The second case is much less optimal since it adds the overhead of extra kernel invocations and global memory traffic. Its occurrence should therefore be minimized by mapping the algorithm to the CUDA programming model in such a way that the computations that require inter-thread communication are performed within a single thread block as much as possible.
At a lower level, the application should maximize parallel execution between the multiprocessors of a device. Multiple kernels can execute concurrently on a device, so maximum utilization can also be achieved by using streams to enable enough kernels to execute concurrently as described in Asynchronous Concurrent Execution. At an even lower level, the application should maximize parallel execution between the various functional units within a multiprocessor.
As described in Hardware Multithreading , a GPU multiprocessor primarily relies on thread-level parallelism to maximize utilization of its functional units. Utilization is therefore directly linked to the number of resident warps. At every instruction issue time, a warp scheduler selects an instruction that is ready to execute. This instruction can be another independent instruction of the same warp, exploiting instruction-level parallelism, or more commonly an instruction of another warp, exploiting thread-level parallelism.
If a ready to execute instruction is selected it is issued to the active threads of the warp. The number of clock cycles it takes for a warp to be ready to execute its next instruction is called the latency , and full utilization is achieved when all warp schedulers always have some instruction to issue for some warp at every clock cycle during that latency period, or in other words, when latency is completely "hidden".
The number of instructions required to hide a latency of L clock cycles depends on the respective throughputs of these instructions (see Arithmetic Instructions for the throughputs of various arithmetic instructions); if we assume instructions issued at maximum throughput, it is equal to 4L on devices whose multiprocessors have four warp schedulers each issuing one instruction per warp per clock cycle. The most common reason a warp is not ready to execute its next instruction is that the instruction's input operands are not available yet. If all input operands are registers, latency is caused by register dependencies, i.e., some of the input operands were written by previous instructions whose execution has not completed yet.
In this case, the latency is equal to the execution time of the previous instruction and the warp schedulers must schedule instructions of other warps during that time. Execution time varies depending on the instruction.
On devices of compute capability 7.x, for most arithmetic instructions, it is typically 4 clock cycles. This means that 16 active warps per multiprocessor (4 cycles, 4 warp schedulers) are required to hide arithmetic instruction latencies, assuming that warps execute instructions with maximum throughput; otherwise fewer warps are needed.
If the individual warps exhibit instruction-level parallelism, i.e., have multiple independent instructions in their instruction stream, fewer warps are needed because independent instructions from a single warp can be issued back to back. If some input operand resides in off-chip memory, the latency is much higher: typically hundreds of clock cycles. The number of warps required to keep the warp schedulers busy during such high latency periods depends on the kernel code and its degree of instruction-level parallelism. In general, more warps are required if the ratio of the number of instructions with no off-chip memory operands (i.e., arithmetic instructions most of the time) to the number of instructions with off-chip memory operands is low; this ratio is commonly called the arithmetic intensity of the program.
Another reason a warp is not ready to execute its next instruction is that it is waiting at some memory fence Memory Fence Functions or synchronization point Synchronization Functions.
A synchronization point can force the multiprocessor to idle as more and more warps wait for other warps in the same block to complete execution of instructions prior to the synchronization point. Having multiple resident blocks per multiprocessor can help reduce idling in this case, as warps from different blocks do not need to wait for each other at synchronization points.
The number of blocks and warps residing on each multiprocessor for a given kernel call depends on the execution configuration of the call Execution Configuration , the memory resources of the multiprocessor, and the resource requirements of the kernel as described in Hardware Multithreading.
The total amount of shared memory required for a block is equal to the sum of the amount of statically allocated shared memory and the amount of dynamically allocated shared memory. The number of registers used by a kernel can have a significant impact on the number of resident warps. For example, for devices of compute capability 6.x, if a kernel uses 64 registers and each block has 512 threads and requires very little shared memory, then two blocks (i.e., 32 warps) can reside on the multiprocessor, since they require 2x512x64 registers, which exactly matches the number of registers available on the multiprocessor. But as soon as the kernel uses one more register, only one block (i.e., 16 warps) can be resident, since two blocks would require 2x512x65 registers, which is more than the number of registers available.
Therefore, the compiler attempts to minimize register usage while keeping register spilling see Device Memory Accesses and the number of instructions to a minimum. Register usage can be controlled using the maxrregcount compiler option or launch bounds as described in Launch Bounds.
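A brief sketch of both mechanisms; the numbers are illustrative rather than recommendations.

// Caps the block size at 256 threads and requests at least two resident blocks per
// multiprocessor, which bounds how many registers the compiler may use for this kernel.
__global__ void __launch_bounds__(256, 2) boundedKernel(float* data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] += 1.0f;
}
// Alternatively, register usage can be capped for a whole compilation unit on the
// command line, e.g.:  nvcc -maxrregcount=32 kernel.cu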
The register file is organized as 32-bit registers, so each variable stored in a register needs at least one 32-bit register; for example, a double variable uses two 32-bit registers.
The effect of execution configuration on performance for a given kernel call generally depends on the kernel code. Experimentation is therefore recommended. Applications can also parametrize execution configurations based on register file size and shared memory size, which depends on the compute capability of the device, as well as on the number of multiprocessors and memory bandwidth of the device, all of which can be queried using the runtime see reference manual.
The number of threads per block should be chosen as a multiple of the warp size to avoid wasting computing resources with under-populated warps as much as possible. Several API functions exist to assist programmers in choosing thread block size based on register and shared memory requirements. The following code sample calculates the occupancy of MyKernel.
It then reports the occupancy level in terms of the ratio of concurrent warps to the maximum number of warps per multiprocessor. The following code sample configures an occupancy-based kernel launch of MyKernel according to user input. A spreadsheet version of the occupancy calculator is also provided. The spreadsheet version is particularly useful as a learning tool that visualizes the impact of changes to the parameters that affect occupancy (block size, registers per thread, and shared memory per thread).
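A sketch combining the two samples described above; MyKernel is a stand-in, and error checking is omitted for brevity.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void MyKernel(int* d, const int* a, const int* b)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    d[idx] = a[idx] * b[idx];
}

void reportOccupancy(void)
{
    int blockSize = 256;
    int numBlocks;                 // resident blocks per multiprocessor for this block size
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, MyKernel, blockSize, 0);

    int device;
    cudaDeviceProp prop;
    cudaGetDevice(&device);
    cudaGetDeviceProperties(&prop, device);
    int activeWarps = numBlocks * blockSize / prop.warpSize;
    int maxWarps    = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("occupancy: %.1f%%\n", 100.0 * activeWarps / maxWarps);
}

void launchWithSuggestedBlockSize(int* d, const int* a, const int* b, int count)
{
    int minGridSize, blockSize;
    // Suggests a block size that maximizes occupancy for MyKernel.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, MyKernel, 0, count);
    int gridSize = (count + blockSize - 1) / blockSize;
    MyKernel<<<gridSize, blockSize>>>(d, a, b);
}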
The first step in maximizing overall memory throughput for the application is to minimize data transfers with low bandwidth. That means minimizing data transfers between the host and the device, as detailed in Data Transfer between Host and Device , since these have much lower bandwidth than data transfers between global memory and the device.
That also means minimizing data transfers between global memory and the device by maximizing use of on-chip memory: shared memory and caches (i.e., the L1 and L2 caches, as well as the texture and constant caches). Shared memory is equivalent to a user-managed cache: the application explicitly allocates and accesses it. As illustrated in CUDA Runtime, a typical programming pattern is to stage data coming from device memory into shared memory; in other words, to have each thread of a block load data from device memory to shared memory, synchronize with all the other threads of the block so that each thread can safely read shared memory locations that were populated by different threads, process the data in shared memory, synchronize again if necessary so that shared memory has been updated with the results, and finally write the results back to device memory, as sketched below.
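A deliberately simple sketch of that staging pattern (the shared-memory round trip is not needed for this particular computation; it only illustrates the load/synchronize/process/write-back steps); a block size of 256 is assumed.

__global__ void scaleWithStaging(float* data, float factor, int n)
{
    __shared__ float tile[256];                 // one element per thread of the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n)
        tile[threadIdx.x] = data[i];            // 1. load from device memory into shared memory
    __syncthreads();                            // 2. make the whole tile visible to the block

    if (i < n)
        tile[threadIdx.x] *= factor;            // 3. process the data in shared memory
    __syncthreads();                            // 4. synchronize again before results are consumed

    if (i < n)
        data[i] = tile[threadIdx.x];            // 5. write the results back to device memory
}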
For some applications (for example, those for which global memory access patterns are data-dependent), a traditional hardware-managed cache is more appropriate to exploit data locality. As mentioned in the compute capability sections, on some architectures the same on-chip memory is used for both L1 cache and shared memory, and how much of it is dedicated to each is configurable per kernel call. The throughput of memory accesses by a kernel can vary by an order of magnitude depending on the access pattern for each type of memory. The next step in maximizing memory throughput is therefore to organize memory accesses as optimally as possible based on the optimal memory access patterns described in Device Memory Accesses.
This optimization is especially important for global memory accesses as global memory bandwidth is low compared to available on-chip bandwidths and arithmetic instruction throughput, so non-optimal global memory accesses generally have a high impact on performance. Applications should strive to minimize data transfer between the host and the device. One way to accomplish this is to move more code from the host to the device, even if that means running kernels that do not expose enough parallelism to execute on the device with full efficiency.
Intermediate data structures may be created in device memory, operated on by the device, and destroyed without ever being mapped by the host or copied to host memory. Also, because of the overhead associated with each transfer, batching many small transfers into a single large transfer always performs better than making each transfer separately.
On systems with a front-side bus, higher performance for data transfers between host and device is achieved by using page-locked host memory as described in Page-Locked Host Memory.
In addition, when using mapped page-locked memory (Mapped Memory), there is no need to allocate any device memory and explicitly copy data between device and host memory. Data transfers are implicitly performed each time the kernel accesses the mapped memory. For maximum performance, these memory accesses must be coalesced as with accesses to global memory (see Device Memory Accesses). Assuming that they are and that the mapped memory is read or written only once, using mapped page-locked memory instead of explicit copies between device and host memory can be a win for performance.
On integrated systems where device memory and host memory are physically the same, any copy between host and device memory is superfluous and mapped page-locked memory should be used instead.
Applications may query whether a device is integrated by checking that the integrated device property (see Device Enumeration) is equal to 1.
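A sketch of the mapped (zero-copy) path, assuming a device that reports canMapHostMemory; scaleMapped and the sizes are illustrative. On some configurations, cudaSetDeviceFlags(cudaDeviceMapHost) may need to be called before any other CUDA call.

#include <cuda_runtime.h>

__global__ void scaleMapped(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void runWithMappedMemory(int n)
{
    // Page-locked host memory that is also mapped into the device address space.
    float* h_data;
    cudaHostAlloc((void**)&h_data, n * sizeof(float), cudaHostAllocMapped);

    // Device pointer aliasing the same physical memory; no explicit cudaMemcpy is needed.
    float* d_data;
    cudaHostGetDevicePointer((void**)&d_data, h_data, 0);

    scaleMapped<<<(n + 255) / 256, 256>>>(d_data, n);   // accesses cross the bus, so coalescing matters
    cudaDeviceSynchronize();                            // results are now visible through h_data

    cudaFreeHost(h_data);
}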
An instruction that accesses addressable memory (i.e., global, local, shared, constant, or texture memory) might need to be re-issued multiple times depending on the distribution of the memory addresses across the threads within the warp. How the distribution affects the instruction throughput this way is specific to each type of memory and described in the following sections. For example, for global memory, as a general rule, the more scattered the addresses are, the more reduced the throughput is. Global memory resides in device memory, and device memory is accessed via 32-, 64-, or 128-byte memory transactions. These memory transactions must be naturally aligned: only the 32-, 64-, or 128-byte segments of device memory that are aligned to their size (i.e., whose first address is a multiple of their size) can be read or written by memory transactions.
When a warp executes an instruction that accesses global memory, it coalesces the memory accesses of the threads within the warp into one or more of these memory transactions depending on the size of the word accessed by each thread and the distribution of the memory addresses across the threads.
In general, the more transactions are necessary, the more unused words are transferred in addition to the words accessed by the threads, reducing the instruction throughput accordingly. For example, if a 32-byte memory transaction is generated for each thread's 4-byte access, throughput is divided by 8. How many transactions are necessary and how much throughput is ultimately affected varies with the compute capability of the device. To maximize global memory throughput, it is therefore important to maximize coalescing by following the optimal access patterns described in the compute capability sections, using data types that meet the size and alignment requirement detailed below, and padding data in some cases, for example when accessing a two-dimensional array as described below.
Global memory instructions support reading or writing words of size equal to 1, 2, 4, 8, or 16 bytes. Any access via a variable or a pointer to data residing in global memory compiles to a single global memory instruction if and only if the size of the data type is 1, 2, 4, 8, or 16 bytes and the data is naturally aligned i. If this size and alignment requirement is not fulfilled, the access compiles to multiple instructions with interleaved access patterns that prevent these instructions from fully coalescing.
It is therefore recommended to use types that meet this requirement for data that resides in global memory. The alignment requirement is automatically fulfilled for the Built-in Vector Types. Any address of a variable residing in global memory or returned by one of the memory allocation routines from the driver or runtime API is always aligned to at least 256 bytes.
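For user-defined structures, the size and alignment requirement can be enforced with the alignment specifiers, for example (a sketch):

// Without the specifier this 12-byte struct is read with multiple load instructions;
// padding it to 16 bytes lets a thread load it with a single 16-byte access.
struct __align__(16) Float3Aligned {
    float x, y, z;
};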
Reading non-naturally aligned 8-byte or 16-byte words produces incorrect results (off by a few words), so special care must be taken to maintain alignment of the starting address of any value or array of values of these types.
A typical case where this might be easily overlooked is when using some custom global memory allocation scheme, whereby the allocation of multiple arrays (with multiple calls to cudaMalloc or cuMemAlloc) is replaced by the allocation of a single large block of memory partitioned into multiple arrays; in this case, the starting address of each array is offset from the block's starting address.
For a common access pattern in which each thread of index (tx, ty) accesses one element of a 2D array of width width located at BaseAddress, using the address BaseAddress + width * ty + tx, the accesses are fully coalesced only if both the width of the thread block and the width of the array are a multiple of the warp size. In particular, this means that an array whose width is not a multiple of this size will be accessed much more efficiently if it is actually allocated with a width rounded up to the closest multiple of this size and its rows padded accordingly. The cudaMallocPitch and cuMemAllocPitch functions and associated memory copy functions described in the reference manual enable programmers to write non-hardware-dependent code to allocate arrays that conform to these constraints, as in the sketch below.
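A sketch of that approach, assuming a 2D array of floats; the pitch returned by cudaMallocPitch is used to step from row to row.

#include <cuda_runtime.h>

__global__ void addOne(float* devPtr, size_t pitch, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        // Each row starts at a pitch-aligned address, keeping row accesses coalesced.
        float* row = (float*)((char*)devPtr + y * pitch);
        row[x] += 1.0f;
    }
}

void processPitched2D(int width, int height)
{
    float* devPtr;
    size_t pitch;
    cudaMallocPitch((void**)&devPtr, &pitch, width * sizeof(float), height);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    addOne<<<grid, block>>>(devPtr, pitch, width, height);

    cudaFree(devPtr);
}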
Local memory accesses only occur for some automatic variables as mentioned in Variable Memory Space Specifiers. Automatic variables that the compiler is likely to place in local memory are arrays for which it cannot determine that they are indexed with constant quantities, large structures or arrays that would consume too much register space, and any variable if the kernel uses more registers than are available (this is known as register spilling). Inspection of the PTX assembly code obtained by compiling with the -ptx or -keep option will tell if a variable has been placed in local memory during the first compilation phases, as it will be declared using the .local mnemonic and accessed using the ld.local and st.local mnemonics.
Even if it has not, subsequent compilation phases might still decide otherwise though if they find it consumes too much register space for the targeted architecture: Inspection of the cubin object using cuobjdump will tell if this is the case. Note that some mathematical functions have implementation paths that might access local memory. The local memory space resides in device memory, so local memory accesses have the same high latency and low bandwidth as global memory accesses and are subject to the same requirements for memory coalescing as described in Device Memory Accesses.
Local memory is however organized such that consecutive 32-bit words are accessed by consecutive thread IDs. Accesses are therefore fully coalesced as long as all threads in a warp access the same relative address (for example, the same index in an array variable or the same member in a structure variable).
On some devices of compute capability 3.x, local memory accesses are always cached in L1 and L2 in the same way as global memory accesses; on devices of compute capability 5.x and higher, they are always cached in L2 in the same way as global memory accesses. Because it is on-chip, shared memory has much higher bandwidth and much lower latency than local or global memory. To achieve high bandwidth, shared memory is divided into equally-sized memory modules, called banks, which can be accessed simultaneously. Any memory read or write request made of n addresses that fall in n distinct memory banks can therefore be serviced simultaneously, yielding an overall bandwidth that is n times as high as the bandwidth of a single module.
However, if two addresses of a memory request fall in the same memory bank, there is a bank conflict and the access has to be serialized.
The hardware splits a memory request with bank conflicts into as many separate conflict-free requests as necessary, decreasing throughput by a factor equal to the number of separate memory requests. If the number of separate memory requests is n , the initial memory request is said to cause n -way bank conflicts. To get maximum performance, it is therefore important to understand how memory addresses map to memory banks in order to schedule the memory requests so as to minimize bank conflicts.
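One common way to apply this is to pad a shared-memory tile by one column so that threads of a warp reading a column hit distinct banks; the sketch below assumes 32 four-byte-wide banks, a square matrix whose side is a multiple of 32, and a 32x32 thread block.

#define TILE_DIM 32

__global__ void transposeTile(float* out, const float* in, int width)
{
    // The +1 padding shifts each row by one bank, making column reads conflict-free.
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];    // coalesced global read

    __syncthreads();

    int tx = blockIdx.y * TILE_DIM + threadIdx.x;
    int ty = blockIdx.x * TILE_DIM + threadIdx.y;
    out[ty * width + tx] = tile[threadIdx.x][threadIdx.y]; // column read from shared memory
}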
This mapping is described in the corresponding Compute Capability sections. The constant memory space resides in device memory and is cached in the constant cache. A request to constant memory is split into as many separate requests as there are different memory addresses in the initial request, decreasing throughput by a factor equal to the number of separate requests.
The resulting requests are then serviced at the throughput of the constant cache in case of a cache hit, or at the throughput of device memory otherwise. The texture and surface memory spaces reside in device memory and are cached in texture cache, so a texture fetch or surface read costs one memory read from device memory only on a cache miss, otherwise it just costs one read from texture cache. The texture cache is optimized for 2D spatial locality, so threads of the same warp that read texture or surface addresses that are close together in 2D will achieve best performance.
Also, it is designed for streaming fetches with a constant latency; a cache hit reduces DRAM bandwidth demand but not fetch latency. Reading device memory through texture or surface fetching presents some benefits that can make it an advantageous alternative to reading device memory from global or constant memory.
In this section, throughputs are given in number of operations per clock cycle per multiprocessor. All throughputs are for one multiprocessor. They must be multiplied by the number of multiprocessors in the device to get throughput for the whole device. Table 3 gives the throughputs of the arithmetic instructions that are natively supported in hardware for devices of various compute capabilities.
Other instructions and functions are implemented on top of the native instructions. The implementation may be different for devices of different compute capabilities, and the number of native instructions after compilation may fluctuate with every compiler version. For complicated functions, there can be multiple code paths depending on input. The nvcc user manual describes the relevant compilation flags in more detail.
To preserve IEEE-754 semantics, the compiler can optimize 1.0/sqrtf() into rsqrtf() only when both the reciprocal and the square root are approximate (i.e., with -prec-div=false and -prec-sqrt=false). It is therefore recommended to invoke rsqrtf() directly where desired. Single-precision floating-point square root is implemented as a reciprocal square root followed by a reciprocal instead of a reciprocal square root followed by a multiplication, so that it gives correct results for 0 and infinity.
More precisely, the argument reduction code (see Mathematical Functions for the implementation) comprises two code paths, referred to as the fast path and the slow path, respectively.