Technical News from The Portland Group

CUDA Fortran Asynchronous Data Transfers

In PGI CUDA Fortran, data transfers in either direction between the host and device using Fortran assignment statements or the function cudaMemcpy() are blocking transfers; that is, control is returned to the host thread only after the data transfer is complete. The cudaMemcpyAsync() function is a non-blocking variant in which control is returned to the host thread immediately. In contrast with assignment statements or cudaMemcpy(), the asynchronous version requires pinned host memory, and it takes an additional argument, a stream ID. A stream is simply a sequence of operations that are performed in order on the device. Operations in different streams can be interleaved and in some cases overlapped, a property that can be used to hide data transfers between the host and the device.
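Since cudaMemcpyAsync() and the overlap techniques discussed below both require pinned (page-locked) host memory, it is worth noting how it is obtained in CUDA Fortran: the host array is declared with the pinned attribute. The following is a minimal sketch, not taken from the article's source code; the array names and the pinnedFlag variable are illustrative, and the optional pinned= specifier on allocate reports whether the page-locked allocation actually succeeded:

real, pinned, allocatable :: a(:)   ! page-locked host array
real, device, allocatable :: a_d(:) ! device array
logical :: pinnedFlag
integer :: istat

allocate(a(nElements), stat=istat, pinned=pinnedFlag)
allocate(a_d(nElements))
if (.not. pinnedFlag) print *, 'Warning: array a was not pinned'

If the pinned allocation fails, transfers still work but fall back to pageable host memory, and the copies can no longer be truly asynchronous with respect to the host.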

Asynchronous transfers enable overlap of data transfers with computation in two different ways. First, on all CUDA-enabled devices, it is possible to overlap host computation with asynchronous data transfers and with device computations. For example, the following code segment demonstrates how host computation in the routine cpuRoutine() is performed while data is transferred to the device and a kernel using the device is executed.

istat = cudaMemcpyAsync(a_d, a, nElements, 0)
call kernel<<<gridSize, blockSize>>>(a_d)
call cpuRoutine(b)

The first three arguments of cudaMemcpyAsync() are the same as the three arguments to cudaMemcpy(). The last argument is the stream ID, which in this case is the default stream, stream 0. The kernel also uses the default stream, so it will not begin execution until the memory copy completes; therefore, no explicit synchronization is needed. Because both the memory copy and the kernel return control to the host immediately, the host subroutine cpuRoutine() can execute concurrently with both of them, although in the example above the memory copy and the kernel execution themselves occur sequentially because they occupy the same stream.

On devices that are capable of "concurrent copy and execute," it is possible to overlap kernel execution on the device with data transfers between the host and the device. This is the second method of overlapping computation with data transfer and the main focus of this article. Whether a device has this capability is indicated by the deviceOverlap field of a cudaDeviceProp variable (or look for "Concurrent Copy and Execution" in the output of pgaccelinfo). On devices that have this capability, the overlap once again requires pinned host memory, and, in addition, the data transfer and kernel must use different, non-default streams (streams with non-zero stream IDs). Non-default streams are required for this overlap because memory copies, memory set functions, and kernel calls that use the default stream begin only after all preceding operations on the device (in any stream) have completed, and no operation on the device (in any stream) commences until they are finished.
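The deviceOverlap capability can also be queried at run time from the program itself. A minimal sketch, assuming the cudafor module is used; the device number 0 is illustrative:

type(cudaDeviceProp) :: prop
integer :: istat

istat = cudaGetDeviceProperties(prop, 0)
if (prop%deviceOverlap == 0) then
  print *, 'Device cannot overlap data transfers with kernel execution'
endif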

istat = cudaStreamCreate(stream1)
istat = cudaStreamCreate(stream2)
istat = cudaMemcpyAsync(a_d, a, nElements, stream1)
call kernel<<<gridSize, blockSize, 0, stream2>>>(b_d)

In this example, two streams are created and used for the data transfer and kernel execution, as specified in the last argument of the cudaMemcpyAsync() call and in the kernel's execution configuration. We make use of two device arrays, a_d and b_d, assigning the work on a_d to stream1 and the work on b_d to stream2.

If the operations on a single data array in a kernel are independent, then data can be broken into chunks and transferred in multiple stages, multiple kernels launched to operate on each chunk as it arrives, and each chunk's results transferred back to the host when the relevant kernel completes. The following code segments demonstrate two ways of breaking up data transfers and kernel work in order to hide transfer time. The source code for the entire program is available to download.

! baseline case - sequential transfer and execute
a = 0
istat = cudaEventRecord(startEvent, 0)
a_d = a
call kernel<<<n/blockSize, blockSize>>>(a_d, 0)
a = a_d
istat = cudaEventRecord(stopEvent, 0)

! Setup for multiple stream processing
strSize = n / nStreams
strGridSize = strSize / blockSize
do i = 1, nStreams
  istat = cudaStreamCreate(stream(i))
enddo


! asynchronous version 1: loop over {copy, kernel, copy}
a = 0
istat = cudaEventRecord(startEvent, 0)
do i = 1, nStreams
  offset = (i-1)*strSize
  istat = cudaMemcpyAsync(a_d(offset+1), a(offset+1), strSize, stream(i))
  call kernel<<<strGridSize, blockSize, 0, stream(i)>>>(a_d, offset)
  istat = cudaMemcpyAsync(a(offset+1), a_d(offset+1), strSize, stream(i))
enddo
istat = cudaEventRecord(stopEvent, 0)


! asynchronous version 2: loop over copy, loop over kernel, loop over copy
a = 0
istat = cudaEventRecord(startEvent, 0)
do i = 1, nStreams
  offset = (i-1)*strSize
  istat = cudaMemcpyAsync(a_d(offset+1), a(offset+1), strSize, stream(i))
enddo
do i = 1, nStreams
  offset = (i-1)*strSize
  call kernel<<<strGridSize, blockSize, 0, stream(i)>>>(a_d, offset)
enddo
do i = 1, nStreams
  offset = (i-1)*strSize
  istat = cudaMemcpyAsync(a(offset+1), a_d(offset+1), strSize, stream(i))
enddo
istat = cudaEventRecord(stopEvent, 0)
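The startEvent/stopEvent pairs recorded in each version above can be converted into millisecond timings as sketched below; this fragment assumes the events were created earlier with cudaEventCreate() and that time is a default real variable:

istat = cudaEventSynchronize(stopEvent)                   ! wait until stopEvent has completed
istat = cudaEventElapsedTime(time, startEvent, stopEvent) ! elapsed time in milliseconds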

The actual source code contains one other asynchronous version, version 3, which is identical to version 2 except that the kernel loop records a dummy event after each kernel launch. We explain the reason for this at the end of this article:

do i = 1, nStreams
  offset = (i-1)*strSize
  call kernel<<<strGridSize, blockSize, 0, stream(i)>>>(a_d, offset)
  ! Add a dummy event
  istat = cudaEventRecord(dummyEvent, stream(i))
enddo

The asynchronous cases are similar to the sequential case, except that there are multiple data transfers and kernel launches, distinguished by different streams and by an offset corresponding to the particular stream. In this code, we limit the number of streams to four, although for large arrays there is no reason a larger number of streams could not be used. Note that the same kernel is used in the sequential and asynchronous cases, as an offset is passed to the kernel to accommodate the data in different streams. The difference between the first two asynchronous versions is the order in which the copies and kernels are issued. The first version loops over the streams, and for each stream issues a host-to-device copy, a kernel launch, and a device-to-host copy. The second version issues all host-to-device copies, then all kernel launches, and then all device-to-host copies.
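The kernel itself is not reproduced in this article, but a kernel of the following form works with both the sequential launch and the streamed launches, since each launch simply indexes into its own chunk of the array. This is a sketch only; the actual kernel in the downloadable source may differ, and the elementwise update shown is illustrative:

attributes(global) subroutine kernel(a, offset)
  real :: a(*)
  integer, value :: offset
  integer :: i
  i = offset + (blockIdx%x-1)*blockDim%x + threadIdx%x
  a(i) = a(i) + 1.0   ! illustrative elementwise update
end subroutine kernel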

At this point you may be asking why we have three versions of the asynchronous case. The reason is that these variants perform differently on different hardware. Running this code on the NVIDIA Tesla C1060 produces:

Device: Tesla C1060
Time for sequential transfer and execute (ms): 12.92381
Time for asynchronous V1 transfer and execute (ms): 13.63690
Time for asynchronous V2 transfer and execute (ms): 8.845888
Time for asynchronous V3 transfer and execute (ms): 8.998560

and on the NVIDIA Tesla C2050 we get:

Device: Tesla C2050
Time for sequential transfer and execute (ms): 9.984512
Time for asynchronous V1 transfer and execute (ms): 5.735584
Time for asynchronous V2 transfer and execute (ms): 7.597984
Time for asynchronous V3 transfer and execute (ms): 5.735424

To decipher these results we need to understand a bit more about how devices schedule and execute tasks. CUDA devices contain engines for various kinds of tasks, and operations are queued up in these engines as they are issued. Dependencies between tasks in different engines are maintained, but within any one engine tasks are executed strictly in the order in which they were issued by the host thread, so the issue order matters. For example, the C1060 has a single copy engine and a single kernel engine. For the above code, the time line for execution on the device is shown schematically in the following Figure. In this schematic we have assumed that the times required for the host-to-device transfer, the kernel execution, and the device-to-host transfer are approximately equal; the kernel in the code provided was chosen to make these times comparable.

C1060 Execution Time Line



For the sequential version, there is no overlap in any of the operations, as one would expect. For the first asynchronous version, the order of execution in the copy engine is: H2D stream(1), D2H stream(1), H2D stream(2), D2H stream(2), and so forth. This is why we see no speedup when using the first asynchronous version on the C1060: tasks were issued to the copy engine in an order that precludes any overlap of kernel execution and data transfer. For versions two and three, however, where all the host-to-device transfers are issued before any of the device-to-host transfers, overlap is possible, as indicated by the lower execution times. From our schematic, we would expect the execution time of versions two and three to be 8/12 of the sequential version, or about 8.6 ms, which is close to what is observed in the timings above.

On the C2050, two features interact to cause behavior different from that observed on the C1060. The C2050 has two copy engines, one for host-to-device transfers and another for device-to-host transfers, in addition to a single kernel engine. Having two copy engines explains why the first asynchronous version achieves a good speedup on the C2050: the device-to-host transfer of data in stream(i) does not block the host-to-device transfer of data in stream(i+1), as it did on the C1060, because these two operations are handled by different engines on the C2050, as shown schematically in the bottom diagram of the Figure. From the schematic we would expect the execution time to be cut in half relative to the sequential version, which is roughly what is observed in the timings above.

This does not explain the performance degradation observed in the second asynchronous approach, however, which is related to the C2050's support for running multiple kernels concurrently. When multiple kernels are issued back-to-back, the scheduler tries to enable concurrent execution of these kernels, and as a result it delays a signal that normally occurs after each kernel completes (and that is responsible for kicking off the device-to-host transfer) until all kernels complete. So, while there is overlap between host-to-device transfers and kernel execution in the second version of our asynchronous code, there is no overlap between kernel execution and device-to-host transfers. From the Figure one would expect the overall time for the second asynchronous version to be 9/12 of the time for the sequential version, or about 7.5 ms, which is what we observe in the timings above. This situation can be rectified by recording a dummy CUDA event between kernel launches, which inhibits concurrent kernel execution but enables overlap of data transfers and kernel execution, as is done in the third asynchronous version.

Real codes will undoubtedly have more than one array that needs to be transferred, and more than one kernel to execute. But with some fundamental understanding of CUDA streams and the NVIDIA device transfer engines, the programmer should be able to find ways to hide the cost of data transfers behind computation on both the host and device.