Technical News from The Portland Group
Multi-GPU Programming Using CUDA Fortran, MPI, and GPUDirect—Part 1
In this article we'll show the basics of multi-GPU programming using CUDA Fortran, MPI, and NVIDIA's GPUDirect. We'll use John Conway's Game of Life algorithm as the basis for the example program. We chose it because 1) it's fairly simple to explain and 2) it's representative of "nearest neighbor" algorithms, where the program time steps through a loop evaluating each point in a grid based on the state of its neighbors. While each point can be evaluated in parallel, the processes must exchange boundaries (i.e. halos) when the grid is decomposed across multiple processes. This adds extra data movement, which often inhibits performance.
Note: if you are new to CUDA Fortran programming, you may benefit from reading our earlier articles:
- Introduction to PGI CUDA Fortran
- Understanding the CUDA Data Parallel Threading Model, A Primer
- Tuning a Monte Carlo Algorithm on GPUs
2. Game of Life
In the Game of Life, we have a two-dimensional grid where each point, or "cell", on the grid has one of two states, alive or dead. Although the starting state of each cell can be set to either alive or dead, we decided to assign it randomly.
Once started, each live cell begins interacting with its eight neighbors. If a live cell has one or no live neighbors, it dies. If a live cell has four or more live neighbors, it also dies. If a live cell has two or three live neighbors, it lives on. For a dead cell, if exactly three of its neighbors are alive, the cell is born.
The game runs over a series of time steps in which every cell is evaluated. Each cell is evaluated based on the current state of its neighbors: if a live cell dies, its neighboring cells are still evaluated using its current state (live) rather than its new state (dead). The game continues until an ending condition is met. In this case, we chose to end the game once the cells stabilize (the live cell count doesn't change by more than one for ten consecutive time steps), or after ten thousand steps.
The Basic Program
Full source code is available from the PGI website. Note that the final example uses Peer-to-Peer communication (GPUDirect) which is only available on NVIDIA devices with compute capability 2.0 or greater.
Here's a step-by-step description of the basic algorithm:
- Declare two two-dimensional arrays, one to hold the current state of each cell, and one to hold the new state.
- Set the initial values in the current state array.
- Enter the time step loop
- Determine the new state for every cell (given the rules outlined above), and save this state in the new state array.
- Copy the new state array back to the current state array.
- If the end condition has been met, exit the loop. Otherwise, continue to the next time step.
Cells on the edge of the grid can either wrap around to the other side, or the edge can be a dead zone. We chose the latter. When setting the initial state, we give each cell a 40% chance of starting out alive. (Edit the INIT_PERCENT variable in the Makefile to change this value.) We also use the host's random number generator for each of the versions described here. For the CUDA Fortran versions, it may be quicker to generate these random numbers on the GPU. However, to ensure that the results are consistent across versions, we'll use the same initial state.
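As a concrete illustration, the initialization could be sketched along these lines. This is a hypothetical sketch, not the actual code from the PGI website: the array and bound names (A, Nx, Ny) follow the later listings, and the hard-coded 0.40 stands in for the INIT_PERCENT Makefile variable.

```fortran
! Hypothetical sketch of the initialization step. The real program takes
! the percentage from INIT_PERCENT in the Makefile; 0.40 is hard-coded
! here for illustration only.
real :: r
integer :: ix, iy

A = 0                         ! edges stay dead (the "dead zone")
do iy = 2, Ny-1
   do ix = 2, Nx-1
      call random_number(r)   ! host random number generator
      if (r .lt. 0.40) A(ix,iy) = 1
   end do
end do
```

Because the host generator is used for every version, seeding it identically gives the same starting grid on the CPU, single-GPU, and multi-GPU runs.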
Finally, we've also included a version that displays the cell states after each time step. It relies on basic OpenGL calls using the f90gl library as the Fortran interface. A TAR file of pre-built f90gl libraries for 64-bit Linux is also available from the PGI website. If you need to build your own, you might want to refer to PGI's Guide to building f90gl on the Porting and Tuning page of the PGI website.
The CUDA Fortran Program
Now let's change our program to take advantage of the GPU. Our algorithm is highly parallel because each cell's state can be calculated independently. The body of the nested DO loops becomes the CUDA kernel, and the loops themselves become the kernel's launch configuration. Let's start by converting the first DO loop to use a CUDA kernel. Here's the original loop:
```fortran
do, while (count.lt.10.and.steps.lt.10000)
   call life(A,B,Nx,Ny)
   A=B
   alive = sum(A)
   if (prev.lt.(alive+2).and.prev.gt.(alive-2)) then
      count = count + 1
   else
      count = 0
   endif
   prev = alive
   steps = steps + 1
end do
```
Here's the CUDA Fortran translation:
```fortran
dOld=A
dNew=0
dimBlock = dim3(XBLOCKSIZE,YBLOCKSIZE,1)
dimGrid  = dim3((Nx+XBLOCKSIZE-1)/XBLOCKSIZE, &
                (Ny+YBLOCKSIZE-1)/YBLOCKSIZE,1)
do, while (count.lt.10.and.steps.lt.10000)
   call life_kernel<<<dimGrid,dimBlock>>>(dOld,dNew,Nx,Ny)
   call partsum<<<SUMSIZE,SUMSIZE>>>(dNew,dPsum,dim)
   call finalsum<<<1,SUMSIZE>>>(dPsum,dFinalSum,dim)
   alive = dFinalSum
   if (prev.lt.(alive+2).and.prev.gt.(alive-2)) then
      count = count + 1
   else
      count = 0
   endif
   prev = alive
   steps = steps + 1
   dOld = dNew    ! device-to-device copy of the new state
end do
```
Next, let's look at the second DO loop inside the life subroutine. Again, first the original code:
```fortran
subroutine life(Old, New, Nx, Ny)
  implicit none
  integer, dimension(Nx,Ny) :: Old, New
  integer, value :: Nx,Ny
  integer :: ix, iy, neighbors, state
  do ix=2,Nx-1
     do iy=2,Ny-1
        neighbors = Old(ix,iy-1) + &
                    Old(ix,iy+1) + &
                    Old(ix+1,iy-1) + &
                    Old(ix+1,iy+1) + &
                    Old(ix-1,iy-1) + &
                    Old(ix-1,iy+1) + &
                    Old(ix-1,iy) + &
                    Old(ix+1,iy)
        state = Old(ix,iy)
        if (state .eq. 0 .and. neighbors .eq. 3) then
           New(ix,iy) = 1       ! birth
        else if (state.gt.0.and.neighbors.ne.2.and.neighbors.ne.3) then
           New(ix,iy) = 0       ! death
        else
           New(ix,iy) = state   ! no change
        end if
     end do
  end do
end subroutine life
```
And now the CUDA kernel version:
```fortran
attributes(global) subroutine life_kernel(dOld, dNew, Nx, Ny)
  implicit none
  integer, dimension(Nx,Ny), device :: dOld, dNew
  integer, value :: Nx,Ny
  integer :: ix, iy, neighbors, state
  ix = (blockIdx%x-1)*blockDim%x + threadIdx%x
  iy = (blockIdx%y-1)*blockDim%y + threadIdx%y
  if (ix .gt. 1 .and. iy .gt. 1 .and. &
      ix .lt. Nx .and. iy .lt. Ny) then
     neighbors = dOld(ix,iy-1) + &
                 dOld(ix,iy+1) + &
                 dOld(ix+1,iy-1) + &
                 dOld(ix+1,iy+1) + &
                 dOld(ix-1,iy-1) + &
                 dOld(ix-1,iy+1) + &
                 dOld(ix-1,iy) + &
                 dOld(ix+1,iy)
     state = dOld(ix,iy)
     if (state .eq. 0 .and. neighbors .eq. 3) then
        dNew(ix,iy) = 1       ! birth
     else if (state.eq.1.and.neighbors.ne.2.and.neighbors.ne.3) then
        dNew(ix,iy) = 0       ! death
     else
        dNew(ix,iy) = state   ! no change
     end if
  end if
end subroutine life_kernel
```
Now we're ready to run these kernels on the GPU and see how much faster they run than on a CPU.
| Grid Size | Version | Time in Seconds |
|-----------|---------|-----------------|
Not bad, but recall our goal here is to find out if we can achieve even better performance by running on multiple GPUs. That's the next step. We'll start by trying MPI.
3. Using MPI
Setting Your Context
The first step in using MPI in CUDA Fortran is setting up your device Context. A Context is similar to a CPU process in that it creates an address space; all device memory allocations are mapped into this address space. In CUDA 4.0, the address space of a Context can span multiple devices via Unified Virtual Addressing (UVA). This means a single host process or thread can use multiple devices. However, for the first MPI example, we'll use only one device per MPI process. Note that while it may "work", setting up more than one Context on a single device is not supported. For that reason, and to avoid unexpected errors, we'll use a different device for each MPI process.
A Context is created upon first use of a device, so before we start using a device, we need to assign one to each MPI process using cudaSetDevice. Because there may be more than one MPI process per node, we'll gather the hostid from each process and then distribute this information to all processes, so that each one can determine how many other processes are running on its node. Next, we'll determine how many devices there are on the node, and finally we'll do the device assignment.
```fortran
! determine who's on this node
numlocal=0
localprocs=0
do i=1,nprocs
   if (hostid .eq. hostids(i)) then
      localprocs(i)=numlocal
      numlocal = numlocal+1
   endif
enddo

! get the number of devices on this node
ierr=cudaGetDeviceCount(numdev)
if (numdev .lt. 1) then
   print *, 'ERROR: There are no devices available on this host. ABORTING.', &
            myrank
   stop
endif

! Print a warning if the number of devices is less than the number
! of processes on this node.  NVIDIA does not support more than one
! Context on a single device.  Having multiple processes share the
! same device is not recommended.  It may work, but can also fail.
if (numdev .lt. numlocal) then
   if (localprocs(myrank+1).eq.1) then
      ! print the message only once per node
      print *, 'WARNING: The number of local processes is greater than the &
               &number of devices.', myrank
   endif
   mydev = mod(localprocs(myrank+1),numdev)
else
   mydev = localprocs(myrank+1)
endif
ierr = cudaSetDevice(mydev)
```
Next we'll add code to communicate between MPI processes. As in the serial version, the master process generates the initial data set, decomposes the grid across the second dimension, and then distributes the sections to the MPI processes. Again, we do this to ensure we generate the same data set for all versions of the code, which we can then use to verify results. While the cost of distributing the sections is minimal, this approach limits the maximum grid size to what the master process can hold; for larger grids, each node should generate its own section of the data set. The code to distribute the sections is standard MPI, so we won't review it here.
During each time step, we need to pass the grid section boundaries (i.e. the halos) between the MPI processes. Because this is done from host code, we need to copy the halos from the device back to the host. Copying multi-dimensional array sections, e.g. hArr(l:k,n:m)=dArr(l:k,n:m), is not well optimized: transfers between device and host must either be done in contiguous blocks, requiring multiple copies, or copy the entire array and then perform the assignment. While the halo we are copying happens to be contiguous, the code required to detect this at runtime would add significant overhead.
For best performance, avoid copying array sections using the assignment operator. If the data is contiguous, you can call cudaMemcpy directly. A more general-purpose approach is to first gather the 2D section on the device into a 1D array, and then copy that 1D array in a single operation. While this adds some computational overhead, it works well for both contiguous and non-contiguous array sections.
```fortran
...
attributes(global) subroutine gather2Dto1D (dOld, OneD, Dim, Nx, Ny)
  implicit none
  integer, dimension(Nx,Ny), device :: dOld
  integer, dimension(Nx), device :: OneD
  integer, value :: Dim,Nx,Ny
  integer :: ix
  ix = (blockIdx%x-1)*blockDim%x + threadIdx%x
  if (ix.lt.Nx) then
     OneD(ix) = dOld(ix,Dim)
  endif
end subroutine

attributes(global) subroutine scatter1Dto2D (dOld, OneD, Dim, Nx, Ny)
  implicit none
  integer, dimension(Nx,Ny), device :: dOld
  integer, dimension(Nx), device :: OneD
  integer, value :: Dim,Nx,Ny
  integer :: ix
  ix = (blockIdx%x-1)*blockDim%x + threadIdx%x
  if (ix.lt.Nx) then
     dOld(ix,Dim)=OneD(ix)
  endif
end subroutine
...
if (myid.ne.0) then
   call gather2Dto1D<<<dimGrid2,dimBlock2>>>(dOld,dLeft,2,Nx,Nt)
   oldLeft=dLeft
   CALL MPI_ISEND(oldLeft(:),Nx,MPI_INTEGER,myid-1,tag, &
                  MPI_COMM_WORLD,req(1),ierr)
endif
if (myid.ne.(numprocs-1)) then
   call gather2Dto1D<<<dimGrid2,dimBlock2>>>(dOld,dRight,Nt-1,Nx,Nt)
   oldRight=dRight
   CALL MPI_ISEND(oldRight(:),Nx,MPI_INTEGER,myid+1,tag, &
                  MPI_COMM_WORLD,req(2),ierr)
endif
if (myid.ne.(numprocs-1)) then
   CALL MPI_RECV(newRight(:),Nx,MPI_INTEGER,myid+1,tag, &
                 MPI_COMM_WORLD,status,ierr)
   dRight=newRight
   call scatter1Dto2D<<<dimGrid2,dimBlock2>>>(dOld,dRight,Nt,Nx,Nt)
endif
if (myid.ne.0) then
   CALL MPI_RECV(newLeft(:),Nx,MPI_INTEGER,myid-1,tag, &
                 MPI_COMM_WORLD,status,ierr)
   dLeft=newLeft
   call scatter1Dto2D<<<dimGrid2,dimBlock2>>>(dOld,dLeft,1,Nx,Nt)
endif
```
4. Using Peer-to-Peer Transfers
In CUDA 4.0, NVIDIA introduced GPUDirect, which allows direct access to GPU memory from network devices, solid-state drives, and other devices. Note that GPUDirect is only supported on devices with compute capability 2.0 (CC20) or greater. This, combined with the ability to have a single Context span multiple devices, gives us the option of modifying our program from using one MPI process per device to using one MPI process per node, with Peer-to-Peer transfers moving halos directly between the devices on the node.
Setting Up The Devices
Now that a single MPI process will manage all the devices on the node, we no longer need the cudaSetDevice assignment logic from our MPI-only version. Instead, we need code to discover how many devices are on the node (specifically, CC20 devices) and then establish Peer-to-Peer communication between them.
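A minimal sketch of the discovery step might look like the following. This is an assumption-laden illustration, not the article's actual code: the fixed-size devices array (and its cap of 16) is a simplification made for brevity.

```fortran
! Hypothetical sketch: collect the CC20-capable devices on this node.
! The fixed-size "devices" array is an assumed simplification.
integer :: ierr, ndev, numdev, i
integer, dimension(16) :: devices
type(cudaDeviceProp)   :: prop

ierr = cudaGetDeviceCount(ndev)
numdev = 0
do i = 0, ndev-1
   ierr = cudaGetDeviceProperties(prop, i)
   if (prop%major .ge. 2) then      ! compute capability 2.0 or greater
      numdev = numdev + 1
      devices(numdev) = i
   endif
enddo
```

The devices array gathered here is what the Peer-to-Peer setup loop below iterates over.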
```fortran
! establish the Peer-to-Peer connections
do i=1,numdev-1
   do j=i+1,numdev
      ierr = cudaDeviceCanAccessPeer(accessPeer1, devices(i), devices(j))
      ierr = cudaDeviceCanAccessPeer(accessPeer2, devices(j), devices(i))
      if (accessPeer1 .eq. 1 .and. accessPeer2 .eq. 1) then
         write(*,"('Peer-to-Peer capable between ',i0,' and ',i0)") &
               devices(i), devices(j)
         ! enable the peer access required for copies
         ierr = cudaSetDevice(devices(i))
         ierr = cudaDeviceEnablePeerAccess(devices(j), 0)
         ierr = cudaSetDevice(devices(j))
         ierr = cudaDeviceEnablePeerAccess(devices(i), 0)
      else
         write(*,"('Peer-to-Peer not capable between ',i0,' and ',i0)") &
               devices(i), devices(j)
         stop
      endif
   enddo
enddo
```
With Peer-to-Peer communication established, the CUDA device driver now provides a Unified Virtual Address space spanning all the devices on the node, so every device allocation has a unique address. However, each device still needs its own grid section allocated in its own memory. In other words, from the host code's perspective, the grid can't be a single array that spans multiple devices; instead, we must divide it into multiple arrays, one for each device.
While we could assume that each node has a specific number of devices, a more general-purpose approach is to use a user-defined type to hold our device arrays and then dynamically allocate an array of these types. Below is a deviceArray type containing the data needed for each device.
```fortran
type deviceArray
   integer :: devnum, alive
   integer, allocatable, dimension(:,:), device :: dOld, dNew
   integer, allocatable, dimension(:),   device :: dLeft, dRight, dPsum
   integer, allocatable, device :: dFinalSum
   integer :: stream
end type deviceArray
```
Next, we'll allocate and initialize our device data. Be sure you set the device before performing any device operation. Otherwise, the operation may be performed on the wrong device!
```fortran
allocate(DA(numdev))
do i=1,numdev
   DA(i)%devnum=devices(i)
   ierr = cudaSetDevice(DA(i)%devnum)
   allocate(DA(i)%dOld(Nx,Ns))
   allocate(DA(i)%dNew(Nx,Ns))
   allocate(DA(i)%dLeft(Nx))
   allocate(DA(i)%dRight(Nx))
   allocate(DA(i)%dPsum(SUMSIZE))
   allocate(DA(i)%dFinalSum)
   ierr = cudaStreamCreate(DA(i)%stream)
   start=((i-1)*(NsBS))+1
   end = start+NsBS+1
   DA(i)%dNew = 0
   DA(i)%dLeft = 0
   DA(i)%dRight = 0
   DA(i)%dOld(1:Nx,1:Ns) = A(1:Nx,start:end)
enddo
```
While the kernel calls in our main loop don't change, we now need to execute them once per device.
```fortran
! launch the main kernel
do i=1,numdev
   ierr = cudaSetDevice(DA(i)%devnum)
   call life_kernel<<<dimGrid,dimBlock,0,DA(i)%stream>>> &
        (DA(i)%dOld,DA(i)%dNew,Nx,Ns)
enddo
```
Calling a CUDA kernel is asynchronous to the host. This means the host code doesn't wait (block) for the CUDA kernel to finish before continuing. This allows the life_kernel to launch almost simultaneously on each device. However, by default the host code will block when copying data back to the host. For example, if we copied back the final sum right after calling our finalsum kernel, the host code would block causing the finalsum kernels to be executed sequentially. To perform non-blocking copies, use cudaMemcpyAsync instead of using the assignment operator or calling cudaMemcpy.
```fortran
! perform the sum reduction
do i=1,numdev
   ierr = cudaSetDevice(DA(i)%devnum)
   call partsum<<<SUMSIZE,SUMSIZE,0,DA(i)%stream>>>(DA(i)%dNew,DA(i)%dPsum,dim)
enddo
do i=1,numdev
   ierr = cudaSetDevice(DA(i)%devnum)
   call finalsum<<<1,SUMSIZE,0,DA(i)%stream>>>(DA(i)%dPsum,DA(i)%dFinalSum,dim)
enddo
alive = 0
do i=1,numdev
   ierr = cudaSetDevice(DA(i)%devnum)
   DA(i)%alive=DA(i)%dFinalSum
   alive = alive + DA(i)%alive
   ierr = cudaMemcpyAsync(DA(i)%dOld,DA(i)%dNew,dim, &
                          cudaMemcpyDeviceToDevice,DA(i)%stream)
enddo
```
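The scalar assignment DA(i)%alive=DA(i)%dFinalSum in the last loop is a blocking copy, so the devices are drained one at a time. A fully asynchronous variant could stage the sums into a pinned host buffer and synchronize once at the end. The sketch below is an assumption, not the article's code; hSum is a hypothetical host array declared with the pinned attribute.

```fortran
! Hypothetical sketch: non-blocking copies of each device's final sum
! into a pinned host buffer, followed by one synchronization pass.
integer, allocatable, pinned :: hSum(:)
...
allocate(hSum(numdev))
do i=1,numdev
   ierr = cudaSetDevice(DA(i)%devnum)
   ! queue the 1-element copy on this device's stream
   ierr = cudaMemcpyAsync(hSum(i), DA(i)%dFinalSum, 1, DA(i)%stream)
enddo
do i=1,numdev
   ierr = cudaSetDevice(DA(i)%devnum)
   ierr = cudaStreamSynchronize(DA(i)%stream)   ! wait for the copies
enddo
alive = sum(hSum)
```

For this small per-step transfer the gain is minor, but the pattern matters as device counts grow.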
Note we don't need to specify explicit synchronization between the partsum and finalsum kernels here because they are launched on the same stream.
Finally we need to pass halos between devices. We'll do that by calling cudaMemcpyPeer.
```fortran
! swap halos between device arrays
! Right
do i=1,numdev-1
   ierr = cudaMemcpyPeer(DA(i+1)%dOld(1:Nx,1),    DA(i+1)%devnum, &
                         DA(i  )%dOld(1:Nx,Ns-1), DA(i  )%devnum, Nx)
enddo
! Left
do i=2,numdev
   ierr = cudaMemcpyPeer(DA(i-1)%dOld(1:Nx,Ns), DA(i-1)%devnum, &
                         DA(i  )%dOld(1:Nx,2),  DA(i  )%devnum, Nx)
enddo
```
5. Performance Results
Below are the performance results shown earlier, augmented with the two MPI tests. The tests were run at various input sizes on an Intel Core i7 2.7 GHz system with two NVIDIA Tesla C2060 GPU devices.
| Grid Size | Version | Time in Seconds |
|-----------|---------|-----------------|
| | 2 MPI 2 GPUs | 10 |
| | 1 MPI 2 GPUs w P2P | 9 |
| | 2 MPI 2 GPUs | 40 |
| | 1 MPI 2 GPUs w P2P | 40 |
| | 2 MPI 2 GPUs | 604 |
| | 1 MPI 2 GPUs w P2P | 593 |
For this example, both versions see good scaling between one and two GPUs. While not quite a 2x speed-up, this is to be expected given the extra overhead incurred when copying data between devices. We saw only a slight performance gain from the Peer-to-Peer version compared to the MPI-only version. However, we expect P2P performance to improve as more GPUs are added. In Part 2, we'll investigate this further by running both MPI versions on a large GPU-enabled cluster with GPUDirect-enabled network cards. Stay tuned!