Technical News from The Portland Group

Multi-GPU Programming Using CUDA Fortran, MPI, and GPUDirect—Part 1

1. Introduction

In this article we'll show the basics of multi-GPU programming using CUDA Fortran, MPI, and NVIDIA's GPUDirect. We'll use John Conway's Game of Life algorithm as the basis for the example program. We chose it because 1) it's fairly simple to explain, and 2) it's representative of "nearest neighbor" algorithms where the program time steps through a loop, evaluating each point in a grid based on the state of its neighbors. While each point can be evaluated in parallel, the processes must exchange boundaries (i.e. halos) when the grid is decomposed across multiple processes. This adds extra data movement, which often inhibits performance.

Note: if you are new to CUDA Fortran programming, you may benefit from reading our earlier introductory articles.

2. Game of Life

In the Game of Life, we have a two-dimensional grid where each point, or "cell", on the grid has one of two states: alive or dead. Although the starting state of each cell can be set to either alive or dead, we decided to assign it randomly.

Once started, each cell interacts with its eight neighbors. A live cell with one or no live neighbors dies; a live cell with four or more live neighbors also dies; a live cell with two or three live neighbors lives on. A dead cell with exactly three live neighbors becomes alive (is born).

The game runs over a series of time steps in which every cell is evaluated. All evaluations use the grid's current state: if a live cell dies during a step, its neighbors are still evaluated against its current state (live) rather than its new state (dead). The game continues until an ending condition is met. In this case, we chose to end the game once the cells stabilize (the live-cell count doesn't change for ten time steps), or after ten thousand steps.

The Basic Program

Full source code for Playing the Game of Life is available from the PGI website. Note that the final example uses Peer-to-Peer communication (GPUDirect), which is only available on NVIDIA devices with compute capability 2.0 or greater.

Here's a step-by-step description of the basic algorithm:

  1. Declare two, two-dimensional arrays, one to hold the current state of each cell, and one to hold the new state.
  2. Set the initial values in the current state array.
  3. Enter the time step loop.
  4. Determine the new state for every cell (given the rules outlined above), and save this state in the new state array.
  5. Copy the new state array back to the current state array.
  6. If the end condition has been met, exit the loop. Otherwise, continue to the next time step.

Cells on the edge of the grid can either wrap around to the other side, or the edge can be a dead zone. We chose the latter. When setting the initial state, we give each cell a 40% chance of starting out alive. (Edit the INIT_PERCENT variable in the Makefile to change this value.) We also use the host's random number generator for each of the versions described here. For the CUDA Fortran versions, it may be quicker to generate these random numbers on the GPU. However, to ensure that the results are consistent across versions, we'll use the same initial state.
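Just for reference, here's a minimal sketch of how the host-side initialization might look; the array name A and the 40% threshold come from the description above, and the exact code in the downloadable source may differ:

     ! give each interior cell a 40% chance of starting out alive;
     ! the border cells (the dead zone) are left at zero
     A = 0
     do iy=2,Ny-1
       do ix=2,Nx-1
         call random_number(r)          ! r is a real, uniform in [0,1)
         if (r .lt. 0.40) A(ix,iy) = 1
       end do
     end do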

Finally, we've also included a version that displays the cell states after each time step. It relies on basic OpenGL calls, using the f90gl library as the Fortran interface. A TAR file of pre-built f90gl libraries for 64-bit Linux is also available from the PGI website. If you need to build your own, you might want to refer to PGI's Guide to building f90gl on the Porting and Tuning page of the PGI website.

The CUDA Fortran Program

Now let's change our program to take advantage of the GPU. Our algorithm is highly parallel because each cell's state can be calculated independently. We'll turn the body of the nested DO loops into a CUDA kernel, and the loops themselves become the kernel's launch configuration. Let's start with the main time step loop. Here's the original loop:

      do, while (count.lt.10.and.steps.lt.10000)
          call life(A,B,Nx,Ny)
          A=B
          alive = sum(A)
          if (prev.lt.(alive+2).and.prev.gt.(alive-2)) then
              count = count + 1
          else
              count = 0
          endif
          prev = alive
          steps = steps + 1
      end do

Here's the CUDA Fortran translation:

     dOld=A
     dNew=0
     dimBlock = dim3(XBLOCKSIZE,YBLOCKSIZE,1)
     dimGrid = dim3((Nx+XBLOCKSIZE-1)/XBLOCKSIZE,(Ny+YBLOCKSIZE-1)/YBLOCKSIZE,1)
     do, while (count.lt.10.and.steps.lt.10000)
         call life_kernel<<<dimGrid,dimBlock>>>(dOld,dNew,Nx,Ny)
         call partsum<<<SUMSIZE,SUMSIZE>>>(dNew,dPsum,dim)
         call finalsum<<<1,SUMSIZE>>>(dPsum,dFinalSum,dim)
         dOld=dNew          ! device-to-device copy of the new state
         alive = dFinalSum  ! implicit device-to-host copy of the live count
         if (prev.lt.(alive+2).and.prev.gt.(alive-2)) then
            count = count + 1
         else
            count = 0
         endif
         prev = alive
         steps = steps + 1
     end do
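The partsum and finalsum kernels called above count the live cells on the device so we don't have to copy the entire grid back to the host. They aren't listed in this article, but here is a minimal sketch of what partsum might look like, assuming SUMSIZE is 256 and treating dNew as a one-dimensional array of dim = Nx*Ny elements. finalsum would be analogous: launched with a single block, it adds the SUMSIZE partial sums into one device scalar.

attributes(global) subroutine partsum(d, psum, n)

     implicit none
     integer, dimension(*), device :: d, psum
     integer, value :: n
     integer, parameter :: BS = 256      ! assumed value of SUMSIZE
     integer, shared :: cache(BS)
     integer :: i, tid, s, t

     tid = threadIdx%x

     ! grid-stride loop: each thread accumulates its share of the grid
     t = 0
     i = (blockIdx%x-1)*blockDim%x + tid
     do while (i .le. n)
       t = t + d(i)
       i = i + blockDim%x*gridDim%x
     end do
     cache(tid) = t
     call syncthreads()

     ! tree reduction within the block
     s = blockDim%x/2
     do while (s .ge. 1)
       if (tid .le. s) cache(tid) = cache(tid) + cache(tid+s)
       call syncthreads()
       s = s/2
     end do

     ! thread 1 writes this block's partial sum
     if (tid .eq. 1) psum(blockIdx%x) = cache(1)

end subroutine partsum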

Next, let's look at the nested DO loops inside the life subroutine. Again, first the original code:

subroutine life(Old, New, Nx, Ny)

     implicit none
     integer, dimension(Nx,Ny) :: Old, New
     integer, value :: Nx,Ny
     integer :: ix, iy, neighbors, state
     do ix=2,Nx-1
       do iy=2,Ny-1
           
         neighbors = Old(ix,iy-1) + &
                     Old(ix,iy+1) + &
                     Old(ix+1,iy-1) + &   
                     Old(ix+1,iy+1) + &   
                     Old(ix-1,iy-1) + &
                     Old(ix-1,iy+1) + &
                     Old(ix-1,iy) + &
                     Old(ix+1,iy)

         state = Old(ix,iy)

         if (state .eq. 0 .and. neighbors .eq. 3) then
           New(ix,iy) = 1  ! birth
         else if (state.gt.0.and.neighbors.ne.2.and.neighbors.ne.3) then
           New(ix,iy) = 0  ! death
         else
           New(ix,iy) = state ! no change
         end if
       end do
     end do
end subroutine life

And now the CUDA kernel version:

attributes(global) subroutine life_kernel(dOld, dNew, Nx, Ny)

     implicit none
     integer, dimension(Nx,Ny), device :: dOld, dNew
     integer, value :: Nx,Ny
     integer :: ix, iy, neighbors, state

     ix = (blockIdx%x-1)*blockDim%x + threadIdx%x
     iy = (blockIdx%y-1)*blockDim%y + threadIdx%y

     if (ix .gt. 1 .and. iy.gt.1 .and. &
         ix .lt. Nx .and. iy .lt. Ny) then

       neighbors = dOld(ix,iy-1) + &
                   dOld(ix,iy+1) + &
                   dOld(ix+1,iy-1) + &
                   dOld(ix+1,iy+1) + &
                   dOld(ix-1,iy-1) + &
                   dOld(ix-1,iy+1) + &
                   dOld(ix-1,iy) + &
                   dOld(ix+1,iy)

       state = dOld(ix,iy)

       if (state .eq. 0 .and. neighbors .eq. 3) then
         dNew(ix,iy) = 1  ! birth
       else if (state.eq.1.and.neighbors.ne.2.and.neighbors.ne.3) then
         dNew(ix,iy) = 0  ! death
       else
         dNew(ix,iy) = state ! no change
       end if
     end if
end subroutine life_kernel

Now we're ready to run these kernels on the GPU and see how much faster they run than on a CPU.


Grid Size     Version          Time in Seconds
1024x1024     CPU (serial)        55
              GPU (single)         8
4096x4096     CPU (serial)       872
              GPU (single)        72
16384x16384   CPU (serial)     13926
              GPU (single)      1170

Not bad, but recall our goal here is to find out if we can achieve even better performance by running on multiple GPUs. That's the next step. We'll start by trying MPI.

3. Using MPI

Setting Your Context

The first step in using MPI in CUDA Fortran is setting up your device Context. A Context is similar to a CPU process in that it creates an address space; all device memory allocations are mapped to this address space. In CUDA 4.0, the address space of a Context can span multiple devices via Unified Virtual Addressing (UVA). This means a single host process or thread can use multiple devices. However, for the first MPI example, we'll use only one device per MPI process. Note that while it may "work", setting up more than one Context on a single device is not supported. For that reason, and to avoid unexpected errors, we'll use a different device for each MPI process.

A Context is created upon first use of a device. So before we start using a device, we need to assign it to an MPI process using cudaSetDevice. Since there may be more than one MPI process per node, we need to gather the hostid from every process and distribute this information to all the processes, so each one can determine how many other processes are running on its node. Next, we'll determine how many devices there are on the node, and finally we'll do the device assignment.
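Gathering the host IDs isn't shown in the listing below, so here's a minimal sketch of one way to do it. It assumes hostids has been allocated with nprocs entries and that the C library's gethostid() routine is made available through an interface block:

     ! hypothetical interface to the C library routine that returns a
     ! per-host identifier
     interface
       integer(c_long) function gethostid() bind(c, name="gethostid")
         use iso_c_binding, only: c_long
       end function gethostid
     end interface
...
     ! every rank contributes its host ID so each process can see
     ! which ranks share its node
     hostid = int(gethostid())
     call MPI_ALLGATHER(hostid, 1, MPI_INTEGER, hostids, 1, MPI_INTEGER, &
                        MPI_COMM_WORLD, ierr)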

! determine who's on this node
     numlocal=0
     localprocs=0
     do i=1,nprocs
       if (hostid .eq. hostids(i)) then
         localprocs(i)=numlocal
         numlocal = numlocal+1
       endif
     enddo

! get the number of devices on this node
     ierr=cudaGetDeviceCount(numdev)

     if (numdev .lt. 1) then
       print *, 'ERROR: There are no devices available on this host.  ABORTING.', &
       myrank
       stop
     endif

! Print a warning if the number of devices is less than the number
! of processes on this node.  NVIDIA does not support more than one 
! Context on a single device.  Having multiple processes share the
! same device is not recommended.  It may work, but can also fail.

     if (numdev .lt. numlocal) then
       if (localprocs(myrank+1).eq.1) then
         ! print the message only once per node
         print *, 'WARNING: The number of local processes is greater than the &
                  &number of devices.', myrank
       endif
       mydev = mod(localprocs(myrank+1),numdev)
     else
       mydev = localprocs(myrank+1)
     endif

     ierr = cudaSetDevice(mydev)

Process Communication

Next we'll add code to communicate between MPI processes. As in the serial version, the master process generates the initial data set, decomposes the grid along the second dimension, and then distributes the sections to the MPI processes. Again, we do this to ensure we generate the same data set for all versions of the code, which we then use to verify results. While the cost of distributing the sections is minimal, this approach limits the maximum grid size to what fits on the master node; for larger grids, each node should generate its own data set. The code to distribute the sections is standard MPI, so we won't review it here beyond the sketch below.
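For completeness, here's a minimal sketch of what that distribution could look like. It assumes the grid divides evenly among the processes, ignores the extra halo columns, and uses hypothetical names (Asec for the local section, Nsec for its width); because Fortran arrays are column-major, a block of consecutive columns of A is contiguous and can be scattered directly:

     ! rank 0 scatters Nsec-column blocks of A(Nx,Ny) along the second
     ! dimension; each rank receives its block in Asec(Nx,Nsec)
     Nsec = Ny/numprocs
     call MPI_SCATTER(A, Nx*Nsec, MPI_INTEGER, Asec, Nx*Nsec, MPI_INTEGER, &
                      0, MPI_COMM_WORLD, ierr)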

During each time step, we need to pass the grid section boundaries (i.e. the halos) between the MPI processes. Because this is done from host code, we need to copy the halos from the device back to the host. Copying multi-dimensional array sections with the assignment operator, e.g. hArr(l:k,n:m)=dArr(l:k,n:m), is not well optimized: the transfer must either be broken into multiple contiguous copies or performed by copying the entire array and then doing the assignment. While the halo we are copying happens to be contiguous, the code required to detect this would need to run at runtime and would add significant overhead.

For best performance, avoid copying array sections with the assignment operator. If the data is contiguous, you can call cudaMemcpy directly. A more general-purpose approach is to first gather the 2D section on the device into a 1D array and then copy that 1D array in a single operation. While this adds some computational overhead, it works well for both contiguous and non-contiguous array sections.

...
attributes(global) subroutine gather2Dto1D (dOld, OneD, Dim, Nx, Ny)

     implicit none
     integer, dimension(Nx,Ny), device :: dOld
     integer, dimension(Nx), device :: OneD
     integer, value :: Dim,Nx,Ny
     integer :: ix

     ix = (blockIdx%x-1)*blockDim%x + threadIdx%x
     if (ix.lt.Nx) then
       OneD(ix) = dOld(ix,Dim)
     endif

end subroutine

attributes(global) subroutine scatter1Dto2D (dOld, OneD, Dim, Nx, Ny)

     implicit none
     integer, dimension(Nx,Ny), device :: dOld
     integer, dimension(Nx), device :: OneD
     integer, value :: Dim,Nx,Ny
     integer :: ix

     ix = (blockIdx%x-1)*blockDim%x + threadIdx%x
     if (ix.lt.Nx) then
       dOld(ix,Dim)=OneD(ix)
     endif

end subroutine
...

     if (myid.ne.0) then   
       call gather2Dto1D<<<dimGrid2,dimBlock2>>>(dOld,dLeft,2,Nx,Nt)
       oldLeft=dLeft
       CALL MPI_ISEND(oldLeft(:),Nx,MPI_INTEGER,myid-1,tag,MPI_COMM_WORLD,req(1),ierr)
     endif     
 
     if (myid.ne.(numprocs-1)) then
       call gather2Dto1D<<<dimGrid2,dimBlock2>>>(dOld,dRight,Nt-1,Nx,Nt)
       oldRight=dRight
       CALL MPI_ISEND(oldRight(:),Nx,MPI_INTEGER,myid+1,tag,MPI_COMM_WORLD,req(2),ierr)
     endif

     if (myid.ne.(numprocs-1)) then
       CALL MPI_RECV(newRight(:),Nx,MPI_INTEGER,myid+1,tag,MPI_COMM_WORLD,status,ierr)
       dRight=newRight
       call scatter1Dto2D<<<dimGrid2,dimBlock2>>>(dOld,dRight,Nt,Nx,Nt)
     endif

     if (myid.ne.0) then
       CALL MPI_RECV(newLeft(:),Nx,MPI_INTEGER,myid-1,tag,MPI_COMM_WORLD,status,ierr)
       dLeft=newLeft
       call scatter1Dto2D<<<dimGrid2,dimBlock2>>>(dOld,dLeft,1,Nx,Nt)
     endif
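One detail to note: the non-blocking MPI_ISEND calls above must be completed before oldLeft and oldRight are reused in the next time step. A minimal sketch of the matching waits (omitted from the excerpt above) would be:

     ! complete the outstanding sends before reusing the send buffers
     if (myid.ne.0) then
       CALL MPI_WAIT(req(1),status,ierr)
     endif
     if (myid.ne.(numprocs-1)) then
       CALL MPI_WAIT(req(2),status,ierr)
     endif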

4. Using Peer-to-Peer Transfers

In CUDA 4.0, NVIDIA introduced GPUDirect, which allows direct access to GPU memory from network devices, solid-state drives, and other devices. Note that GPUDirect is only supported on devices with compute capability 2.0 (CC20) or greater. This, combined with the ability for a single Context to span multiple devices, gives us the option of moving from one MPI process per device to one MPI process per node, using Peer-to-Peer transfers to move halos directly between the devices on the node.

Setting Up The Devices

Now that a single MPI process will manage all of the devices on the node, we no longer need the cudaSetDevice assignment logic from our MPI-only version. Instead, we need code to discover how many devices are on the node (specifically, CC20 devices) and then establish Peer-to-Peer communication between them.
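The device discovery code isn't listed here; a minimal sketch, assuming prop is declared as type(cudaDeviceProp) and devices is an integer array with room for one entry per device, might look like this:

     ! keep only the devices with compute capability 2.0 or greater
     ierr = cudaGetDeviceCount(numdev)
     ndev = 0
     do i=1,numdev
       ierr = cudaGetDeviceProperties(prop, i-1)   ! device numbers start at 0
       if (prop%major .ge. 2) then
         ndev = ndev + 1
         devices(ndev) = i-1
       endif
     enddo
     numdev = ndev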

! establish the Peer-to-Peer connections
     do i=1, numdev-1
       do j=i+1,numdev
         ierr = cudaDeviceCanAccessPeer(accessPeer1, devices(i), devices(j))
         ierr = cudaDeviceCanAccessPeer(accessPeer2, devices(j), devices(i))
         if (accessPeer1 .eq. 1 .and. accessPeer2 .eq. 1 ) then
           write(*,"('Peer-to-Peer capable between ',i0, ' and ', i0)") &
             devices(i), devices(j)
           ! enable peer access required for copies
           ierr = cudaSetDevice(devices(i))  
           ierr = cudaDeviceEnablePeerAccess(devices(j), 0) 
           ierr = cudaSetDevice(devices(j))
           ierr = cudaDeviceEnablePeerAccess(devices(i), 0)  
         else
           write(*,"('Peer-to-Peer not capable between ',i0, ' and ', i0)") &
             devices(i), devices(j)
           stop
         endif
       enddo
     enddo

Data Structures

With Peer-to-Peer communication established, the CUDA driver now provides a Unified Virtual Address (UVA) space covering all devices on the node, so every device allocation has a unique address. However, each device still needs its own grid section allocated in its own memory. In other words, from the host code's point of view, the grid can't be a single array that spans multiple devices; instead, we must divide it into multiple arrays, one per device.

While we could assume that each node has a specific number of devices, a more general purpose approach is to use a User Defined Type to contain our device arrays and then dynamically allocate an array of these types. Below is a deviceArray type containing the data needed for each device.

type deviceArray
  integer :: devnum, alive
  integer, allocatable, dimension(:,:), device :: dOld, dNew
  integer, allocatable, dimension(:)  , device :: dLeft, dRight, dPsum
  integer, allocatable, device                 :: dFinalSum
  integer                                      :: stream
end type deviceArray

Next, we'll allocate and initialize our device data. Be sure you set the device before performing any device operation. Otherwise, the operation may be performed on the wrong device!

     allocate(DA(numdev))
     do i=1,numdev
       DA(i)%devnum=devices(i)
       ierr = cudaSetDevice(DA(i)%devnum)
       allocate(DA(i)%dOld(Nx,Ns))
       allocate(DA(i)%dNew(Nx,Ns))
       allocate(DA(i)%dLeft(Nx))
       allocate(DA(i)%dRight(Nx))
       allocate(DA(i)%dPsum(SUMSIZE))
       allocate(DA(i)%dFinalSum)
       ierr = cudaStreamCreate(DA(i)%stream)
       start=((i-1)*(NsBS))+1
       end = start+NsBS+1
       DA(i)%dNew = 0
       DA(i)%dLeft = 0
       DA(i)%dRight = 0
       DA(i)%dOld(1:Nx,1:Ns) = A(1:Nx,start:end)
     enddo

Asynchronous Execution

While the kernel calls in our main loop don't change, we now need to launch them once for each device.

! launch the main kernel
     do i=1,numdev
       ierr = cudaSetDevice(DA(i)%devnum)
       call life_kernel<<<dimGrid,dimBlock,0,DA(i)%stream>>> &
       (DA(i)%dOld,DA(i)%dNew,Nx,Ns)
     enddo

CUDA kernel launches are asynchronous with respect to the host: the host code doesn't wait (block) for a kernel to finish before continuing, which allows life_kernel to be launched almost simultaneously on each device. By default, however, the host code does block when copying data back to the host. For example, if we copied back the final sum right after calling finalsum on each device, the host would block and the finalsum kernels would effectively execute sequentially. To perform non-blocking copies, use cudaMemcpyAsync instead of the assignment operator or cudaMemcpy.
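For illustration only (this is a sketch, not the code from the example program), the final counts could be copied back asynchronously into a page-locked host array and summed once all the streams have drained; counts is a hypothetical array declared with the pinned attribute, which is required for the copy to be truly asynchronous:

     do i=1,numdev
       ierr = cudaSetDevice(DA(i)%devnum)
       ierr = cudaMemcpyAsync(counts(i), DA(i)%dFinalSum, 1, &
              cudaMemcpyDeviceToHost, DA(i)%stream)
     enddo
     ! wait for every stream to finish, then reduce on the host
     do i=1,numdev
       ierr = cudaSetDevice(DA(i)%devnum)
       ierr = cudaStreamSynchronize(DA(i)%stream)
     enddo
     alive = sum(counts(1:numdev))

In the code below we take a simpler route: all of the kernels are launched first, and each device's count is then read back with a blocking assignment.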

! perform the sum reduction
     do i=1,numdev
       ierr = cudaSetDevice(DA(i)%devnum)
       call partsum<<<SUMSIZE,SUMSIZE,0,DA(i)%stream>>>(DA(i)%dNew,DA(i)%dPsum,dim)
     enddo

     do i=1,numdev
       ierr = cudaSetDevice(DA(i)%devnum)
       call finalsum<<<1,SUMSIZE,0,DA(i)%stream>>>(DA(i)%dPsum,DA(i)%dFinalSum,dim)
     enddo
       
     alive = 0
     do i=1,numdev
       ierr = cudaSetDevice(DA(i)%devnum)
       DA(i)%alive=DA(i)%dFinalSum
       alive = alive + DA(i)%alive
       ierr = cudaMemcpyAsync(DA(i)%dOld,DA(i)%dNew,dim, &
         cudaMemcpyDeviceToDevice,DA(i)%stream)
     enddo

Note we don't need to specify explicit synchronization between the partsum and finalsum kernels here because they are launched on the same stream.

Passing Halos

Finally, we need to pass the halos between devices. We'll do that by calling cudaMemcpyPeer.

! swap halos between device arrays
!  Right
     do i=1,numdev-1
       ierr = cudaMemcpyPeer(DA(i+1)%dOld(1:Nx,1), DA(i+1)%devnum, &
         DA(i  )%dOld(1:Nx,Ns-1),DA(i)%devnum, Nx)
     enddo

!  Left
     do i=2,numdev
       ierr = cudaMemcpyPeer(DA(i-1)%dOld(1:Nx,Ns), DA(i-1)%devnum, &
         DA(i  )%dOld(1:Nx,2), DA(i)%devnum, Nx)
     enddo

5. Performance Results

Below are the performance results shown earlier, augmented with the two multi-GPU tests. The tests were run on an Intel Core i7 2.7GHz system with two NVIDIA Tesla C2060 GPU devices at various input sizes.


Grid Size     Version                  Time in Seconds
1024x1024     CPU (serial)                55
              GPU (single)                 8
              2 MPI, 2 GPUs               10
              1 MPI, 2 GPUs w/ P2P         9
4096x4096     CPU (serial)               872
              GPU (single)                72
              2 MPI, 2 GPUs               40
              1 MPI, 2 GPUs w/ P2P        40
16384x16384   CPU (serial)             13926
              GPU (single)              1170
              2 MPI, 2 GPUs              604
              1 MPI, 2 GPUs w/ P2P       593

6. Conclusion

For this example, both multi-GPU versions scale well from one GPU to two at the larger grid sizes. While not quite a 2x speed-up, this is to be expected given the extra overhead of copying halo data between devices. We saw only a slight performance gain from the Peer-to-Peer version compared to the MPI-only version. However, we expect P2P performance to improve as more GPUs are added. In Part 2, we'll investigate this further by running both MPI versions on a large GPU-enabled cluster with GPUDirect-enabled network cards. Stay tuned!