PGI User Forum

If using multiple GPUs, can I set device variables globally?
PGI User Forum Forum Index -> Accelerator Programming
mkcolg



Joined: 30 Jun 2004
Posts: 6146
Location: The Portland Group Inc.

Posted: Mon Jul 25, 2011 12:45 pm

Quote:
Does cula_initialize() create a separate context as well?
No, this should be fine.

Maybe something else is happening. Are you trying to call the CULA host routines with device data, or the CULA device routines with host data? Can you write up a simple example that shows your error?
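
For reference, the two interfaces expect their arrays in different memory spaces; a minimal sketch (the argument lists match the test code below):

Code:
! host interface: all arrays live in host memory
status = cula_sgesv(n, nrhs, a, n, ipiv, b, n)
call check_status(status)

! device interface: arrays must be declared with the device attribute
status = cula_device_sgesv(n, nrhs, a_dev, n, ipiv_dev, b_dev, n)
call check_status(status)

Passing one kind of array to the other interface is exactly the mismatch to rule out.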

To test cula_initialize, I took the following CULA example program and added some basic MPI calls. Granted, I'm only running on a single node with two GPUs, but it seems to work for me.

Code:
% cat test_cula.cuf
       module cula_test
     
            use cudafor
            contains
                         
            ! gpu error reporting routine
            subroutine check_status(status)
           
                integer status
                integer info
                integer cula_geterrorinfo
                info = cula_geterrorinfo()
                if (status .ne. 0) then
                    if (status .eq. 7) then
                        write(*,*) 'invalid value for parameter ', info
                    else if (status .eq. 8) then
                        write(*,*) 'data error (', info ,')'
                    else if (status .eq. 9) then
                        write(*,*) 'blas error (', info ,')'
                    else if (status .eq. 10) then
                        write(*,*) 'runtime error (', info ,')'
                    else
                        call cula_getstatusstring(status)
                    endif
                    stop 1
                end if
               
            end subroutine
           
            ! cpu test (baseline)
            subroutine do_cpu_test(n,nrhs,ain,bin)
               
                ! input
                real,dimension(:,:) :: ain,bin
               
                ! allocations
                real,dimension(:,:),allocatable :: a,b,ans
                integer,dimension(:),allocatable :: ipiv
                integer n,nrhs
                integer c1,c2,cr,cm
                real norm
               
                ! back up input for reconstruction test
                allocate( a(n,n), b(n,nrhs), ipiv(n), ans(n,nrhs) )
                a = ain
                b = bin               
               
                ! start test
                call system_clock( c1, cr, cm )
                print *, 'starting cpu test...'
                               
                ! call lapack solver
                call sgesv(n,nrhs,a,n,ipiv,b,n,info)
               
                ! stop test
                call system_clock( count=c2 )
                print *, '  runtime:', 1.e3*real(c2-c1) / real(cr), 'ms'
                print *, '  gflops:', (0.66*n**3.) / (real(c2-c1) / real(cr)) / (1.e9)
               
                ! check answer
                ans = bin;
                call sgemm('n','n',n,nrhs,n,1.0,ain,n,b,n,-1.0,ans,n)
                norm = slange('1',n,nrhs,ans,n,work) / real(n)
                print *, '  error:', norm
                print *, ''
               
                ! cleanup
                deallocate(a,b,ipiv,ans)
               
            end subroutine do_cpu_test
           
            ! cula test (host interface)
            subroutine do_cula_host_test(n,nrhs,ain,bin)
               
                ! input
                real,dimension(:,:) :: ain,bin
               
                ! allocations (all on host)
                real,dimension(:,:),allocatable :: a,b,ans
                integer,dimension(:),allocatable :: ipiv
                integer n,nrhs,status
                integer c1,c2,cr,cm
                real norm
               
                ! back up input for reconstruction test
                allocate( a(n,n), b(n,nrhs), ipiv(n), ans(n,nrhs) )
                a = ain
                b = bin               
               
                ! start test
                call system_clock( c1,cr,cm )
                print *, 'starting cula (host interface) test...'
                               
                ! call cula solver (host interface)
                status = cula_sgesv(n,nrhs,a,n,ipiv,b,n)
                call check_status(status)
               
                ! stop test
                call system_clock( count=c2 )
                print *, '  runtime:', 1.e3*real(c2-c1) / real(cr), 'ms'
                print *, '  gflops:', (0.66*n**3.) / (real(c2-c1) / real(cr)) / (1.e9)
               
                ! check answer
                ans = bin;
                call sgemm('n','n',n,nrhs,n,1.0,ain,n,b,n,-1.0,ans,n)
                norm = slange('1',n,nrhs,ans,n,work) / real(n)
                print *, '  error:', norm
                print *, ''
               
                ! cleanup
                deallocate(a,b,ipiv,ans)
               
            end subroutine do_cula_host_test
                       
            ! cula test (device interface)
            subroutine do_cula_device_test(n,nrhs,ain,bin)
           
                ! input
                real,dimension(:,:) :: ain,bin
               
                ! allocations (all on host)
                real,dimension(:,:),allocatable :: a,b,ans
                integer n,nrhs,status
                integer c1,c2,cr,cm
                real norm
               
                ! gpu memory
                real,device,dimension(:,:),allocatable :: a_dev,b_dev
                integer,device,dimension(:),allocatable :: ipiv_dev
               
                ! back up input for reconstruction test
                allocate( a(n,n), b(n,nrhs), ans(n,nrhs) )
                a(1:n,1:n) = ain
                b(1:n,1:nrhs) = bin               
               
                ! allocate gpu memory
                allocate( a_dev(n,n), b_dev(n,nrhs), ipiv_dev(n) )
               
                ! start test
                call system_clock( c1,cr,cm )
                print *, 'starting cula (device interface) test...'
               
                ! copy memory to gpu
                a_dev = a
                b_dev = b
               
                ! call cula solver (device interface)
                status = cula_device_sgesv(n,nrhs,a_dev,n,ipiv_dev,b_dev,n)
                call check_status(status)
               
                ! copy answer to host
                b = b_dev
               
                ! stop test
                call system_clock( count=c2 )
                print *, '  runtime:', 1.e3*real(c2-c1) / real(cr), 'ms'
                print *, '  gflops:', (0.66*n**3.) / (real(c2-c1) / real(cr)) / (1.e9)
               
                ! check answer
                ans(1:n,1:nrhs) = bin;
                call sgemm('n','n',n,nrhs,n,1.,ain,n,b,n,-1.,ans,n)
                norm = slange('1',n,nrhs,ans,n,work) / real(n)
                print *, '  error:', norm
                print *, ''
               
                ! cleanup
                deallocate(a,b,ans)
                deallocate(a_dev,b_dev,ipiv_dev)
               
            end subroutine do_cula_device_test
           
        end module cula_test
       
        ! main program

        program cula

            use cula_test
            include 'mpif.h'
 
            real error,eps
           
            ! Host memory
            real,dimension(:,:),allocatable :: a, b
            integer n, info, i, j, status
            integer :: ierr, myid, numprocs, rc, mydev, numdev

            call MPI_INIT( ierr )
            call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
            call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )
   
            ierr = cudaGetDeviceCount(numdev)
            mydev = mod(myid,numdev)
    
            print *, "Process ", myid, " of ", numprocs, " took GPU: ", mydev
            ierr = cudaSetDevice(mydev)

            n = 10000
            nrhs = 1

            print *,'cula + pgfortran test (matrix solve)'
            print *,'  array size: ', n, ' by ', n
            print *,'  right hand sides: ', nrhs
            print *,''
            allocate( a(n,n), b(n,nrhs) )
                                   
            ! initialize a and b
            call random_number(a)
            call random_number(b)
           
            ! Make sure a() isn't singular
            do i=1,n
                a(i,i) = 10. * a(i,i) + 10.
            enddo
                       
            ! initialize cula
            status = cula_initialize()
            call check_status(status)
           
            ! do cpu test (baseline)
            call do_cpu_test(n,nrhs,a,b)
                                           
            ! do gpu test (host interface)
            call do_cula_host_test(n,nrhs,a,b)
           
            ! do gpu test (device interface)
            call do_cula_device_test(n,nrhs,a,b)

            call MPI_FINALIZE(rc)
             
        end program cula

% pgf90 -Mmpi -V11.7 -fast -o test_cula.out test_cula.cuf -L/usr/local/cula/lib64 -lcula_pgfortran -lcula -lcublas -L /usr/pgi/linux86-64/2011/acml/4.4.0/lib -lacml
% mpirun -machinefile machines.LINUX -np 4 test_cula.out
 Process             0  of             4  took GPU:             0
 cula + pgfortran test (matrix solve)
   array size:         10000  by         10000
   right hand sides:             1
 
 starting cpu test...
   runtime:    33464.90     ms
   gflops:    19.72215   
   error:   4.8381262E-03
 
 starting cula (host interface) test...
   runtime:    15007.96     ms
   gflops:    43.97665   
   error:   4.5987219E-03
 
 starting cula (device interface) test...
   runtime:    1733.560     ms
   gflops:    380.7195   
   error:   4.5987219E-03
 
 Process             3  of             4  took GPU:             1
 cula + pgfortran test (matrix solve)
   array size:         10000  by         10000
   right hand sides:             1
 
 starting cpu test...
   runtime:    33225.06     ms
   gflops:    19.86452   
   error:   4.8381262E-03
 
 starting cula (host interface) test...
   runtime:    6606.524     ms
   gflops:    99.90125   
   error:   4.5987219E-03
 
 starting cula (device interface) test...
   runtime:    8241.342     ms
   gflops:    80.08405   
   error:   4.5987219E-03
 
 Process             1  of             4  took GPU:             1
 cula + pgfortran test (matrix solve)
   array size:         10000  by         10000
   right hand sides:             1
 
 starting cpu test...
   runtime:    33354.10     ms
   gflops:    19.78767   
   error:   4.8381262E-03
 
 starting cula (host interface) test...
   runtime:    15385.24     ms
   gflops:    42.89825   
   error:   4.5987219E-03
 
 starting cula (device interface) test...
   runtime:    1710.793     ms
   gflops:    385.7860   
   error:   4.5987219E-03
 
 Process             2  of             4  took GPU:             0
 cula + pgfortran test (matrix solve)
   array size:         10000  by         10000
   right hand sides:             1
 
 starting cpu test...
   runtime:    33443.48     ms
   gflops:    19.73479   
   error:   4.8381262E-03
 
 starting cula (host interface) test...
   runtime:    4807.774     ms
   gflops:    137.2777   
   error:   4.5987219E-03
 
 starting cula (device interface) test...
   runtime:    8166.234     ms
   gflops:    80.82061   
   error:   4.5987219E-03
 
bsb3166



Joined: 27 Jun 2011
Posts: 10

Posted: Wed Aug 31, 2011 10:37 am    Post subject: Performance of CULA drops when running on multiple CPUs + multiple GPUs

mkcolg wrote:
To test cula_initialize, I took the following CULA example program and added some basic MPI calls. Granted, I'm only running on a single node with two GPUs, but it seems to work for me.

[test code and output quoted in full above]


Thanks a lot for your test code.

But I found a problem in the test results, both in yours and in mine.

The performance of cula_sgesv and cula_device_sgesv on one CPU with one GPU should be around 100 Gflops and 350 Gflops.

You would expect that when running on multiple cores with multiple GPUs, the performance of the CULA host and device routines would stay the same as when they are executed on one core with one GPU.

However, from your test results, I see the performance of the host CULA routine on CPUs 0, 1, 2, 3 is 43.9, 42.8, 137.2, and 99.9 Gflops. That is far below the performance when running on one GPU. Why does the performance drop so much? It is supposed to hold steady.

Do you think the problem is in the CULA routines or the devices?

I also ran some tests on CULA with code similar to yours, and found the same problem, with an even worse performance drop.

I also tested similar routines in MAGMA; there was no performance drop when running on multiple GPUs.

Below are my test results on 4 GPUs, compared with a run on 1 CPU with 1 GPU.

My test code is slightly different from yours: I put the MPI initialization in a separate file. I compile the MPI files with mpif90 (pgf90 with the OpenMPI implementation) and the .cuf code with pgfortran.

1 CPU + 1 GPU:
Code:
 cula + pgfortran test (matrix solve)
   array size:         10000  by         10000
   right hand sides:             1

 starting cpu test...
   runtime:    36437.05     ms
   gflops:    18.11343
   error:   5.0552888E-03

 starting cula (host interface) test...
   runtime:    4183.666     ms
   gflops:    157.7564
   error:   1.2658446E-03

 starting cula (device interface) test...
   runtime:    3822.783     ms
   gflops:    172.6491
   error:   1.2658446E-03



4 CPUs + 4 GPUs:
Code:
Process             0  of             4  took GPU:             0
 Process             3  of             4  took GPU:             3
 Process             2  of             4  took GPU:             2
 Process             1  of             4  took GPU:             1
 cula + pgfortran test (matrix solve)
   array size:         10000  by         10000
   right hand sides:             1

 cula + pgfortran test (matrix solve)
   array size:         10000  by         10000
   right hand sides:             1

 cula + pgfortran test (matrix solve)
   array size:         10000  by         10000
   right hand sides:             1

 cula + pgfortran test (matrix solve)
   array size:         10000  by         10000
   right hand sides:             1

 starting cpu test...
 starting cpu test...
 starting cpu test...
 starting cpu test...
   runtime:    35684.69     ms
   gflops:    18.49533
   error:   5.0552888E-03

   runtime:    35932.07     ms
   gflops:    18.36799
   runtime:    35987.90     ms
   gflops:    18.33949
   error:   5.0552888E-03

   error:   5.0552888E-03

   runtime:    37404.25     ms
   gflops:    17.64506
   error:   5.0552888E-03

 starting cula (host interface) test...
 starting cula (host interface) test...
 starting cula (host interface) test...
 starting cula (host interface) test...
   runtime:    7521.451     ms
   gflops:    87.74902
   error:   1.2658446E-03

   runtime:    170592.8     ms
   gflops:    3.868863
   error:   1.2658446E-03

   runtime:    178027.4     ms
   runtime:    178026.6     ms

   gflops:    3.707310
   gflops:    3.707294
   error:   1.2658446E-03

   error:   1.2658446E-03

 starting cula (device interface) test...
 starting cula (device interface) test...
 starting cula (device interface) test...
 starting cula (device interface) test...
   runtime:    241970.9     ms
   gflops:    2.727601
   error:   1.2658446E-03

   runtime:    242817.9     ms
   gflops:    2.718086
   error:   1.2658446E-03

   runtime:    245411.0     ms
   gflops:    2.689365
   error:   1.2658446E-03

   runtime:    245853.4     ms
   gflops:    2.684527
   error:   1.2658446E-03


Compared with the 157.75 and 172.64 Gflops on a single CPU with a single GPU, the performance on 4 GPUs drops to about 3 Gflops and 2 Gflops for the host and device CULA sgesv routines.

I'm working on a project that will use these CULA routines with MPI and multiple GPUs. With this issue, there is definitely no advantage to using the CULA routines instead of the same routines in MKL (the MKL routine I tested runs at 18 Gflops).

Do you have any experience with this? I'd really appreciate any advice you can give.

My test code:
test_cula.cuf
Code:

module cula_test

            ! ... body identical to the cula_test module in Mat's post above ...

        end module cula_test
       
        ! main program

        program cula

            use cula_test
            use more_mpi

           
            ! Host memory
            real,dimension(:,:),allocatable :: a, b
            integer n, info, i, j, status
            integer :: rc, mydev, numdev

           
            call init
   
            ierr = cudaGetDeviceCount(numdev)
            mydev =  mod(cpuid,numdev)
     
            print *, "Process ", cpuid, " of ", numprocs, " took GPU: ", mydev
            !ierr = cudaSetDevice(mydev)
            ierr = cula_selectdevice(mydev)
            call check_status(ierr)

            n = 10000
            nrhs = 1

            print *,'cula + pgfortran test (matrix solve)'
            print *,'  array size: ', n, ' by ', n
            print *,'  right hand sides: ', nrhs
            print *,''
            allocate( a(n,n), b(n,nrhs) )
                                   
            ! initialize a and b
            call random_number(a)
            call random_number(b)
           
            ! Make sure a() isn't singular
            do i=1,n
                a(i,i) = 10. * a(i,i) + 10.
            enddo
           
            call MPI_Barrier(mpi_comm_world,ierr)
           
            ! initialize cula
            status = cula_initialize()
            call check_status(status)
           
            ! do cpu test (baseline)
            call do_cpu_test(n,nrhs,a,b)
            call MPI_Barrier(mpi_comm_world,ierr)
                               
            ! do gpu test (host interface)
            call do_cula_host_test(n,nrhs,a,b)
            call MPI_Barrier(mpi_comm_world,ierr)
           
            ! do gpu test (device interface)
            call do_cula_device_test(n,nrhs,a,b)
            call MPI_Barrier(mpi_comm_world,ierr)

            call MPI_FINALIZE(rc)
             
        end program cula


init.f

Code:
subroutine init
   use more_mpi
   call mpi_init(ierr)
   call mpi_comm_rank(mpi_comm_world,cpuid,ierr)
   call mpi_comm_size(mpi_comm_world,numprocs,ierr)
   call mpi_get_processor_name(processor_name,namelen,ierr)
end subroutine init


more_mpi.f
Code:
module more_mpi
   include 'mpif.h'
   integer :: ierr,cpuid,numprocs,namelen !mpi
   character(len=100) processor_name
end module


makefile
Code:
.SUFFIXES: .cuf .o

L1= test_cula.o more_mpi.o init.o
L2=

CULAINCLUDES= -I${CULA_INC_PATH}
CULALIBPATH64= -L${CULA_LIB_PATH_64}


CUDAINCLUDES= -I${CUDA_INC_PATH}
CUDALIBPATH64= -L${CUDA_LIB_PATH_64}

CUDALIBS= -lcudart -lcuda


GPULIBS= -lcula_pgfortran #-lcula -lcublas -lcudart

PGFLAGS= -Mfree -O3

#CUDA= -ta=nvidia -Mcuda
CUDA=


SOPT=
LINK1=  /opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_scalapack_lp64.a \
       /opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_intel_lp64.a \
       /opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_blacs_openmpi_lp64.a \
       /opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_core.a \
       /opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_sequential.a \
       /opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_core.a \
       /opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_sequential.a \
       /opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_core.a \
       -lpthread

#LINK_CU=  /opt/pgi/linux86-64/10.6/lib/libcudafor.a
LINK_CU=  /opt/pgi/linux86-64/11.5/lib/libcudafor.a



PF90= mpif90

PGFOR= pgfortran


PGA_EX= cula_test_mgpus

darwin: $(L1) $(L2)
   $(PF90) $(SOPT) $(PGFLAGS) $(L1) $(L2) $(CUDAINCLUDES) $(CUDALIBPATH64) $(CUDALIBS) $(CULAINCLUDES) $(CULALIBPATH64) $(GPULIBS) $(LINK1) $(LINK_CU) -o $(PGA_EX)

.f.o:
   $(PF90) $(SOPT) $(PGFLAGS) $(CUDAINCLUDES) $(CUDALIBPATH64) $(CUDALIBS) $(CULAINCLUDES) $(CULALIBPATH64) $(GPULIBS) -c $<

.cuf.o:
   $(PGFOR) $(SOPT) $(PGFLAGS) $(CUDAINCLUDES) $(CUDALIBPATH64) $(CUDALIBS) $(CULAINCLUDES) $(CULALIBPATH64) $(GPULIBS) -c $<

test_cula.o: test_cula.cuf init.o more_mpi.o

more_mpi.o: more_mpi.f

init.o: init.f more_mpi.o


clean:
   /bin/rm -f *o *mod $(L1) $(L2) $(PGA_EX)
   
del:
   rm -f *.mio.mines.edu *.000 *.001 *.002 *.003

move:
   mv *.edu ./result
John EMP



Joined: 30 Aug 2010
Posts: 1

Posted: Thu Sep 01, 2011 6:45 am

mkcolg wrote:
Hi Chong,

Quote:
cula_status = cula_selectdevice(gpuid)
I haven't done much with CULA myself, but I don't think you want to use this function. I'm thinking it creates a separate context, performs its own initialization, and either wipes out or prevents ours from being performed. I'm not sure, but I'd try just using cudaSetDevice and the CULA CUDA Fortran module (i.e. 'use cula') as shown in the articles I posted about.

- Mat


Hi, I'm from the CULA team. I hadn't realized there were relevant threads on these forums, so we'll try to keep a better eye on them. Please feel free to ping us when a thread needs a response.

Getting to this question: cula_selectdevice is merely a convenience passthrough for the cudaSetDevice routine, for anybody who might not want to install the CUDA toolkit for just that one routine. Since we're looking to use CULA's device interface here, it is perfectly fine to use either CULA's or CUDA's. Again, culaSelectDevice adds no functionality beyond cudaSetDevice, which is why you still need to call cula_initialize afterwards.
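
In other words, the per-rank sequence would look roughly like this (a minimal sketch assembled from the routines used earlier in this thread; per the above, either device-selection call should behave the same):

Code:
! pick a GPU for this MPI rank, then initialize CULA on it
mydev  = mod(myid, numdev)
ierr   = cudaSetDevice(mydev)         ! CUDA toolkit route
!status = cula_selectdevice(mydev)    ! equivalent CULA passthrough
status = cula_initialize()            ! still required after device selection
call check_status(status)
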
mkcolg



Joined: 30 Jun 2004
Posts: 6146
Location: The Portland Group Inc.

Posted: Tue Sep 06, 2011 1:43 pm

Quote:
However, from your test results, I see the performance of the host CULA routine on CPUs 0, 1, 2, 3 is 43.9, 42.8, 137.2, and 99.9 Gflops. That is far below the performance when running on one GPU. Why does the performance drop so much? It is supposed to hold steady.
This is because I was running 4 MPI processes but only have 2 GPUs. When I run with only 2 MPI processes, I get the expected 350 Gflops.

With your test code, I get similar results: there is a slight slowdown going from 1 to 2 MPI processes, and a much larger slowdown when going to 4, since I only have 2 GPUs. Why you're seeing such a large slowdown, I'm not sure. Granted, I'm using MPICH rather than OpenMPI, but I doubt that's the cause.
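
For what it's worth, each rank could flag oversubscription up front; a minimal sketch (assuming the cudafor calls and MPI variables from the test code above; the warning text is illustrative):

Code:
! warn when MPI ranks outnumber the GPUs on this node
ierr = cudaGetDeviceCount(numdev)
if (numprocs .gt. numdev .and. myid .eq. 0) then
    print *, 'warning: ', numprocs, ' ranks sharing ', numdev, ' GPUs'
endif
mydev = mod(myid, numdev)
ierr  = cudaSetDevice(mydev)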

- Mat

Code:
% mpirun -machinefile machines.LINUX -np 1 test_cula.out
 Process             0  of             1  took GPU:             0
 cula + pgfortran test (matrix solve)
   array size:         10000  by         10000
   right hand sides:             1

 starting cpu test...
   runtime:    32975.53     ms
   gflops:    20.01484
   error:   4.8381262E-03

 starting cula (host interface) test...
   runtime:    1804.901     ms
   gflops:    365.6710
   error:   4.5987219E-03

 starting cula (device interface) test...
   runtime:    1691.823     ms
   gflops:    390.1117
   error:   4.5987219E-03

% mpirun -machinefile machines.LINUX -np 2 test_cula.out
 Process             0  of             2  took GPU:             0
 cula + pgfortran test (matrix solve)
   array size:         10000  by         10000
   right hand sides:             1

 starting cpu test...
   runtime:    32948.79     ms
   gflops:    20.03108
   error:   4.8381262E-03

 starting cula (host interface) test...
   runtime:    1997.535     ms
   gflops:    330.4072
   error:   4.5987219E-03

 starting cula (device interface) test...
   runtime:    1791.703     ms
   gflops:    368.3646
   error:   4.5987219E-03

 Process             1  of             2  took GPU:             1
 cula + pgfortran test (matrix solve)
   array size:         10000  by         10000
   right hand sides:             1

 starting cpu test...
   runtime:    33058.83     ms
   gflops:    19.96441
   error:   4.8381262E-03

 starting cula (host interface) test...
   runtime:    1940.003     ms
   gflops:    340.2057
   error:   4.5987219E-03

 starting cula (device interface) test...
   runtime:    1760.786     ms
   gflops:    374.8326
   error:   4.5987219E-03

% mpirun -machinefile machines.LINUX -np 4 test_cula.out
 Process             0  of             4  took GPU:             0
 cula + pgfortran test (matrix solve)
   array size:         10000  by         10000
   right hand sides:             1

 starting cpu test...
   runtime:    33447.90     ms
   gflops:    19.73218
   error:   4.8381262E-03

 starting cula (host interface) test...
   runtime:    7527.005     ms
   gflops:    87.68427
   error:   4.5987219E-03

 starting cula (device interface) test...
   runtime:    4662.672     ms
   gflops:    141.5497
   error:   4.5987219E-03

 Process             2  of             4  took GPU:             0
 cula + pgfortran test (matrix solve)
   array size:         10000  by         10000
   right hand sides:             1

 starting cpu test...
   runtime:    33480.27     ms
   gflops:    19.71311
   error:   4.8381262E-03

 starting cula (host interface) test...
   runtime:    7160.245     ms
   gflops:    92.17562
   error:   4.5987219E-03

 starting cula (device interface) test...
   runtime:    3568.168     ms
   gflops:    184.9689
   error:   4.5987219E-03

 Process             1  of             4  took GPU:             1
 cula + pgfortran test (matrix solve)
   array size:         10000  by         10000
   right hand sides:             1

 starting cpu test...
   runtime:    33291.96     ms
   gflops:    19.82460
   error:   4.8381262E-03

 starting cula (host interface) test...
   runtime:    6522.568     ms
   gflops:    101.1871
   error:   4.5987219E-03

 starting cula (device interface) test...
   runtime:    6788.867     ms
   gflops:    97.21799
   error:   4.5987219E-03

 Process             3  of             4  took GPU:             1
 cula + pgfortran test (matrix solve)
   array size:         10000  by         10000
   right hand sides:             1

 starting cpu test...
   runtime:    33557.15     ms
   gflops:    19.66794
   error:   4.8381262E-03

 starting cula (host interface) test...
   runtime:    8348.127     ms
   gflops:    79.05965
   error:   4.5987219E-03

 starting cula (device interface) test...
   runtime:    6695.790     ms
   gflops:    98.56940
   error:   4.5987219E-03
bsb3166



Joined: 27 Jun 2011
Posts: 10

Posted: Tue Sep 13, 2011 6:25 pm

mkcolg wrote:
Quote:
However, from your test results, I see the performance of the host CULA routine on CPUs 0, 1, 2, 3 is 43.9, 42.8, 137.2, and 99.9 Gflops. That is far below the performance when running on one GPU. Why does the performance drop so much? It is supposed to hold steady.
This is because I was running 4 MPI processes but only have 2 GPUs. When I run with only 2 MPI processes, I get the expected 350 Gflops.

With your test code, I get similar results: there is a slight slowdown going from 1 to 2 MPI processes, and a much larger slowdown when going to 4, since I only have 2 GPUs. Why you're seeing such a large slowdown, I'm not sure. Granted, I'm using MPICH rather than OpenMPI, but I doubt that's the cause.

- Mat


Thank you for your reply.

Is the mpirun you use the one in the PGI compiler subdirectory, "/opt/pgi/linux86-64/2010/mpi/mpich/bin/mpirun", or an MPICH you built elsewhere?

I tried to do it the same way you did, but I got the following errors when I ran it.

Code:
pgf90 -Mmpi -fast -o test_cula test_cula.cuf -L/opt/pgi/linux86-64/2010/mpi/mpich/lib/ -L/opt/development/gpu/current/cula/lib64 -lcula_pgfortran -lcula -lcublas -L/opt/pgi/linux86-64/2011/acml/4.4.0/lib -lacml


Code:
-bash-3.2$ cat run3
/opt/pgi/linux86-64/2010/mpi/mpich/bin/mpirun -np 3 ./test_cula
-bash-3.2$ ./run3
p1_22929:  p4_error: interrupt SIGFPE: 8
p2_22997:  p4_error: interrupt SIGFPE: 8
p0_22918:  p4_error: interrupt SIGx: 13
p0_22918: (7.636719) net_send: could not write to fd=4, errno = 32


Do you know how to solve this problem?

Thank you in advance.