PGI User Forum

accelerate a single loop with mpi and gpu
 
brush



Joined: 26 Jun 2012
Posts: 44

Posted: Tue May 21, 2013 3:39 pm

Hi again. At the moment I'm splitting the work of a loop between two GPUs, using OpenMP to do so. I don't expect this to perform well, for the reasons you have mentioned, but I'm still curious.

Previously I accelerated this loop with just a PGI compute region, and another time with just OMP. Both seemed to work.

Now, to achieve my goal I have added some code, and changed the do loop slightly:

Code:

         call omp_set_num_threads(2)

!$OMP PARALLEL PRIVATE(myid)
             myid = OMP_GET_THREAD_NUM()
             call acc_set_device_num(myid, acc_device_nvidia)
!$OMP END PARALLEL

             chunk=mm/2

!$OMP PARALLEL PRIVATE(exp1,eyp,wx0,wx1,wy0,
!$OMP&                 wy1,m,i,j,rhog,vfac,kap,th,xt,yt)
!$acc kernels loop
             do 100 m=i*chunk+1,i*chunk+chunk ! previously was do 100 m=1, mm


The second PRIVATE clause is the same one I used in my OpenMP-only implementation. However, in this case, compiling gives:

Code:

pgfortran -fast -Msave -mp -acc -Minfo=accel  -c slab.f
ppush:
    288, Generating present_or_copy(y3(:))
         Generating present_or_copy(y1(:))
         Generating present_or_copy(x3(:))
         Generating present_or_copy(x1(:))
         Generating present_or_copyin(u2(:))
         Generating present_or_copy(u3(:))
         Generating present_or_copy(u1(:))
         Generating present_or_copyin(ey(:,:))
         Generating present_or_copyin(ex(:,:))
         Generating present_or_copyin(rwx(:lr))
         Generating present_or_copyin(x2(:))
         Generating present_or_copyin(rwy(:lr))
         Generating present_or_copyin(y2(:))
         Generating present_or_copyin(mu(:))
         Generating present_or_copyin(w2(:))
         Generating present_or_copy(w1(:))
         Generating present_or_copy(w3(:))
         Generating NVIDIA code
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
    289, Complex loop carried dependence of 'u1' prevents parallelization
         Complex loop carried dependence of 'u3' prevents parallelization
         Loop carried dependence due to exposed use of 'u3(:)' prevents parallelization
         Complex loop carried dependence of 'x1' prevents parallelization
         Complex loop carried dependence of 'x3' prevents parallelization
         Loop carried dependence due to exposed use of 'x3(:)' prevents parallelization
         Complex loop carried dependence of 'y1' prevents parallelization
         Complex loop carried dependence of 'y3' prevents parallelization
         Loop carried dependence due to exposed use of 'y3(:)' prevents parallelization
         Loop carried dependence due to exposed use of 'u1(:)' prevents parallelization
         Loop carried dependence due to exposed use of 'x1(:)' prevents parallelization
         Loop carried dependence due to exposed use of 'y1(:)' prevents parallelization
         Complex loop carried dependence of 'w1' prevents parallelization
         Complex loop carried dependence of 'w3' prevents parallelization
         Loop carried dependence due to exposed use of 'w3(:)' prevents parallelization
         Loop carried dependence due to exposed use of 'w1(:)' prevents parallelization
         Accelerator kernel generated
        295, !$acc loop vector(128) ! threadidx%x
         Loop is parallelizable


and the code crashes at run time. I am wondering: why would these dependencies pop up when they did not appear on the case with just the PGI directives? Looking at the code, I do not see an actual dependency, as demonstrated by the working "just directives" version.

Thanks,
Ben
mkcolg



Joined: 30 Jun 2004
Posts: 6218
Location: The Portland Group Inc.

Posted: Wed May 22, 2013 9:59 am

Hi Ben,

Quote:
why would these dependencies pop up when they did not appear on the case with just the PGI directives?
Sorry, I can't tell from just this. Can you post a bit more of the code, including how these variables are being used? To work around it you can add the "independent" clause to your "kernels" directive.
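
As a minimal sketch of that workaround (a hypothetical stand-alone program, not the ppush loop; the names independent_demo, a, b and n are made up for illustration), the clause can be placed on the combined directive as shown here. Keep in mind that "independent" is only an assertion to the compiler: it does not check the claim, so results can be wrong if the iterations really do depend on each other. The sketch uses free-form source; when compiled without -acc the directive is just a comment and the loop runs on the host.

```fortran
program independent_demo
  implicit none
  integer, parameter :: n = 1000     ! hypothetical problem size
  real :: a(n), b(n)
  integer :: m

  b = 1.0

  ! Assert that no iteration reads or writes data touched by another one.
  !$acc kernels loop independent
  do m = 1, n
     a(m) = 2.0*b(m)
  end do

  ! Simple self-check so a wrong result does not go unnoticed.
  if (any(a /= 2.0)) stop 'unexpected result'
  print *, 'ok'
end program independent_demo
```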

Quote:
!$OMP PARALLEL PRIVATE(exp1,eyp,wx0,wx1,wy0,
!$OMP& wy1,m,i,j,rhog,vfac,kap,th,xt,yt)
!$acc kernels loop
do 100 m=i*chunk+1,i*chunk+chunk ! previously was do 100 m=1, mm
Is this just a snippet and you've cut out code? If not, then I'm wondering if "i" is uninitialized.

- Mat
brush



Joined: 26 Jun 2012
Posts: 44

Posted: Wed May 22, 2013 1:10 pm

Quote:

Is this just a snippet and you've cut out code? If not, then I'm wondering if "i" is uninitialized.


Oh wow, you're right! Can't believe I forgot this. I meant to use myid instead of i. Changing this got rid of the unwanted dependencies. Thanks Mat.

Now, at run time I still get some errors, which change from run to run. Perhaps you could tell me what some of these mean:

First run:
Segmentation fault (core dumped)

Second run:
call to cuMemHostUnregister returned error 713: Other
Error: _mp_pcpu_reset: lost thread

Third run:
call to cuMemcpyDtoHAsync returned error 700: Launch failed
Error: _mp_pcpu_reset: lost thread

In case it matters, here are the new compiler feedback and the full subroutine. myid is defined before the acc directive, so I think that is okay. Also, the number 8404992 is just chunk; my total particle number is twice that, so that seems okay. Adding copy clauses to the code so that the entire arrays are copied still gave the same errors.
Code:
pgfortran -fast -Msave -mp -acc -Minfo=accel  -c slab.f
ppush:
    288, Generating present_or_copyout(y3(myid*8404992+1:myid*8404992+8404992))
         Generating present_or_copy(y1(myid*8404992+1:myid*8404992+8404992))
         Generating present_or_copyout(x3(myid*8404992+1:myid*8404992+8404992))
         Generating present_or_copy(x1(myid*8404992+1:myid*8404992+8404992))
         Generating present_or_copyin(u2(myid*8404992+1:myid*8404992+8404992))
         Generating present_or_copyout(u3(myid*8404992+1:myid*8404992+8404992))
         Generating present_or_copy(u1(myid*8404992+1:myid*8404992+8404992))
         Generating present_or_copyin(ey(:,:))
         Generating present_or_copyin(ex(:,:))
         Generating present_or_copyin(rwx(:lr))
         Generating present_or_copyin(x2(myid*8404992+1:myid*8404992+8404992))
         Generating present_or_copyin(rwy(:lr))
         Generating present_or_copyin(y2(myid*8404992+1:myid*8404992+8404992))
         Generating present_or_copyin(mu(myid*8404992+1:myid*8404992+8404992))
         Generating present_or_copyin(w2(myid*8404992+1:myid*8404992+8404992))
         Generating present_or_copy(w1(myid*8404992+1:myid*8404992+8404992))
         Generating present_or_copyout(w3(myid*8404992+1:myid*8404992+8404992))
         Generating NVIDIA code
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
    289, Loop is parallelizable
         Accelerator kernel generated
        289, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
    295, Loop is parallelizable


Code:
c
          subroutine ppush

          use OMP_LIB

          include 'slab.h'


          real exp1,eyp
          real wx0,wx1,wy0,wy1
          integer m,i,j,myid,chunk
          real rhog,vfac
          real kap,th
c
! hardcoded for 2 threads
         call omp_set_num_threads(2)
c
!$OMP PARALLEL PRIVATE(myid)
             myid = OMP_GET_THREAD_NUM()
             call acc_set_device_num(myid, acc_device_nvidia)
!$OMP END PARALLEL
             chunk=mm/2
             dv=dx*dy
!$OMP PARALLEL PRIVATE(exp1,eyp,wx0,wx1,wy0,
!$OMP&                 wy1,m,i,j,rhog,vfac,kap,th,xt,yt)
!$acc kernels loop
             do 100 m=myid*chunk+1,myid*chunk+chunk
c
                exp1=0.
                eyp=0.

          rhog=sqrt(mu(m))/mims
          do l=1,lr

                xt=x2(m)+rwx(l)*rhog
                yt=y2(m)+rwy(l)*rhog

              if(xt.gt.lx) xt=2.*lx-xt
              if(xt.lt.0.)  xt=-xt
              if(xt.eq.lx)  xt=0.9999*lx
              if(yt.ge.ly) yt=yt-ly
              if(yt.le.0.)  yt=yt+ly
              if(yt.eq.ly)  yt=0.9999*ly


        if (ngp.eq.1) then
                i=int(xt/dx+0.5)
                j=int(yt/dy+0.5)
                exp1=exp1 + ex(i,j)
                eyp=eyp + ey(i,j)

        else
                i=int(xt/dx)
                j=int(yt/dy)
c
                wx0=float(i+1)-xt/dx
                wx1=1.-wx0
                wy0=float(j+1)-yt/dy
                wy1=1.-wy0
c
            exp1=exp1 + wx0*wy0*ex(i,j) + wx1*wy0*ex(i+1,j)
     %      + wx0*wy1*ex(i,j+1) + wx1*wy1*ex(i+1,j+1)
c
            eyp=eyp + wx0*wy0*ey(i,j) + wx1*wy0*ey(i+1,j)
     %      + wx0*wy1*ey(i,j+1) + wx1*wy1*ey(i+1,j+1)

        endif
        enddo ! end loop over 4 pt avg
          exp1=exp1/float(lr)
          eyp=eyp/float(lr)
c
c
c   LINEAR: epara=0. for no e para.
c
             th=theta
c1             th=(x2(m)-0.5*lx)/ls
             u3(m)=u1(m)+ epara*2.*dt*q*mims*(eyp*th)
c
             vfac=0.5*( u2(m)*u2(m) + mu(m) )*mims/tets
c             vfac=0.5*( u2(m)*u2(m) - 1.0 )*mims/tets
c
             kap=( kapn -(1.5-vfac)*kapt )
c            kap=( kapn + vfac*kapt )
c
c    LINEAR: next 3 lines are commented out if linear...
c
            x3(m)=x1(m)+ 2.*dt*( ecrossb*eyp )
            y3(m)=y1(m)+ 2.*dt*( u2(m)*th + ecrossb*(-exp1) )
c
            u1(m)=u2(m)  + .25*( u3(m) - u1(m) )
            x1(m)=x2(m)  + .25*( x3(m) - x1(m) )
            y1(m)=y2(m)  + .25*( y3(m) - y1(m) )

            if(x3(m).gt.lx) x3(m)=2.*lx-x3(m)
            if(x3(m).lt.0.) x3(m)=-x3(m)
            if(x3(m).eq.lx) x3(m)=0.9999*lx
            if(y3(m).ge.ly)  y3(m)=y3(m)-ly
            if(y3(m).le.0.)  y3(m)=y3(m)+ly
            if(y3(m).eq.ly)  y3(m)=0.9999*ly

c
c    now, calculate weight for linearized case
c
c---------weigthing delft-f (dependent on ldtype)----

            w3(m)=w1(m) + 2.*dt*( eyp*kap+q*tets*
     %      (th*eyp*u2(m)) )
     %      *(1-w2(m))
c
            w1(m)=w2(m)  + .25*( w3(m) - w1(m) )
c
  100     continue
!$OMP END PARALLEL
   90     continue
c
          return
          end
mkcolg



Joined: 30 Jun 2004
Posts: 6218
Location: The Portland Group Inc.

Posted: Wed May 22, 2013 1:40 pm

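myid isn't private in the second parallel region, so this might be the source of the new error: the value each thread assigned to its private copy in the first region does not carry over, and in the second region all threads see the same shared myid. Try merging the two regions.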
Code:

            chunk=mm/2
             dv=dx*dy
!$OMP PARALLEL PRIVATE(myid,exp1,eyp,wx0,wx1,wy0,
!$OMP&                 wy1,m,i,j,rhog,vfac,kap,th,xt,yt)
             myid = OMP_GET_THREAD_NUM()
             call acc_set_device_num(myid, acc_device_nvidia)

!$acc kernels loop
             do 100 m=myid*chunk+1,myid*chunk+chunk


You might need to set the environment variable PGI_ACC_SYNCHRONOUS=1, especially if you're using an earlier 13.x compiler.
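
For reference, here is a minimal, self-contained sketch of the merged-region pattern (free-form source; the program name split_loop, the array a, the tiny particle count mm = 11 and the helper variables nthr, lo and hi are all hypothetical, not taken from slab.f). Each thread computes its own myid and loop bounds inside the same parallel region that contains the loop. The chunk size is rounded up so that an odd mm does not drop the last element, which chunk=mm/2 would do. The device selection call is left as a comment so the sketch also runs on a machine without GPUs, and lines starting with "!$ " are only compiled when OpenMP is enabled (-mp).

```fortran
program split_loop
  !$ use omp_lib
  implicit none
  integer, parameter :: mm = 11      ! hypothetical, deliberately odd
  integer :: myid, nthr, chunk, lo, hi, m
  real :: a(mm)

  a = 0.0

  !$ call omp_set_num_threads(2)
  !$omp parallel private(myid, nthr, chunk, lo, hi, m)
  ! Serial defaults, overridden below when OpenMP is enabled.
  myid = 0
  nthr = 1
  !$ myid = omp_get_thread_num()
  !$ nthr = omp_get_num_threads()

  ! One GPU per thread would be selected here, e.g.
  !   call acc_set_device_num(myid, acc_device_nvidia)

  ! Round the chunk size up, then clamp the last chunk to mm.
  chunk = (mm + nthr - 1)/nthr
  lo = myid*chunk + 1
  hi = min(mm, lo + chunk - 1)

  !$acc kernels loop
  do m = lo, hi
     a(m) = real(m)
  end do
  !$omp end parallel

  ! Every element must have been written by exactly one thread.
  do m = 1, mm
     if (a(m) /= real(m)) stop 'element not updated'
  end do
  print *, 'all', mm, 'elements updated'
end program split_loop
```

Because the bounds are derived from the thread count actually granted inside the region, the same source covers the whole range whether it runs with one thread or two.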

- Mat
brush



Joined: 26 Jun 2012
Posts: 44

Posted: Wed May 22, 2013 4:28 pm

Thanks Mat, that mostly did the trick (I get the correct output now; it just crashes occasionally, but less often with PGI_ACC_SYNCHRONOUS=1).

I had another question with regard to something you said earlier:
Quote:
The bottleneck here is the PCIe bus. Both GPUs will share this one bus, which essentially serializes your data copies.


For the majority of my copies, I only have to transfer the portion of the array that is used for half of the loop. So it seems I still transfer about the same amount of data overall. Is it that transferring half of the array still takes about the same time as transferring the entire array?
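
To put rough numbers on this, here is a small sketch that only counts bytes, based on the -Minfo output above: per GPU, 9 array slices go host to device (5 copyin plus 4 copy) and 8 come back (4 copyout plus 4 copy), each of length chunk = 8404992. The small ex, ey, rwx and rwy arrays are ignored, and 4-byte reals are an assumption (no -r8 appears on the compile line shown). The program and variable names are made up for illustration.

```fortran
program transfer_volume
  implicit none
  integer, parameter :: i8 = selected_int_kind(12)   ! wide enough for byte counts
  integer(i8), parameter :: chunk = 8404992_i8       ! slice length from -Minfo
  integer(i8), parameter :: real_bytes = 4_i8        ! ASSUMPTION: 4-byte default real
  integer(i8), parameter :: n_h2d = 9_i8             ! copyin (5) + copy (4) slices
  integer(i8), parameter :: n_d2h = 8_i8             ! copyout (4) + copy (4) slices
  integer(i8) :: per_gpu, two_gpus, one_gpu

  ! Two GPUs, each moving half of every particle array.
  per_gpu  = (n_h2d + n_d2h)*chunk*real_bytes
  two_gpus = 2_i8*per_gpu

  ! One GPU moving the whole arrays.
  one_gpu  = (n_h2d + n_d2h)*(2_i8*chunk)*real_bytes

  print *, 'bytes per GPU (half arrays):   ', per_gpu
  print *, 'bytes, two GPUs combined:      ', two_gpus
  print *, 'bytes, one GPU (whole arrays): ', one_gpu
  if (two_gpus /= one_gpu) stop 'totals differ'
end program transfer_volume
```

The two totals are identical, so if the copies to both GPUs are serialized on one shared bus, as the quote says, splitting the arrays would not by itself shorten the total copy time. The two-GPU case also sends ex, ey, rwx and rwy in full to each device, which adds a little on top.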

Ben

All times are GMT - 7 Hours