brush
Posted: Tue May 21, 2013 3:39 pm

Hi again. At the moment I'm splitting the work of a loop between two GPUs, using OMP to do so. I'm not expecting this to work well, for reasons you have mentioned, but I'm still curious.
Previously I accelerated this loop with just a PGI compute region, and another time with just OMP. Both seemed to work.
Now, to achieve my goal I have added some code, and changed the do loop slightly:
Code:
call omp_set_num_threads(2)
!$OMP PARALLEL PRIVATE(myid)
myid = OMP_GET_THREAD_NUM()
call acc_set_device_num(myid, acc_device_nvidia)
!$OMP END PARALLEL
chunk=mm/2
!$OMP PARALLEL PRIVATE(exp1,eyp,wx0,wx1,wy0,
!$OMP& wy1,m,i,j,rhog,vfac,kap,th,xt,yt)
!$acc kernels loop
do 100 m=i*chunk+1,i*chunk+chunk ! previously was do 100 m=1, mm
The second PRIVATE statement is the same one I used in my "just omp" implementation. However, in this case, upon compiling I get:
Code:
pgfortran -fast -Msave -mp -acc -Minfo=accel -c slab.f
ppush:
288, Generating present_or_copy(y3(:))
Generating present_or_copy(y1(:))
Generating present_or_copy(x3(:))
Generating present_or_copy(x1(:))
Generating present_or_copyin(u2(:))
Generating present_or_copy(u3(:))
Generating present_or_copy(u1(:))
Generating present_or_copyin(ey(:,:))
Generating present_or_copyin(ex(:,:))
Generating present_or_copyin(rwx(:lr))
Generating present_or_copyin(x2(:))
Generating present_or_copyin(rwy(:lr))
Generating present_or_copyin(y2(:))
Generating present_or_copyin(mu(:))
Generating present_or_copyin(w2(:))
Generating present_or_copy(w1(:))
Generating present_or_copy(w3(:))
Generating NVIDIA code
Generating compute capability 1.0 binary
Generating compute capability 2.0 binary
Generating compute capability 3.0 binary
289, Complex loop carried dependence of 'u1' prevents parallelization
Complex loop carried dependence of 'u3' prevents parallelization
Loop carried dependence due to exposed use of 'u3(:)' prevents parallelization
Complex loop carried dependence of 'x1' prevents parallelization
Complex loop carried dependence of 'x3' prevents parallelization
Loop carried dependence due to exposed use of 'x3(:)' prevents parallelization
Complex loop carried dependence of 'y1' prevents parallelization
Complex loop carried dependence of 'y3' prevents parallelization
Loop carried dependence due to exposed use of 'y3(:)' prevents parallelization
Loop carried dependence due to exposed use of 'u1(:)' prevents parallelization
Loop carried dependence due to exposed use of 'x1(:)' prevents parallelization
Loop carried dependence due to exposed use of 'y1(:)' prevents parallelization
Complex loop carried dependence of 'w1' prevents parallelization
Complex loop carried dependence of 'w3' prevents parallelization
Loop carried dependence due to exposed use of 'w3(:)' prevents parallelization
Loop carried dependence due to exposed use of 'w1(:)' prevents parallelization
Accelerator kernel generated
295, !$acc loop vector(128) ! threadidx%x
Loop is parallelizable
and the code crashes at run time. I am wondering: why would these dependencies pop up when they did not appear in the case with just the PGI directives? Looking at the code, I do not see an actual dependency, as demonstrated by the working "just directives" version.
Thanks,
Ben

mkcolg
Posted: Wed May 22, 2013 9:59 am

Hi Ben,
Quote:
why would these dependencies pop up when they did not appear in the case with just the PGI directives?

Sorry, I can't tell from just this. Can you post a bit more of the code, including how these variables are being used? To work around it, you can add the "independent" clause to your "kernels" directive.
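Roughly, the clause would sit on the existing directive; here is a minimal sketch (it only asserts independence to the compiler, so it is not a safe fix if a real dependence exists):
Code:
c     sketch: assert that the iterations of the particle loop are
c     independent so the compiler skips its dependence analysis
!$acc kernels loop independent
      do 100 m=i*chunk+1,i*chunk+chunk
c        ... loop body unchanged ...
 100  continue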
Quote:
!$OMP PARALLEL PRIVATE(exp1,eyp,wx0,wx1,wy0,
!$OMP& wy1,m,i,j,rhog,vfac,kap,th,xt,yt)
!$acc kernels loop
do 100 m=i*chunk+1,i*chunk+chunk ! previously was do 100 m=1, mm

Is this just a snippet from which you've cut out code? If not, I'm wondering whether "i" is uninitialized.
- Mat

brush
Posted: Wed May 22, 2013 1:10 pm

Quote:
Is this just a snippet from which you've cut out code? If not, I'm wondering whether "i" is uninitialized.

Oh wow, you're right! I can't believe I forgot this. I meant to have myid instead of i. Changing this got rid of the unwanted dependencies. Thanks, Mat.
Now, at run time I still get some errors, which change from run to run. Perhaps you could tell me what some of these mean:
First run:
Segmentation fault (core dumped)
Second run:
call to cuMemHostUnregister returned error 713: Other
Error: _mp_pcpu_reset: lost thread
Third run:
call to cuMemcpyDtoHAsync returned error 700: Launch failed
Error: _mp_pcpu_reset: lost thread
In case it matters, the new compiler feedback and the full subroutine are both included below. myid is defined before the acc directive, so I think that is okay. Also, the number 8404992 is just chunk, as my total particle number is twice that, so that seems okay. Adding copy clauses to the code so that the entire arrays are copied still produced the same errors.
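For reference, explicit whole-array data clauses of the kind just described might look roughly like the sketch below; the copy/copyin/copyout split mirrors the compiler feedback that follows, and the exact clause list is illustrative rather than a record of what was actually tried:
Code:
c     sketch only: whole-array data clauses on the compute construct,
c     overriding the compiler's computed subarray bounds
!$acc kernels loop copy(u1,x1,y1,w1) copyin(u2,x2,y2,w2,mu)
!$acc& copyin(ex,ey,rwx,rwy) copyout(u3,x3,y3,w3)
      do 100 m=myid*chunk+1,myid*chunk+chunk
c        ... loop body unchanged ...
 100  continue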
Code:
pgfortran -fast -Msave -mp -acc -Minfo=accel -c slab.f
ppush:
288, Generating present_or_copyout(y3(myid*8404992+1:myid*8404992+8404992))
Generating present_or_copy(y1(myid*8404992+1:myid*8404992+8404992))
Generating present_or_copyout(x3(myid*8404992+1:myid*8404992+8404992))
Generating present_or_copy(x1(myid*8404992+1:myid*8404992+8404992))
Generating present_or_copyin(u2(myid*8404992+1:myid*8404992+8404992))
Generating present_or_copyout(u3(myid*8404992+1:myid*8404992+8404992))
Generating present_or_copy(u1(myid*8404992+1:myid*8404992+8404992))
Generating present_or_copyin(ey(:,:))
Generating present_or_copyin(ex(:,:))
Generating present_or_copyin(rwx(:lr))
Generating present_or_copyin(x2(myid*8404992+1:myid*8404992+8404992))
Generating present_or_copyin(rwy(:lr))
Generating present_or_copyin(y2(myid*8404992+1:myid*8404992+8404992))
Generating present_or_copyin(mu(myid*8404992+1:myid*8404992+8404992))
Generating present_or_copyin(w2(myid*8404992+1:myid*8404992+8404992))
Generating present_or_copy(w1(myid*8404992+1:myid*8404992+8404992))
Generating present_or_copyout(w3(myid*8404992+1:myid*8404992+8404992))
Generating NVIDIA code
Generating compute capability 1.0 binary
Generating compute capability 2.0 binary
Generating compute capability 3.0 binary
289, Loop is parallelizable
Accelerator kernel generated
289, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
295, Loop is parallelizable
Code:
c
subroutine ppush
use OMP_LIB
include 'slab.h'
real exp1,eyp
real wx0,wx1,wy0,wy1
integer m,i,j,myid,chunk
real rhog,vfac
real kap,th
c
! hardcoded for 2 threads
call omp_set_num_threads(2)
c
!$OMP PARALLEL PRIVATE(myid)
myid = OMP_GET_THREAD_NUM()
call acc_set_device_num(myid, acc_device_nvidia)
!$OMP END PARALLEL
chunk=mm/2
dv=dx*dy
!$OMP PARALLEL PRIVATE(exp1,eyp,wx0,wx1,wy0,
!$OMP& wy1,m,i,j,rhog,vfac,kap,th,xt,yt)
!$acc kernels loop
do 100 m=myid*chunk+1,myid*chunk+chunk
c
exp1=0.
eyp=0.
rhog=sqrt(mu(m))/mims
do l=1,lr
xt=x2(m)+rwx(l)*rhog
yt=y2(m)+rwy(l)*rhog
if(xt.gt.lx) xt=2.*lx-xt
if(xt.lt.0.) xt=-xt
if(xt.eq.lx) xt=0.9999*lx
if(yt.ge.ly) yt=yt-ly
if(yt.le.0.) yt=yt+ly
if(yt.eq.ly) yt=0.9999*ly
if (ngp.eq.1) then
i=int(xt/dx+0.5)
j=int(yt/dy+0.5)
exp1=exp1 + ex(i,j)
eyp=eyp + ey(i,j)
else
i=int(xt/dx)
j=int(yt/dy)
c
wx0=float(i+1)-xt/dx
wx1=1.-wx0
wy0=float(j+1)-yt/dy
wy1=1.-wy0
c
exp1=exp1 + wx0*wy0*ex(i,j) + wx1*wy0*ex(i+1,j)
% + wx0*wy1*ex(i,j+1) + wx1*wy1*ex(i+1,j+1)
c
eyp=eyp + wx0*wy0*ey(i,j) + wx1*wy0*ey(i+1,j)
% + wx0*wy1*ey(i,j+1) + wx1*wy1*ey(i+1,j+1)
endif
enddo ! end loop over 4 pt avg
exp1=exp1/float(lr)
eyp=eyp/float(lr)
c
c
c LINEAR: epara=0. for no e para.
c
th=theta
c1 th=(x2(m)-0.5*lx)/ls
u3(m)=u1(m)+ epara*2.*dt*q*mims*(eyp*th)
c
vfac=0.5*( u2(m)*u2(m) + mu(m) )*mims/tets
c vfac=0.5*( u2(m)*u2(m) - 1.0 )*mims/tets
c
kap=( kapn -(1.5-vfac)*kapt )
c kap=( kapn + vfac*kapt )
c
c LINEAR: next 3 lines are commented out if linear...
c
x3(m)=x1(m)+ 2.*dt*( ecrossb*eyp )
y3(m)=y1(m)+ 2.*dt*( u2(m)*th + ecrossb*(-exp1) )
c
u1(m)=u2(m) + .25*( u3(m) - u1(m) )
x1(m)=x2(m) + .25*( x3(m) - x1(m) )
y1(m)=y2(m) + .25*( y3(m) - y1(m) )
if(x3(m).gt.lx) x3(m)=2.*lx-x3(m)
if(x3(m).lt.0.) x3(m)=-x3(m)
if(x3(m).eq.lx) x3(m)=0.9999*lx
if(y3(m).ge.ly) y3(m)=y3(m)-ly
if(y3(m).le.0.) y3(m)=y3(m)+ly
if(y3(m).eq.ly) y3(m)=0.9999*ly
c
c now, calculate weight for linearized case
c
c---------weigthing delft-f (dependent on ldtype)----
w3(m)=w1(m) + 2.*dt*( eyp*kap+q*tets*
% (th*eyp*u2(m)) )
% *(1-w2(m))
c
w1(m)=w2(m) + .25*( w3(m) - w1(m) )
c
100 continue
!$OMP END PARALLEL
90 continue
c
return
end

mkcolg
Posted: Wed May 22, 2013 1:40 pm

myid isn't private in the second parallel region, so this might be the source of the new error. Try merging the two regions:
Code:
chunk=mm/2
dv=dx*dy
!$OMP PARALLEL PRIVATE(myid,exp1,eyp,wx0,wx1,wy0,
!$OMP& wy1,m,i,j,rhog,vfac,kap,th,xt,yt)
myid = OMP_GET_THREAD_NUM()
call acc_set_device_num(myid, acc_device_nvidia)
!$acc kernels loop
do 100 m=myid*chunk+1,myid*chunk+chunk
You might need to set the environment variable PGI_ACC_SYNCHRONOUS=1, especially if you're using an earlier 13.x compiler.
- Mat

brush
Posted: Wed May 22, 2013 4:28 pm

Thanks Mat, that mostly did the trick (I get the correct output now; it just crashes occasionally, though less often with PGI_ACC_SYNCHRONOUS=1).
I had another question regarding something you said earlier:
Quote:
The bottleneck here is the PCIe bus. Both GPUs will share this one bus, which essentially serializes your data copies.

For a majority of my copies, I only have to transfer the portion of the array that is used for half of the loop, so it seems I still transfer about the same amount of data overall. Is it that transferring half of the array still takes about the same time as transferring the entire array?
Ben