PGI User Forum
 SearchSearch   MemberlistMemberlist     RegisterRegister   ProfileProfile    Log inLog in 

CUDA-x86.

block sizes/if loops give inconsistent results

 
Post new topic   Reply to topic    PGI User Forum Forum Index -> Programming and Compiling
View previous topic :: View next topic  
Author Message
amitamritkar



Joined: 02 Oct 2009
Posts: 11

PostPosted: Fri Mar 15, 2013 12:23 pm    Post subject: block sizes/if loops give inconsistent results Reply with quote

Hi, I have a GPU Kernel (part of a much larger code) which is called from CPU with different block dimensions as shown below.

The 2 block/grid sizes that I used are as follows,
Case # 1 dimblock = dim3(960,1,1) dimgrid = dim3(320,1,1)
Case # 2 dimblock = dim3(480,1,1) dimgrid = dim3(640,1,1)

I only get correct results (=CPU result) for case # 2. Case # 1 gives completely incorrect results.

I know that there are no thread bound checks in the code below and I should ideally get segmentation errors but that doesn't happen. When I put in the thread bound checks, the situation is still the same (only case # 2 gives correct results).

For both the cases iorth=1. So when I remove the iorth=0 loop from the kernel, I get the correct results for both the cases. This seems to be a very strange behavior and I am wondering if there is something that I am missing here in the code and are there any runtime check flags?

To compile the code, I use the flags, -fast -Mr8 -Minfo=all -Mcuda=5.0 -ta=nvidia,cc20
Code:
c Kernel Call from CPU
      call kernel_pc<<<dimGrid,dimBlock>>>(d_phi,d_rhs,d_phi1,
     1         ni,nj,nk,mba,m,iorth,unrelax)

Code:
c Device kernel
      attributes(global)
     1 subroutine kernel_pc(phi,rhs,phi1,ni,nj,nk,mba,
     1                                m,iorth,unrelax)

      implicit none
      real,device,dimension(ni,nj,nk,mba),intent(in)::phi,rhs
      real,device,dimension(ni,nj,nk,mba),intent(inout)::phi1
      integer,value,intent(in) :: ni,nj,nk,mba,m,iorth
      real,value,intent(in)::unrelax
      integer :: i,j,k,im1,ip1,jp1,jm1,kp1,km1,nb,ii,jj,kk,
     1           ie,je,ke

      nb = blockidx%x
      ii = threadidx%x
      i = dev_i_b_blk(nb,m)+ii-1
      im1=i-1; ip1=i+1
      jj = threadidx%y
      j = dev_j_b_blk(nb,m)+jj-1
      jm1=j-1; jp1=j+1
      kk = threadidx%z
      k = dev_k_b_blk(nb,m)+kk-1
      km1=k-1; kp1=k+1

      if(iorth.eq.0)then
      phi1(i,j,k,m) = (1. - unrelax) * phi(i,j,k,m)
     1              + unrelax * ( rhs(i,j,k,m) +
     1               ( dev_ap(1,i,j,k,m)*phi(i,jp1,k,m)
     1                +dev_ap(2,i,j,k,m)*phi(i,jm1,k,m)
     1                +dev_ap(3,i,j,k,m)*phi(ip1,j,k,m)
     1                +dev_ap(4,i,j,k,m)*phi(im1,j,k,m)
     1                +dev_ap(5,i,j,k,m)*phi(i,j,kp1,m)
     1                +dev_ap(6,i,j,k,m)*phi(i,j,km1,m)
     1                +dev_ap(7,i,j,k,m)*phi(ip1,jp1,k,m)
     1                +dev_ap(8,i,j,k,m)*phi(im1,jp1,k,m)
     1                +dev_ap(9,i,j,k,m)*phi(ip1,jm1,k,m)
     1               ) * dev_sps(i,j,k,m)
     1              ) / dev_ap(19,i,j,k,m)
      call syncthreads()
      endif

      if(iorth.eq.1)then
      phi1(i,j,k,m) = (1. - unrelax) * phi(i,j,k,m)
     1              + unrelax * ( rhs(i,j,k,m) +
     1               ( dev_ap(1,i,j,k,m)*phi(i,jp1,k,m)
     1                +dev_ap(2,i,j,k,m)*phi(i,jm1,k,m)
     1                +dev_ap(3,i,j,k,m)*phi(ip1,j,k,m)
     1                +dev_ap(4,i,j,k,m)*phi(im1,j,k,m)
     1                +dev_ap(5,i,j,k,m)*phi(i,j,kp1,m)
     1                +dev_ap(6,i,j,k,m)*phi(i,j,km1,m)
     1               ) * dev_sps(i,j,k,m)
     1              ) / dev_ap(19,i,j,k,m)
      call syncthreads()
      endif
      end subroutine
Back to top
View user's profile
mkcolg



Joined: 30 Jun 2004
Posts: 4996
Location: The Portland Group Inc.

PostPosted: Mon Mar 18, 2013 8:47 am    Post subject: Reply with quote

Hi amitamritkar,

My first guess would be that it's an optimization issue. What happens if you compile with "-O0" instead of "-fast"?

- Mat
Back to top
View user's profile
amitamritkar



Joined: 02 Oct 2009
Posts: 11

PostPosted: Mon Mar 18, 2013 11:19 am    Post subject: Reply with quote

Mat,

Even with -O0 optimization flag, I don't get the correct results. (Case 1 gives incorrect numbers).
Any other suggestions?

Thanks,
Amit
Back to top
View user's profile
mkcolg



Joined: 30 Jun 2004
Posts: 4996
Location: The Portland Group Inc.

PostPosted: Mon Mar 18, 2013 11:29 am    Post subject: Reply with quote

Ok, then it's a problem with your code and not optimization. Exactly what, I'm not sure.

Can you send a reproducing example to PGI Customer Service (trs@pgroup.com) and ask them to forward it to me?

Note that I'll be attending the NVIDIA GTC conference this week so it may be a few days before I can investigate.

- Mat
Back to top
View user's profile
Display posts from previous:   
Post new topic   Reply to topic    PGI User Forum Forum Index -> Programming and Compiling All times are GMT - 7 Hours
Page 1 of 1

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © 2001, 2002 phpBB Group