PGI User Forum


32x32 block size problem

jand



Joined: 17 Aug 2008
Posts: 57

Posted: Fri Mar 09, 2012 12:29 pm    Post subject: 32x32 block size problem

Hi,

I am currently writing some simple CUDA Fortran code (I'm just learning CUDA Fortran) and came across some strange behavior with regard to the grid and block configuration. The card I am using is a GTX 580, which should allow a block size of 32x32 (1024 threads).

My code works fine for almost any block size (e.g., 8x8, 8x16, 8x32, 15x32, 16x32). However, when I choose something larger than 16x32 (e.g., 17x32), the code returns all zeros. I did some debugging but just can't seem to figure out why this should happen.

Is there something I am missing about this card? Can I actually not go beyond a block size of 512 threads?

Any help is greatly appreciated.

Jan

P.S.
The code just computes a matrix, where I work out the matrix indices with
Code:

  i = (blockidx%x-1) * blockdim%x + threadidx%x
  j = (blockidx%y-1) * blockdim%y + threadidx%y


And my grid is defined:
Code:

dimGrid = dim3( NX/NBLX, NY/NBLY, 1 )
dimBlock = dim3( NBLX, NBLY, 1 )

NX and NY set the size of the matrix, and NBLX and NBLY set the block size.
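
A side note on the grid definition above: NX/NBLX is integer division, so the grid only covers the whole matrix when NX is a multiple of NBLX (and likewise for NY). With block widths such as 15 or 17 that is easy to violate. The following is a minimal sketch of a rounded-up grid with a bounds guard in the kernel. The kernel name fill_matrix, the module name, the array names, and the matrix size are hypothetical and not taken from Jan's code.

```fortran
module fill_mod
  use cudafor
  implicit none
contains
  ! Hypothetical kernel: writes i + j into each element of an nx-by-ny matrix.
  attributes(global) subroutine fill_matrix(a, nx, ny)
    integer, value :: nx, ny
    real, device :: a(nx, ny)
    integer :: i, j
    i = (blockidx%x - 1) * blockdim%x + threadidx%x
    j = (blockidx%y - 1) * blockdim%y + threadidx%y
    ! Threads in partially filled edge blocks must not write out of bounds.
    if (i <= nx .and. j <= ny) a(i, j) = real(i + j)
  end subroutine fill_matrix
end module fill_mod

program grid_demo
  use cudafor
  use fill_mod
  implicit none
  integer, parameter :: NX = 1000, NY = 1000   ! assumed matrix size
  integer, parameter :: NBLX = 15, NBLY = 32   ! one of the block shapes from the post
  real, device, allocatable :: a_d(:,:)
  real, allocatable :: a(:,:)
  type(dim3) :: dimGrid, dimBlock
  integer :: ierr

  allocate(a(NX, NY), a_d(NX, NY))
  dimBlock = dim3(NBLX, NBLY, 1)
  ! Round up so that partial blocks at the matrix edges are still launched.
  dimGrid = dim3((NX + NBLX - 1) / NBLX, (NY + NBLY - 1) / NBLY, 1)

  call fill_matrix<<<dimGrid, dimBlock>>>(a_d, NX, NY)
  ierr = cudaGetLastError()
  if (ierr /= cudaSuccess) print *, cudaGetErrorString(ierr)

  a = a_d   ! device-to-host copy
  print *, a(1, 1), a(NX, NY)
  deallocate(a, a_d)
end program grid_demo
```

If NX and NY are always exact multiples of the block dimensions, the original NX/NBLX form gives the same grid and the guard never triggers.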
mkcolg



Joined: 30 Jun 2004
Posts: 6119
Location: The Portland Group Inc.

Posted: Fri Mar 09, 2012 12:37 pm    Post subject:

Hi Jan,

What is the error message returned after the kernel launch? My guess is that you're hitting some other limit, such as the number of registers or the amount of shared memory.

- Mat

To check the error status of a kernel:
Code:
call somekernel<<<blocks, threads>>>(dA, dB, dC)
ierr = cudaGetLastError()          ! status of the launch itself
if (ierr .ne. cudaSuccess) then
   print *, cudaGetErrorString(ierr)
endif
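
The per-block limits mentioned above can also be read from the device at run time. The following is a minimal sketch using cudaGetDeviceProperties from the cudafor module; querying device 0 is an assumption, and the program name is hypothetical.

```fortran
program query_limits
  use cudafor
  implicit none
  type(cudaDeviceProp) :: prop
  integer :: ierr

  ! Device 0 is assumed here; a multi-GPU system may need a different index.
  ierr = cudaGetDeviceProperties(prop, 0)
  if (ierr /= cudaSuccess) then
     print *, cudaGetErrorString(ierr)
     stop
  end if

  print *, 'Device:                  ', trim(prop%name)
  print *, 'Max threads per block:   ', prop%maxThreadsPerBlock
  print *, 'Max block dimensions:    ', prop%maxThreadsDim
  print *, 'Registers per block:     ', prop%regsPerBlock
  print *, 'Shared memory per block: ', prop%sharedMemPerBlock
end program query_limits
```

A launch needs (threads per block) x (registers per thread) to fit within the registers-per-block figure, so a block shape that is legal by thread count alone can still be rejected.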

jand



Joined: 17 Aug 2008
Posts: 57

Posted: Fri Mar 09, 2012 2:59 pm    Post subject:

OK thanks. The error I get is

"too many resources requested for launch"

So it appears your guess is correct.
When I add
-Mcuda=maxregcount:32
to the compile flags, the code runs fine (no error and the correct answer). Does it make sense to set this limit, or is it better to limit the block size?

Jan
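
The numbers in this exchange are consistent with a simple register budget. The sketch below is host-only Fortran and assumes the figures for compute capability 2.0 hardware such as the GTX 580: 32768 registers available to a block and at most 63 registers per thread. It ignores the hardware's register allocation granularity, and the register count of Jan's kernel is not stated in the thread, so this illustrates the arithmetic rather than diagnosing the actual kernel.

```fortran
program reg_budget
  implicit none
  ! Assumed limits for compute capability 2.0 (e.g. GTX 580).
  integer, parameter :: REGS_PER_BLOCK = 32768
  integer, parameter :: MAX_REGS_PER_THREAD = 63
  ! Block widths to examine; the block height is fixed at 32 as in the post.
  integer, parameter :: nblx(4) = (/ 8, 16, 17, 32 /)
  integer, parameter :: NBLY = 32
  integer :: k, nthreads, budget

  do k = 1, size(nblx)
     nthreads = nblx(k) * NBLY
     ! Each thread can use at most this many registers for the launch to fit.
     budget = min(MAX_REGS_PER_THREAD, REGS_PER_BLOCK / nthreads)
     print '(i2,"x",i2," =",i5," threads: at most",i3," registers per thread")', &
          nblx(k), NBLY, nthreads, budget
  end do
end program reg_budget
```

Under these assumptions a 17x32 block (544 threads) leaves 32768/544 = 60 registers per thread, and a 32x32 block (1024 threads) leaves exactly 32, which matches the maxregcount:32 setting that made the launch succeed. A kernel compiled to use more than about 60 registers per thread would therefore run at 16x32 and fail at 17x32.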
mkcolg



Joined: 30 Jun 2004
Posts: 6119
Location: The Portland Group Inc.

Posted: Fri Mar 09, 2012 5:16 pm    Post subject:

Quote:
Does it make sense to set this limit or is it better to limit the block size?

Increasing the number of active threads (i.e., the occupancy) can lead to better performance simply because you are utilising the GPU more. However, restricting the number of registers per thread forces the compiler to spill values out of registers into local memory, which resides in the same off-chip device memory as global memory. If each thread then has to make more memory fetches, that cost may negate the improvement.

Schedule tuning is a bit of a black art, so it's best to try several configurations and see what works for the particular kernel. Be sure to use profiling, either via pgcollect/PGPROF or by setting the environment variable "CUDA_PROFILE=1", to gauge what works best.

- Mat
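
As a rough illustration of "try several and see what works", the sketch below times one kernel over a few block shapes using CUDA events. The kernel scale_kernel, the module name, the matrix size, and the list of block shapes are all hypothetical; a real comparison would use the actual kernel and discard a first warm-up launch.

```fortran
module sweep_mod
  use cudafor
  implicit none
contains
  ! Hypothetical stand-in for the kernel being tuned.
  attributes(global) subroutine scale_kernel(a, nx, ny)
    integer, value :: nx, ny
    real, device :: a(nx, ny)
    integer :: i, j
    i = (blockidx%x - 1) * blockdim%x + threadidx%x
    j = (blockidx%y - 1) * blockdim%y + threadidx%y
    if (i <= nx .and. j <= ny) a(i, j) = 2.0 * a(i, j) + 1.0
  end subroutine scale_kernel
end module sweep_mod

program block_sweep
  use cudafor
  use sweep_mod
  implicit none
  integer, parameter :: NX = 2048, NY = 2048      ! assumed matrix size
  integer, parameter :: bx(4) = (/ 8, 16, 16, 32 /)
  integer, parameter :: by(4) = (/ 8, 16, 32, 32 /)
  real, device, allocatable :: a_d(:,:)
  type(cudaEvent) :: t0, t1
  type(dim3) :: dimGrid, dimBlock
  real :: ms
  integer :: k, ierr, launchErr

  allocate(a_d(NX, NY))
  a_d = 0.0
  ierr = cudaEventCreate(t0)
  ierr = cudaEventCreate(t1)

  do k = 1, size(bx)
     dimBlock = dim3(bx(k), by(k), 1)
     dimGrid = dim3((NX + bx(k) - 1) / bx(k), (NY + by(k) - 1) / by(k), 1)

     ierr = cudaEventRecord(t0, 0)
     call scale_kernel<<<dimGrid, dimBlock>>>(a_d, NX, NY)
     launchErr = cudaGetLastError()   ! catches "too many resources requested"
     ierr = cudaEventRecord(t1, 0)
     ierr = cudaEventSynchronize(t1)  ! wait until the kernel has finished

     if (launchErr /= cudaSuccess) then
        print *, bx(k), 'x', by(k), ': ', trim(cudaGetErrorString(launchErr))
     else
        ierr = cudaEventElapsedTime(ms, t0, t1)
        print *, bx(k), 'x', by(k), ': ', ms, ' ms'
     end if
  end do

  ierr = cudaEventDestroy(t0)
  ierr = cudaEventDestroy(t1)
  deallocate(a_d)
end program block_sweep
```

Running the same sweep with and without a register cap such as -Mcuda=maxregcount:32 shows whether the extra occupancy outweighs the cost of spilling for that particular kernel.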