PGI User Forum
 SearchSearch   MemberlistMemberlist     RegisterRegister   ProfileProfile    Log inLog in 

CUDA-x86.

3D grids

 
Post new topic   Reply to topic    PGI User Forum Forum Index -> Accelerator Programming
View previous topic :: View next topic  
Author Message
jand



Joined: 17 Aug 2008
Posts: 57

PostPosted: Thu Jul 05, 2012 2:00 pm    Post subject: 3D grids Reply with quote

Hello,

It is my understanding that compute capability 2.x supports 3D grids (pgaccelinfo indicates this for my card).
Is it correct that pgfortran only supports a 2D grid (z-index must be equal to 1)? If so, will a future release support 3D grids?

Thanks, Jan
Back to top
View user's profile
mkcolg



Joined: 30 Jun 2004
Posts: 6141
Location: The Portland Group Inc.

PostPosted: Thu Jul 05, 2012 3:38 pm    Post subject: Reply with quote

Hi Jan,

The allowable kernel schedule isn't a PGI restriction, rather it has more to with the CUDA driver runtime environment and your device. Hence, if your card and driver supports 3D Grids, then you can use 3D Grids in your CUDA Fortran code.

Hope this helps,
Mat
Back to top
View user's profile
jand



Joined: 17 Aug 2008
Posts: 57

PostPosted: Fri Jul 06, 2012 8:14 am    Post subject: Reply with quote

Hi Mat,

Do you know which driver supports 3D grids? My device (GTX580) supports is according to pgaccelinfo. In my code, I can define a 3D grid and the kernel launches and finishes without error message but the result is wrong. Some debugging in emu mode showed that the blockidx%z never counts above 1, even though griddim%z is 2.

I got concerned because in the guide is states on page 7 that " The value of blockidx%z is always one."

Thanks, Jan


:~$ pgaccelinfo
CUDA Driver Version: 4020
NVRM version: NVIDIA UNIX x86_64 Kernel Module 295.41 Fri Apr 6 23:18:58 PDT 2012

Device Number: 0
Device Name: GeForce GTX 580
Device Revision Number: 2.0
Global Memory Size: 1609760768
Number of Multiprocessors: 16
Number of Cores: 512
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 32768
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 65535 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 1544 MHz
Execution Timeout: No
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: No
Memory Clock Rate: 2004 MHz
Memory Bus Width: 384 bits
L2 Cache Size: 786432 bytes
Max Threads Per SMP: 1536
Async Engines: 1
Unified Addressing: Yes
Initialization time: 1203421 microseconds
Current free memory: 1535369216
Upload time (4MB): 2255 microseconds (1587 ms pinned)
Download time: 3341 microseconds (2321 ms pinned)
Upload bandwidth: 1860 MB/sec (2642 MB/sec pinned)
Download bandwidth: 1255 MB/sec (1807 MB/sec pinned)
Back to top
View user's profile
mkcolg



Joined: 30 Jun 2004
Posts: 6141
Location: The Portland Group Inc.

PostPosted: Fri Jul 06, 2012 10:32 am    Post subject: Reply with quote

Hi Jan,

Your driver version is fine. Also, I just tried a trivial example that seems to work. The CUDA version does need to be 4.0, so if you're using PGI 11.6 to 11.9, add the flag "-Mcuda=cuda4.0". The 12.x compilers use CUDA 4.0 by default so should just work.

Note that it does appear our emulation mode still assumes that the Z dimension is always 1. I've added a problem report (TPR#18799) and sent it to our engineers.

Here's my trivial example:
Code:
% cat test3D.cuf
module test3D

    integer, device, allocatable, dimension(:,:,:) :: Ad
    integer, allocatable, dimension(:,:,:) :: A
   
contains

   attributes(global) subroutine test3Dsub(l,m,n)

     integer, value :: l,m,n
     integer ix,iy,iz

     ix = (blockidx%x-1) * blockdim%x + threadidx%x
     iy = (blockidx%y-1) * blockdim%y + threadidx%y
     iz = (blockidx%z-1) * blockdim%z + threadidx%z

     if (ix .le. l .and. iy .le. m .and. iz .le. n) then
   Ad(ix,iy,iz) = iz
     endif
     
   end subroutine test3Dsub

end module test3D

program test
  use cudafor
  use test3D

  integer l,m,n
  type(dim3) :: blocks
  type(dim3) :: threads
   
  l=64
  m=128
  n=64

  threads = dim3(8,8,8)
  blocks = dim3(l/8, m/8, n/8)

  allocate(A(l,m,n), Ad(l,m,n))
  Ad=0
  call test3Dsub <<<blocks,threads>>>(l,m,n)
  A=Ad
  print *, A(45,23,:)   

end program test
% pgf90 test3D.cuf -V12.5 ; a.out
            1            2            3            4            5            6
            7            8            9           10           11           12
           13           14           15           16           17           18
           19           20           21           22           23           24
           25           26           27           28           29           30
           31           32           33           34           35           36
           37           38           39           40           41           42
           43           44           45           46           47           48
           49           50           51           52           53           54
           55           56           57           58           59           60
           61           62           63           64
% pgaccelinfo
CUDA Driver Version:           4020
NVRM version: NVIDIA UNIX x86_64 Kernel Module  295.59  Wed Jun  6 21:19:40 PDT 2012

Device Number:                 0
Device Name:                   Tesla C2070
Device Revision Number:        2.0
Global Memory Size:            6441598976
Number of Multiprocessors:     14
Number of Cores:               448
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           32768
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       65535 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1147 MHz
Execution Timeout:             No
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  default
Concurrent Kernels:            Yes
ECC Enabled:                   No
Memory Clock Rate:             1494 MHz
Memory Bus Width:              384 bits
L2 Cache Size:                 786432 bytes
Max Threads Per SMP:           1536
Async Engines:                 2
Unified Addressing:            Yes
Initialization time:           1174550 microseconds
Current free memory:           6367338496
Upload time (4MB):              968 microseconds ( 712 ms pinned)
Download time:                 1042 microseconds ( 672 ms pinned)
Upload bandwidth:              4332 MB/sec (5890 MB/sec pinned)
Download bandwidth:            4025 MB/sec (6241 MB/sec pinned)
Back to top
View user's profile
jand



Joined: 17 Aug 2008
Posts: 57

PostPosted: Fri Jul 06, 2012 11:00 am    Post subject: Reply with quote

Hi Mat,

thanks, adding the flag cuda4.0 did the trick on 11.9. I can confirm that the emu mode does not count beyond 1.

Jan
Back to top
View user's profile
Display posts from previous:   
Post new topic   Reply to topic    PGI User Forum Forum Index -> Accelerator Programming All times are GMT - 7 Hours
Page 1 of 1

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © phpBB Group