PGI User Forum

atomicadd for double precision in CUDA Fortran
tlstar
Posted: Fri Apr 08, 2011 5:24 am

Hi Mat,

Thanks a lot.

I've now managed to make it work. Several changes were needed:

1. Rewrite some expressions in the kernel code. For example:
Code:
 
        point_in(:) = 0.99D0 * A_dev(:, point) &
          + 0.01D0/4.0D0 * (A_dev(:,B_dev(1,cell)) + A_dev(:,B_dev(2,cell)) + A_dev(:,B_dev(3,cell)) + A_dev(:,B_dev(4,cell)))


was changed to:

Code:
        ! Fetch the four indices into scalar temporaries first,
        ! then use them to index A_dev.
        point1 = B_dev(1,cell)
        point2 = B_dev(2,cell)
        point3 = B_dev(3,cell)
        point4 = B_dev(4,cell)
        point_in(:) = 0.99D0 * A_dev(:, point) &
          + 0.01D0/4.0D0 * (A_dev(:,point1) + A_dev(:,point2) + A_dev(:,point3) + A_dev(:,point4))

A_dev is a DOUBLE PRECISION device array; B_dev is an INTEGER device array.

2. Disable optimization for the pgfortran link step in the makefile:

Code:
F90     = pgfortran -module $(MODULE) -Mmpi=mpich1 -Mcuda -Mbyteswapio
OPT     = -O3 -tp nehalem-64
GPUOPT  = -O3 -ta=nvidia:cuda3.2 -Mcuda=keepbin -Mcuda=keepptx -Mcuda=ptxinfo -v
LN_OPT  = -O0
#
# compilation rules
#
.SUFFIXES : .CUF .c .f .f90 .mod .F90
.mod.o :
    $(F90) $(OPT) $(INCLUDES) -o $*.o -c $*.f90
.CUF.o :
    $(F90) $(GPUOPT) $(INCLUDES) -o $*.o -c $*.CUF
.f90.o :
    $(F90) $(OPT) $(INCLUDES) -o $*.o -c $*.f90
.F90.o :
    $(F90) $(OPT) $(INCLUDES) -o $*.o -c $*.F90
.c.o :
    $(CC)  $(INCLUDES) -o $*.o -c $*.c
#   $(CC)  -O3 $(INCLUDES) -ffast-math -o $*.o -c $*.c
#  compilation
#
$(TARGET) : $(OBJECTS_F90) $(OBJECTS_C)
    $(F90) $(LN_OPT) $(OBJECTS_C) $(OBJECTS_F90) -o $@
#



If I set LN_OPT = -O3 (the link-step flags) instead, the run stops with the same error. Maybe there is a bug in the pgfortran link step (strange). I will try to provide my code to you for testing.

3. Change the block and grid parameters used to launch the kernel:
Code:
CALL raycast<<<ATOMIC_RAYS/BLOCK_SIZE,BLOCK_SIZE>>>(point_dev, cell_dev, simul_dev,energy_inter_dev)


where the parameters are defined as:
Code:
INTEGER, PARAMETER :: BLOCK_SIZE  = 128
INTEGER, PARAMETER :: ATOMIC_RAYS = 14*BLOCK_SIZE


But if I increase ATOMIC_RAYS to a larger value, for example 2*14*BLOCK_SIZE, the code stops with the same error.
I do not know much about these two parameters.
Should BLOCK_SIZE be 32, 64, 96, 128, 256 ... 1024? But is it also limited by registers?

During compilation I see the register usage below: 122 registers and 63 registers. Which is the actual value? And what are lmem and smem? Do they limit the total number of threads?

Code:
....
ptxas info    : Compiling entry function 'raycast' for 'sm_13'
ptxas info    : Used 122 registers, 168+0 bytes lmem, 40+16 bytes smem, 2768 bytes cmem[0], 120 bytes cmem[1], 4 bytes cmem[14]
....
ptxas info    : Compiling entry function 'raycast' for 'sm_20'
ptxas info    : Used 63 registers, 8+0 bytes lmem, 72 bytes cmem[0], 2768 bytes cmem[2], 4 bytes cmem[14], 40 bytes cmem[16]
  0 inform,   0 warnings,   0 severes, 0 fatal for ..cuda_fortran_constructor_1
PGF90/x86-64 Linux 11.3-0: compilation successful



And my GPU:
Code:
 One CUDA device found

 Device Number: 0
   Device name: Tesla M2050
   Compute Capability: 2.0
   Number of Multiprocessors: 14
   Number of Cores: 448
   Max Clock Rate (kHz): 1147000
   Warpsize: 32

    Execution Configuration Limits
      Maximum Grid Dimensions: 65535 x 65535 x 1
      Maximum Block Dimensions: 1024 x 1024 x 64
      Maximum Threads per Block: 1024

    Off-Chip Memory
      Total Global Memory (B): 2817982464
      Total Constant Memory (B): 65536
      Maximum Memory Pitch for Copies (B): 2147483647
      Integrated: No

    On-Chip Memory
      Shared Memory per Multiprocessor (B): 49152
      Number of Registers per Multiprocessor: 32768


How do I choose the best block size and grid parameters for a kernel launch, assuming there is enough work to execute?


Quote:
Possible, but given the code you had posted earlier, it's more likely a programming error. I would need a reproducing example to be sure. Please feel free to send the code to PGI Customer Service (trs@pgroup.com) and ask them to send it to me.


Thanks a lot. It would be great if you could help review the code.
I will try, though the program needs a big input file (200 MB) to run.



Quote:

Though, since it is running on the CPU, there can still be differences compared with running on the GPU. For example, on a CPU, if you write beyond the end of an array the code most likely won't seg fault. You may stomp on another variable's data and cause other problems, but not a seg fault. On the GPU, accessing memory even one element beyond the end of an array will trigger a seg fault. Adding array bounds checking (-Mbounds) in emulation mode should help find these errors.


I have used -Mbounds ever since I read your posts (long ago) in the forum. It does not seem to be an array bounds problem, at least not a simple bounds-checking issue.

In emulation mode it works and gets exactly the same results as the CPU version. With the changes above it now runs in CUDA, but the results are not right.
The code does contain a random number part. I store the random seeds (6 double precision numbers) for each thread in device memory: each thread loads its seed at the start and saves it back at the end. I do not know whether this is a wise strategy; a minimal sketch of the pattern is below.
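To illustrate, here is a minimal sketch of that pattern (mc_mod, mc_kernel, and seed_dev are placeholder names, not my actual code):

Code:
MODULE mc_mod
   USE cudafor
CONTAINS
   ATTRIBUTES(GLOBAL) SUBROUTINE mc_kernel(seed_dev)
      DOUBLE PRECISION, DEVICE :: seed_dev(:,:)  ! shape (6, nthreads): one seed set per thread
      DOUBLE PRECISION :: randseed(6)            ! thread-private working copy
      INTEGER :: id_thread
      id_thread = (blockIdx%x - 1) * blockDim%x + threadIdx%x
      randseed(:) = seed_dev(:, id_thread)       ! load this thread's seed at the start
      ! ... Monte Carlo work that uses and updates randseed ...
      seed_dev(:, id_thread) = randseed(:)       ! save the updated seed at the end
   END SUBROUTINE mc_kernel
END MODULE mc_mod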

Thanks again for your help. And thanks in advance for your comments.
mkcolg
Posted: Fri Apr 08, 2011 10:53 am

Hi tlstar,
Quote:

Maybe there is a bug in the pgfortran link step (strange). I will try to provide my code to you for testing.
Possible, but without a reproducible example I can't tell. Though, you just sent in the code to PGI Customer Service, so I'll take a look.

Quote:
But if I increase ATOMIC_RAYS to a larger value, for example 2*14*BLOCK_SIZE, the code stops with the same error.
You can have 64K blocks so 28 should be fine. If anything it's too small.

Quote:
Should BLOCK_SIZE be 32, 64, 96, 128, 256 ... 1024? But is it also limited by registers?
The block size will depend upon many factors. Registers are one factor, shared memory use, your algorithm, and the device being used are others.

Quote:
During compilation I see the register usage: 122 registers and 63 registers. Which is the actual value?
The compiler is generating two separate device binaries: one targeting devices with compute capability 1.3 and the other targeting 2.0 devices. The 122 registers is for CC1.3 devices and 63 is for CC2.0. The -Mcuda=ccXX flags let you specify which CC binary to generate if you don't need both targets.

Note that currently your code can use up to 16384 registers per block. So with a block size of 128 threads and 122 registers used per thread, you are using 15616 registers per block. Hence, you most likely don't want to use a larger block size unless you're using a Fermi (CC2.0) card.
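To make the arithmetic concrete, here it is as a few lines of Fortran (an illustration only, not something the compiler does for you):

Code:
PROGRAM reg_calc
   ! Largest block size the CC1.3 register file allows, rounded
   ! down to a multiple of the warp size.
   INTEGER, PARAMETER :: REGS_PER_BLOCK  = 16384   ! register file on CC1.3
   INTEGER, PARAMETER :: REGS_PER_THREAD = 122     ! from the ptxas output
   INTEGER, PARAMETER :: WARP = 32
   INTEGER :: max_threads
   max_threads = (REGS_PER_BLOCK / REGS_PER_THREAD / WARP) * WARP
   PRINT *, 'max usable BLOCK_SIZE =', max_threads  ! prints 128
   ! 128 * 122 = 15616 <= 16384, so BLOCK_SIZE=128 just fits;
   ! 160 threads (160 * 122 = 19520 registers) would not.
END PROGRAM reg_calc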

Quote:
And what are lmem and smem? Do they limit the total number of threads?
Local memory (per thread) and shared memory (per block). 'cmem' is constant memory. The output from 'pgaccelinfo' will list the maximums.

Quote:
How do I choose the best block size and grid parameters for a kernel launch?
Dr. Michael Wolfe likes to joke that there is a PhD topic here if you can find a generic solution for finding the optimal schedule. Right now the state-of-the-art solution is to try all possible combinations, which of course is impractical. Instead it's more of an art that comes from experience and trial and error. You need to base it on what's best for your algorithm, how your memory is laid out, available device resources, etc.

Your kernel is fairly simple, thread-wise, so you should be able to just try a few BLOCK_SIZE values to find what gives you the best performance.
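For example, a host-side timing loop along these lines (a sketch; NTOT is a stand-in for however much total work you launch, and the kernel arguments are from your earlier post) lets you compare a few block sizes empirically:

Code:
USE cudafor
INTEGER, PARAMETER :: NTOT = 14*512              ! placeholder total thread count
INTEGER, PARAMETER :: candidates(4) = (/ 64, 128, 256, 512 /)
TYPE(cudaEvent)    :: t0, t1
REAL               :: msec
INTEGER            :: istat, i, bs

istat = cudaEventCreate(t0)
istat = cudaEventCreate(t1)
DO i = 1, SIZE(candidates)
   bs = candidates(i)
   istat = cudaEventRecord(t0, 0)
   CALL raycast<<<NTOT/bs, bs>>>(point_dev, cell_dev, simul_dev, energy_inter_dev)
   istat = cudaEventRecord(t1, 0)
   istat = cudaEventSynchronize(t1)
   istat = cudaEventElapsedTime(msec, t0, t1)    ! elapsed kernel time in ms
   PRINT *, 'BLOCK_SIZE =', bs, '  time (ms) =', msec
END DO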

Quote:
I have used -Mbounds ever since I read your posts (long ago) in the forum. It does not seem to be an array bounds problem, at least not a simple bounds-checking issue.
It's the first thing to check since it's a common error, especially when the code "works" in emulation but fails on the GPU.

Quote:
The code does contain a random number part. I store the random seeds (6 double precision numbers) for each thread in device memory: each thread loads its seed at the start and saves it back at the end. I do not know whether this is a wise strategy.
I'll need to look at the code, but yes it could be a problem. It sounds like you may be creating a dependency.

- Mat
tlstar
Posted: Fri Apr 08, 2011 12:48 pm

Done!

mkcolg wrote:

Quote:
But if I increase ATOMIC_RAYS to a larger value, for example 2*14*BLOCK_SIZE, the code stops with the same error.
You can have 64K blocks so 28 should be fine. If anything it's too small.


Please help me by testing ATOMIC_RAYS = 2*14*BLOCK_SIZE. It leads to an error:
0: copyout Memcpy (host=0x7c9a70, dev=0x1e720000, size=28672) FAILED: 4(unspecified launch failure)
Why? It works with ATOMIC_RAYS = 14*BLOCK_SIZE.

It works with BLOCK_SIZE=64 and ATOMIC_RAYS = 2*14*BLOCK_SIZE, but fails with BLOCK_SIZE=64 and ATOMIC_RAYS = 4*14*BLOCK_SIZE. It seems the total number of threads is limited.

Quote:
During compilation I see the register usage: 122 registers and 63 registers. Which is the actual value?
The compiler is generating two separate device binaries: one targeting devices with compute capability 1.3 and the other targeting 2.0 devices. The 122 registers is for CC1.3 devices and 63 is for CC2.0. The -Mcuda=ccXX flags let you specify which CC binary to generate if you don't need both targets.

Note that currently your code can use up to 16384 registers per block. So with a block size of 128 threads and 122 registers used per thread, you are using 15616 registers per block. Hence, you most likely don't want to use a larger block size unless you're using a Fermi (CC2.0) card.


I'm using a Tesla M2050, so the CC2.0 binary is loaded automatically?
Device Number: 0
Device name: Tesla M2050
Compute Capability: 2.0
Number of Multiprocessors: 14
Number of Cores: 448
Max Clock Rate (kHz): 1147000
Warpsize: 32

Quote:
And what are lmem and smem? Do they limit the total number of threads?
Local memory (per thread) and shared memory (per block). 'cmem' is constant memory. The output from 'pgaccelinfo' will list the maximums.


It would be better if pgfortran allowed the type "INTEGER, CONSTANT, ALLOCATABLE, DIMENSION(:) ::". In my code, a lot of data is read-only for the kernel code and written only by the host, but its length depends on the input files. Are CONSTANT variables faster to access? For now I work around the restriction by padding to a fixed maximum size, as in the sketch below.
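This is just an illustration (gpu_const, MAXB, B_c, and nb_c are made-up names, and MAXB is a guessed upper bound), not my real code:

Code:
MODULE gpu_const
   USE cudafor
   INTEGER, PARAMETER :: MAXB = 2048       ! compile-time upper bound
   INTEGER, CONSTANT  :: B_c(4, MAXB)      ! fixed-size constant-memory array
   INTEGER, CONSTANT  :: nb_c              ! number of entries actually used
END MODULE gpu_const

! Host side, after reading the input file (requires nb <= MAXB):
!    B_c(:, 1:nb) = B_host(:, 1:nb)        ! host-to-constant-memory copy
!    nb_c = nb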

Quote:
Your kernel is fairly simple, thread-wise, so you should be able to just try a few BLOCK_SIZE values to find what gives you the best performance.


Thanks. I hope I can get a good speedup. However, quite a few extra point lookups in device memory are involved. Could you comment on the optimization?

Quote:
The code does contain a random number part. I store the random seeds (6 double precision numbers) for each thread in device memory: each thread loads its seed at the start and saves it back at the end. I do not know whether this is a wise strategy.
I'll need to look at the code, but yes, it could be a problem. It sounds like you may be creating a dependency.


Maybe. Note that I store ATOMIC_RAYS sets of random seeds, generated by the Fortran RANDOM_NUMBER() intrinsic. I think this intrinsic is based on a bit-shifting method, not the same as the function I implemented. If you can recommend a better way to do RNG (low cost and fewer seeds), please do.

Thanks!


tlstar
Posted: Fri Apr 08, 2011 1:00 pm

Done!

I think the most likely reason for the different results between emulation (threads run serially) and the CUDA release build is a conflict on writes to device memory.

But I only have two write operations:
1. energy_inter(id_thread) = energy_per_realisation
2. CALL aleatoire_save_GPU(id_thread, randseed)

with id_thread = (blockIdx%x - 1) * blockDim%x + threadIdx%x

That should already avoid any write conflicts; see the guarded sketch below.
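For reference, here is the write pattern with an explicit bounds guard added (a sketch; the guard only matters if the grid can cover more threads than there are array elements):

Code:
id_thread = (blockIdx%x - 1) * blockDim%x + threadIdx%x
IF (id_thread <= ATOMIC_RAYS) THEN      ! guard in case the grid overshoots
   energy_inter(id_thread) = energy_per_realisation
   CALL aleatoire_save_GPU(id_thread, randseed)
END IF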



Also, the total number of threads cannot be increased, which is really strange to me.


tlstar
Posted: Sat Apr 09, 2011 5:46 am

tlstar wrote:
mkcolg wrote:

Quote:
But if I increase ATOMIC_RAYS to a larger value, for example 2*14*BLOCK_SIZE, the code stops with the same error.
You can have 64K blocks so 28 should be fine. If anything it's too small.


Please help me by testing ATOMIC_RAYS = 2*14*BLOCK_SIZE. It leads to an error:
0: copyout Memcpy (host=0x7c9a70, dev=0x1e720000, size=28672) FAILED: 4(unspecified launch failure)
Why? It works with ATOMIC_RAYS = 14*BLOCK_SIZE.

It works with BLOCK_SIZE=64 and ATOMIC_RAYS = 2*14*BLOCK_SIZE, but fails with BLOCK_SIZE=64 and ATOMIC_RAYS = 4*14*BLOCK_SIZE. It seems the total number of threads is limited.


It's because I have a loop that calls the kernel subroutine. The vectorization done by pgfortran -O3 seems to reorganize that loop, so several kernel launches may be issued together by the compiler's improper optimization, which leads to the conflicts.

Temporary solution: compile the code that calls the kernel with pgfortran -O0. An alternative I may try is an explicit synchronize after each launch; see the sketch below.
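This should be independent of the host optimization level (a sketch; n_iter and istat are placeholder names):

Code:
DO iter = 1, n_iter
   CALL raycast<<<ATOMIC_RAYS/BLOCK_SIZE, BLOCK_SIZE>>>(point_dev, cell_dev, &
                                                        simul_dev, energy_inter_dev)
   istat = cudaThreadSynchronize()   ! block the host until this launch finishes
   ! ... host work that consumes this iteration's results ...
END DO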