PGI User Forum


atomicadd for double precision in CUDA Fortran
 
PGI User Forum Forum Index -> Programming and Compiling
tlstar



Joined: 31 Mar 2011
Posts: 22

PostPosted: Sun Apr 10, 2011 2:51 am    Post subject:

Done!

tlstar wrote:
I think the most likely reason for the different results between emulation (threads run serially) and the CUDA release build is conflicting writes to device memory.


Rounding may not be handled identically on the GPU and the CPU.
That error then propagates through the random number generation algorithm.

So even for Fermi, the Tesla C2050 is not fully IEEE 754 compliant?


Ref. from CUDA wiki:
For double precision (for GPUs supporting CUDA compute capability 1.3 and above[12]) there are some deviations from the IEEE 754 standard: round-to-nearest-even is the only supported rounding mode for reciprocal, division, and square root. In single precision, denormals and signalling NaNs are not supported; only two IEEE rounding modes are supported (chop and round-to-nearest even), and those are specified on a per-instruction basis rather than in a control word; and the precision of division/square root is slightly lower than single precision.


The RNG algorithm used here is a classic one, something like the following.

Algorithm 1: Combined Multiple Recursive Generator (L'Ecuyer's MRG32k3a)

For i = 3 to n
    Xi = (1403580*Xi-2 - 810728*Xi-3) mod 4294967087
    Yi = (527612*Yi-1 - 1370589*Yi-3) mod 4294944443
    Zi = (Xi - Yi) mod 4294967087
    If (Zi > 0) then Ui = Zi/4294967088
    If (Zi = 0) then Ui = 4294967087/4294967088
End For


Last edited by tlstar on Mon Apr 11, 2011 7:28 am; edited 1 time in total
tlstar



Joined: 31 Mar 2011
Posts: 22

PostPosted: Mon Apr 11, 2011 2:40 am    Post subject:

Hi Mat,

I have made two updates in my code:
1. Compile cell_loo_GPU.F90 with "-O0"
   to avoid vector optimization of the "do" loop that calls the kernel; this settled the 4*14*BLOCK_SIZE problem.
2. Round off the random seeds in aleatoire_init_GPU
   so that emulation and the GPU produce exactly the same random numbers.
!===============================================================================

SUBROUTINE aleatoire_init_GPU(nblock)

...............
! normalize the seeds
random(1:3,:) = AINT(random(1:3,:)*m1,8)
random(4:6,:) = AINT(random(4:6,:)*m2,8)
! write(0,*) "random seeds inited"
! write(0,*) random(:,1:10)
...............

END SUBROUTINE aleatoire_init_GPU

!===============================================================================

But the results are still not the same between the GPU and emulation.
Since we do not have a debug tool for GPU Fortran, it is really painful to investigate.

I hope a GPU Fortran debugger becomes available soon. Otherwise the cost of debugging the code will be greater than translating it into C.

Could you tell me how to set the "be" tool (.gpu to .ptx) to "-O0"?

gfwang
tlstar



Joined: 31 Mar 2011
Posts: 22

PostPosted: Mon Apr 11, 2011 4:34 am    Post subject:

Bug report:

In the kernel Fortran source file:
Code:
cos_theta = 1.0d0 - 2.d0 * nb_aleatoire(randseed)


Compiled by pgfortran into low-level C:

Code:
cos_theta = (1.00000000000000000E+0)-((nb_aleatoire((signed char*)(_prandseed)))+(nb_aleatoire((signed char*)(_prandseed))));


.....

Notice that nb_aleatoire is a function whose value depends on its INTENT(INOUT) argument randseed, so the two calls return different values and the translation is not equivalent.
Furthermore, I do not understand why the compiler would optimize this at all. As we all know, a multiplication (*) is no slower than an addition on a GPU or a modern CPU, and both are much faster than an extra function call.

Quote:
NOTE: your trial license will expire in 3 days, 12.7 hours.


I think I should be awarded a longer-term trial license of the pgfortran compiler for my bug-digging work on the compiler itself.
tlstar



Joined: 31 Mar 2011
Posts: 22

PostPosted: Mon Apr 11, 2011 7:26 am    Post subject:

Bug report (or feature request) 2:

All GPU constant variables are initialized to 0 or 0.000 after pgfortran compiles them into low-level C code (.gpu), ignoring the user-defined initial values.

Code:
DOUBLE PRECISION, constant :: epsilon_paroi = 0.5


into

Code:
__constant__ struct{
int* m0;long long m8; ............................ ;double m2600;double m2608;
}__align__(16) _raycast_gpukernel_17 = { 0,0,......,-0.000000 };
mkcolg



Joined: 30 Jun 2004
Posts: 6119
Location: The Portland Group Inc.

PostPosted: Mon Jul 18, 2011 3:34 pm    Post subject:

Hi gfwang,

FYI, support for floating point atomics (TPR#17778) was added a few releases ago (sorry for the late update). The only caveat is that you need a device that supports CC2.0 to use them.

- Mat
Page 4 of 5

 


Powered by phpBB © phpBB Group