Reproducibility of atomic operations

OpenACC and CUDA Fortran

Post by sumseq » Mon Oct 28, 2019 6:36 pm

Hi,

If I have the following code:

Code:

!$acc kernels default(present) present(a,sum0,sum1)
!$acc loop independent
        do k=2,npm1
!$acc loop independent
          do i=1,nrm1
!$acc atomic
            sum0(i)=sum0(i)+a%r(i,2,k)*dph(k)*pl_i*two
          enddo
        enddo
!$acc end kernels

Should I expect the same answer each time I run it, or is there a chance the atomics are done in a different order each time so my results will vary due to order-of-sum floating point variations?

I ask because the code this is part of returns answers that differ at the floating-point round-off level every time I run it, even on the same GPU with the same code.

If so, is there some kind of environment variable I can set to make the compiler/run-time compute the atomics in a reproducible way (even if slower)?

Thanks,

- Ron


Post by mkcolg » Tue Oct 29, 2019 7:56 am

Hi Ron,
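> or is there a chance the atomics are done in a different order each time so my results [...]

The order in which CUDA threads run is non-deterministic, so the atomics can be executed in a different order each time the code is run. An atomic only guarantees that each update to sum0(i) completes without interference from other threads; it does not fix the order of the updates. Since floating-point addition is not associative, a different order can give a slightly different rounded result.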
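As a small illustration of why the order matters, here is a standalone sketch (not taken from Ron's code; the array `x` and its values are made up for the example). It adds the same single-precision numbers forward and backward. On most compilers at default optimization settings the two sums differ in their last few digits, which is the same kind of variation that reordered atomics can produce.

```fortran
program sum_order
  implicit none
  integer, parameter :: n = 100000
  real :: x(n), fwd, bwd
  integer :: i

  ! Hypothetical data: terms of widely varying magnitude
  do i = 1, n
    x(i) = 1.0 / real(i)
  end do

  ! Add the terms from largest to smallest
  fwd = 0.0
  do i = 1, n
    fwd = fwd + x(i)
  end do

  ! Add the same terms from smallest to largest
  bwd = 0.0
  do i = n, 1, -1
    bwd = bwd + x(i)
  end do

  print *, 'forward  sum:', fwd
  print *, 'backward sum:', bwd
  print *, 'difference  :', fwd - bwd
end program sum_order
```

Both loops are mathematically the same sum; only the rounding at each step differs.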

Is this the full loop? If so, you may want to interchange the loops and run only "i" in parallel. Each thread then sums one element of sum0 sequentially over "k", in the same order on every run, which helps with reproducibility. You can also try a vector reduction across "k", but you may encounter a similar issue, though with fewer threads per reduction the rounding differences may not be as bad.

Code:

!$acc kernels loop default(present) present(a,sum0,sum1)
    do i=1,nrm1
        sum = 0
! Optional: change "!acc" to "!$acc" on the next line to enable a vector reduction over k
!acc loop vector reduction(+:sum)
        do k=2,npm1
            sum=sum+a%r(i,2,k)*dph(k)*pl_i*two
        enddo
        sum0(i) = sum0(i) + sum
    enddo

Hope this helps,
Mat


Post by sumseq » Tue Oct 29, 2019 9:38 am

Hi,

Thanks!

So there is no ENV I can set to force the threads to be deterministic for the atomics for testing?

I have previously tried inverting the loops but the performance went down because the single loop dimension is too small to parallelize well across the GPU.

I have not tried the new array-reduction support yet because I typically only add a new feature to the code when the feature is supported in the latest community edition.
Are array-reductions supported in 19.4?

Thanks!

- Ron


Post by mkcolg » Tue Oct 29, 2019 3:24 pm

Hi Ron,
> So there is no ENV I can set to force the threads to be deterministic for the atomics for testing?

Not that I'm aware of. Scheduling is done by the CUDA driver, and I highly doubt there's a way to force the order in which the threads read and write memory.

> Are array-reductions supported in 19.4?

No, we're still working on adding support for this.

-Mat
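For readers who need run-to-run reproducible sums but find a single "i" loop too small to fill the GPU, one possible alternative that was not proposed in the thread is a two-stage blocked sum. The sketch below is an assumption-laden illustration, not the posters' code: the sizes `nr`, `np`, `nblk`, the plain array `r(nr,np)` (standing in for `a%r(i,2,k)`), and the `partial` work array are all hypothetical. Stage 1 gives each (i, block) pair its own fixed range of "k" to sum sequentially; stage 2 adds the block sums in a fixed order. No atomics are used, so the order of additions does not depend on thread scheduling.

```fortran
program blocked_sum
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  ! Hypothetical sizes, for illustration only
  integer, parameter :: nr = 64, np = 1000, nblk = 16
  real(dp) :: r(nr, np), dph(np), sum0(nr), partial(nr, nblk)
  real(dp) :: s
  integer :: i, k, b, k0, k1, blen

  ! Deterministic made-up input data
  do k = 1, np
    dph(k) = 1.0_dp / real(k, dp)
    do i = 1, nr
      r(i, k) = real(i, dp) / real(i + k, dp)
    end do
  end do
  sum0 = 0.0_dp
  blen = (np + nblk - 1) / nblk      ! fixed block length over k

  !$acc data copyin(r, dph) copy(sum0) create(partial)

  ! Stage 1: each (i,b) pair sums its own fixed range of k sequentially
  !$acc parallel loop collapse(2) private(s, k0, k1)
  do b = 1, nblk
    do i = 1, nr
      k0 = (b - 1) * blen + 1
      k1 = min(b * blen, np)
      s = 0.0_dp
      !$acc loop seq
      do k = k0, k1
        s = s + r(i, k) * dph(k)
      end do
      partial(i, b) = s
    end do
  end do

  ! Stage 2: combine the block sums in a fixed order
  !$acc parallel loop private(s)
  do i = 1, nr
    s = 0.0_dp
    !$acc loop seq
    do b = 1, nblk
      s = s + partial(i, b)
    end do
    sum0(i) = sum0(i) + s
  end do

  !$acc end data

  print *, 'sum0(1) =', sum0(1)
end program blocked_sum
```

The result is reproducible for a fixed `nblk`, but it can still differ in the last digits from the atomic version or from a run with a different `nblk`, because the grouping of additions changes. A larger `nblk` exposes more parallelism at the cost of a larger `partial` array. Without OpenACC enabled the directives are plain comments, so the program also runs serially for comparison.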
