PGI User Forum


Shared memory race condition despite using syncthreads

 
Robert Link



Joined: 14 Jun 2010
Posts: 9
Location: Goddard Space Flight Center

Posted: Thu Jul 01, 2010 8:44 am    Post subject: Shared memory race condition despite using syncthreads

I appear to be getting a read-after-write race condition on a shared memory variable, despite syncing after the write. Here's the offending snippet:

Code:

    if(tid .eq. 1) then
       if(jeval .eq. 1) then
          jeval  = 0
          hratio = 1.0_rkind
          nslp   = nsteps + MBETWEEN
          drate  = 0.7_rkind
       end if
       l3 = 0
    end if

    call syncthreads()

    call trace(indx,tid,tcount,jeval)

Trace just writes its last argument into a global memory buffer for retrieval by the host. Upon examining the trace, jeval is 0 for thread 1, but still 1 for thread 2. It is as if jeval is not being treated as shared. However, examining the PTX shows the expected shared loads and stores:
Code:

 //<loop> Part of loop body line 1908, head labeled $Lt_1_60674
   .loc   2   662   0
   @%p2 bra    $Lt_1_9986;           // <--- skip if tid .ne. 1
 //<loop> Part of loop body line 1908, head labeled $Lt_1_60674
   .loc   2   663   0
   ld.shared.s32    %r1981, [jeval]; 
   mov.u32    %r1982, 1;
   setp.ne.s32    %p333, %r1981, %r1982;
   @%p333 bra    $Lt_1_10242;          // <--- skip if jeval .ne. 1
 //<loop> Part of loop body line 1908, head labeled $Lt_1_60674
   .loc   2   664   0
   mov.s32    %r1983, 0;
   st.shared.s32    [jeval], %r1983;           // <---- jeval stored as expected
        // other variables omitted
$Lt_1_10242:
 //<loop> Part of loop body line 1908, head labeled $Lt_1_60674
   .loc   2   669   0
   mov.s32    %r1986, 0;
   st.shared.s32    [l3], %r1986;
$Lt_1_9986:
 //<loop> Part of loop body line 1908, head labeled $Lt_1_60674
   .loc   2   671   0
   bar.sync    0;                                        // <----------- barrier
   .loc   2   1908   0
   @%p2 bra    $Lt_1_51458;
 //<loop> Part of loop body line 1908, head labeled $Lt_1_60674
   .loc   2   1909   0
   add.s32    %r1987, %r273, 1;
   mov.s32    %r273, %r1987;
   .loc   2   1910   0
   ld.shared.s32    %r1988, [jeval];        // <--------- jeval loaded as expected
   add.s32    %r1989, %r6, %r1987;


Has anybody ever seen anything similar to this? Any idea what the cause might be?
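
For anyone who wants to try this on their own setup, here is a minimal sketch of the pattern described above, reduced to a single shared flag. It is not the original kernel: the names `repro`, `trace_buf`, and `nthreads` are made up for illustration, the `trace` routine is replaced by a direct write to a global buffer, and a single block of 32 threads is assumed.

```fortran
module repro_m
  use cudafor
  implicit none
contains
  ! Hypothetical reduction of the pattern in the post; not the original kernel.
  attributes(global) subroutine repro(trace_buf)
    integer :: trace_buf(*)      ! global-memory buffer read back by the host
    integer, shared :: jeval     ! one copy per thread block
    integer :: tid

    tid = threadIdx%x

    ! Thread 1 sets the flag; the barrier makes it visible to the whole block.
    if (tid .eq. 1) jeval = 1
    call syncthreads()

    ! Same shape as the snippet in the post: only thread 1 clears the flag.
    if (tid .eq. 1) then
       if (jeval .eq. 1) jeval = 0
    end if

    ! Every thread in the block reaches this barrier.
    call syncthreads()

    ! Stand-in for trace(): each thread records the value it sees.
    trace_buf(tid) = jeval
  end subroutine repro
end module repro_m

program run_repro
  use cudafor
  use repro_m
  implicit none
  integer, parameter :: nthreads = 32   ! assumed block size
  integer, device :: trace_d(nthreads)
  integer :: trace_h(nthreads)

  call repro<<<1, nthreads>>>(trace_d)
  trace_h = trace_d                     ! device-to-host copy waits for the kernel

  if (any(trace_h .ne. 0)) then
     print *, 'at least one thread saw a stale jeval'
  else
     print *, 'all threads saw jeval = 0'
  end if
end program run_repro
```

If a reduction like this behaves correctly, the differences between it and the full kernel are a natural place to look, for example whether every thread in the block reaches the same syncthreads call on every pass through the enclosing loop.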
mkcolg



Joined: 30 Jun 2004
Posts: 6134
Location: The Portland Group Inc.

Posted: Thu Jul 01, 2010 2:29 pm    Post subject:

Hi Robert,

I've sent your question to several others but unfortunately no one has seen this behavior before. One person wondered if your kernel could be failing and asked if you could add a call to 'cudaThreadSynchronize()' and 'cudaGetLastError()' after your kernel launch on the host.
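
A rough sketch of what that host-side check could look like in CUDA Fortran follows. The kernel `demo_kernel` and the buffer `buf_d` are placeholders standing in for the real code; only `cudaThreadSynchronize`, `cudaGetLastError`, and `cudaGetErrorString` come from the CUDA runtime interface in the `cudafor` module.

```fortran
module demo_kernel_m
  use cudafor
  implicit none
contains
  ! Placeholder kernel; substitute the real one.
  attributes(global) subroutine demo_kernel(buf, n)
    integer, value :: n
    integer :: buf(n)
    integer :: tid
    tid = threadIdx%x + (blockIdx%x - 1) * blockDim%x
    if (tid .le. n) buf(tid) = tid
  end subroutine demo_kernel
end module demo_kernel_m

program check_launch
  use cudafor
  use demo_kernel_m
  implicit none
  integer, parameter :: n = 64
  integer, device, allocatable :: buf_d(:)
  integer :: istat

  allocate(buf_d(n))

  call demo_kernel<<<1, n>>>(buf_d, n)

  ! Kernel launches are asynchronous, so wait for the kernel to finish
  ! before asking whether it failed.
  istat = cudaThreadSynchronize()
  if (istat .ne. cudaSuccess) then
     print *, 'sync error: ', trim(cudaGetErrorString(istat))
  end if

  ! Pick up any error recorded by the launch or by the kernel itself.
  istat = cudaGetLastError()
  if (istat .ne. cudaSuccess) then
     print *, 'kernel error: ', trim(cudaGetErrorString(istat))
  else
     print *, 'kernel completed without a reported error'
  end if

  deallocate(buf_d)
end program check_launch
```

A clean result here only rules out a launch or execution failure; it says nothing about whether the values the kernel wrote are correct.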

- Mat
Robert Link



Joined: 14 Jun 2010
Posts: 9
Location: Goddard Space Flight Center

Posted: Thu Jul 01, 2010 3:09 pm    Post subject:

mkcolg wrote:
Hi Robert,

I've sent your question to several others but unfortunately no one has seen this behavior before. One person wondered if your kernel could be failing and asked if you could add a call to 'cudaThreadSynchronize()' and 'cudaGetLastError()' after your kernel launch on the host.

- Mat


Mat,

cudaGetLastError() reported successful completion.

Nobody on the NVIDIA forums had seen similar behavior either. For now I've replaced the shared variables with local variables, which cures the divergence problem (at the cost of a significantly higher register count). When I have more time I'll circle back and try to figure out what was going on.
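
To make the workaround concrete, here is a sketch of the same update done on per-thread local copies. It is an illustration only: `step_local` and `trace_buf` are invented names, `rkind` is assumed to be double precision, and the value given to `MBETWEEN` is a placeholder, since none of these appear in the posts above. It also assumes each thread can compute the same values on its own, so no thread needs another thread's write.

```fortran
module local_state_m
  use cudafor
  implicit none
  integer, parameter :: rkind = kind(1.0d0)  ! assumption: double precision
  integer, parameter :: MBETWEEN = 20        ! placeholder value, not from the post
contains
  ! Hypothetical kernel: the state lives in per-thread locals (registers)
  ! instead of shared memory, so no tid guard or barrier is needed.
  attributes(global) subroutine step_local(trace_buf, nsteps)
    integer :: trace_buf(*)
    integer, value :: nsteps
    integer :: jeval, nslp, l3, tid
    real(rkind) :: hratio, drate

    tid = threadIdx%x

    ! Each thread starts from the same private state.
    jeval  = 1
    hratio = 0.0_rkind
    drate  = 0.0_rkind
    nslp   = 0

    ! Every thread performs the update redundantly on its own copy.
    if (jeval .eq. 1) then
       jeval  = 0
       hratio = 1.0_rkind
       nslp   = nsteps + MBETWEEN
       drate  = 0.7_rkind
    end if
    l3 = 0

    ! Record the value this thread sees.
    trace_buf(tid) = jeval
  end subroutine step_local
end module local_state_m

program run_local
  use cudafor
  use local_state_m
  implicit none
  integer, parameter :: nthreads = 32
  integer, device :: trace_d(nthreads)
  integer :: trace_h(nthreads)

  call step_local<<<1, nthreads>>>(trace_d, 100)
  trace_h = trace_d
  print *, 'threads still seeing jeval = 1: ', count(trace_h .eq. 1)
end program run_local
```

The trade-off is the one mentioned above: every thread now holds its own copy of each variable, which raises register usage per thread.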
All times are GMT - 7 Hours