PGI User Forum
Complex loop carried dependence

 
wsawyer



Joined: 19 Jan 2011
Posts: 14

Posted: Mon Dec 21, 2015 7:14 am    Post subject: Complex loop carried dependence

We are struggling to port a Fortran 2003 radiation parameterization to OpenACC (see separate correspondence), and are now working through compiler warnings to determine why the GPU version yields different output. The compiler is flagging a "Complex loop carried dependence" which, as far as I can tell, has no reason to exist:


Code:

  1941, Complex loop carried dependence of tmp_lay_src prevents parallelization
         Loop carried dependence of tmp_lay_src prevents parallelization
         Loop carried backward dependence of tmp_lay_src prevents vectorization


The 'offensive' code is:

Code:
 
   if (present(sfc_src)) then
      ilay = merge(1,nlay,play(1,1) > play(nlay,1)) ! surface layer index
      ALLOCATE( tmp_sfc_src( SIZE(sfc_src,1), SIZE(sfc_src,2) ) )
!$ACC DATA CREATE( tmp_sfc_src )
!$ACC PARALLEL PRESENT( pfrac )
!$ACC LOOP GANG
      do icol = 1, ncol
        ! tmp_sfc_src(icol,:) = pfrac(:,ilay,icol) * this%expand(plnksfc(:,icol))
        tmp_sfc_src(icol,:) = 0;
!$ACC LOOP VECTOR
        do iband=1,nband
          tmp_sfc_src(icol,this_band2gpt(1,iband):this_band2gpt(2,iband)) = &
            tmp_sfc_src(icol,this_band2gpt(1,iband):this_band2gpt(2,iband)) + &
            plnksfc(iband,icol)
        end do
        tmp_sfc_src(icol,:) = tmp_sfc_src(icol,:) * pfrac(:,ilay,icol)
      end do ! icol
!$ACC END PARALLEL
!$ACC UPDATE HOST( tmp_sfc_src )
      sfc_src = tmp_sfc_src
!$ACC END DATA
      DEALLOCATE( tmp_sfc_src )
    end if

Admittedly the array syntax might be confusing for the compiler, so I tried writing out the loops explicitly and even making the inner loop SEQ, but all to no avail:

Code:

           do iband=1,nband
-            tmp_lay_src(icol,ilay,this_band2gpt(1,iband):this_band2gpt(2,iband)) = &
-              tmp_lay_src(icol,ilay,this_band2gpt(1,iband):this_band2gpt(2,iband)) + &
-              plnklay(ilay,iband,icol)
+!$ACC LOOP SEQ
+            do this_band=this_band2gpt(1,iband),this_band2gpt(2,iband)
+              tmp_lay_src(icol,ilay,this_band) = &
+                tmp_lay_src(icol,ilay,this_band) + &
+                plnklay(ilay,iband,icol)
+            end do
           end do
-          tmp_lay_src(icol,ilay,:) = tmp_lay_src(icol,ilay,:) * pfrac(:,ilay,icol)
+          do igpt=1,ngpt
+            tmp_lay_src(icol,ilay,igpt) = tmp_lay_src(icol,ilay,igpt) * pfrac(igpt,ilay,icol)
+          end do
         end do ! ilay


Three questions: (1) Do you see any real dependency here? Conceptually, at least, there should be none. (2) How can one convince the compiler there is no dependence? And (3) is there any reason to believe that this perceived dependence could be the reason the OpenACC code is yielding different results than the CPU-only code?

Thanks, --Will
mkcolg



Joined: 30 Jun 2004
Posts: 6660
Location: The Portland Group Inc.

Posted: Mon Dec 21, 2015 9:54 am    Post subject:

Hi Will,

Quote:
(1) do you see any real dependency here? Conceptually, at least, there should be none.
Are the variables pointers? If so, this would cause dependencies since the compiler must assume that the variables may overlap in memory.

Quote:
(2) How can one convince the compiler there is no dependence?

In OpenACC, the workaround would be to add a loop directive, but this of course doesn't work for array syntax, given that there are no explicit loops. The other option is to expand the loops (as you've done) so the loop construct can be applied.
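For illustration, here is a minimal sketch of what the fully expanded form could look like. It is not Will's actual code: the sizes, the contents of band2gpt and the program name are made up, the layer index is dropped, and the band ranges are assumed to be disjoint. As I understand the OpenACC specification, a loop directive inside a PARALLEL region (without SEQ or AUTO) is treated as independent, so placing the directive is the programmer's assertion that the iterations do not conflict. Without OpenACC enabled the directives are plain comments and the program runs serially.

```fortran
program expanded_loops
  implicit none
  ! Hypothetical sizes and data, for illustration only
  integer, parameter :: ncol = 4, nband = 3, ngpt = 6
  integer :: band2gpt(2,nband)
  real    :: plnksfc(nband,ncol), pfrac(ngpt,ncol)
  real    :: src(ncol,ngpt), ref(ncol,ngpt)
  integer :: icol, iband, igpt

  ! Assumption: each band owns a disjoint, contiguous range of g-points
  band2gpt = reshape([1,2, 3,4, 5,6], [2,nband])
  do icol = 1, ncol
    do iband = 1, nband
      plnksfc(iband,icol) = real(iband + 10*icol)
    end do
    do igpt = 1, ngpt
      pfrac(igpt,icol) = 0.5 * real(igpt)
    end do
  end do

  ! Every array operation is written as an explicit loop, so each one
  ! can carry its own loop directive.
!$ACC PARALLEL LOOP GANG COPYIN(band2gpt, plnksfc, pfrac) COPYOUT(src)
  do icol = 1, ncol
!$ACC LOOP VECTOR
    do igpt = 1, ngpt
      src(icol,igpt) = 0.0
    end do
!$ACC LOOP SEQ
    do iband = 1, nband
!$ACC LOOP VECTOR
      do igpt = band2gpt(1,iband), band2gpt(2,iband)
        src(icol,igpt) = src(icol,igpt) + plnksfc(iband,icol)
      end do
    end do
!$ACC LOOP VECTOR
    do igpt = 1, ngpt
      src(icol,igpt) = src(icol,igpt) * pfrac(igpt,icol)
    end do
  end do

  ! Serial reference using array syntax, computed on the host
  ref = 0.0
  do icol = 1, ncol
    do iband = 1, nband
      ref(icol,band2gpt(1,iband):band2gpt(2,iband)) = plnksfc(iband,icol)
    end do
    ref(icol,:) = ref(icol,:) * pfrac(:,icol)
  end do

  if (any(abs(src - ref) > 1.0e-6)) error stop "mismatch against reference"
  print *, "expanded loops match the array-syntax reference"
end program expanded_loops
```

Comparing against a host-side reference like this is also a cheap way to narrow down which kernel first diverges from the CPU result.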

Alternatively, if these arrays could be changed to be allocatables, then the compiler most likely can auto-parallelize the implicit array syntax loops.
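To make the pointer point concrete, here is a small stand-alone sketch (all names are hypothetical, nothing here comes from the radiation code). Two pointers are deliberately associated with overlapping storage, so the loop over them carries a genuine dependence; the same loop over two allocatables cannot, because distinct allocatable arrays never share storage. The compiler has to assume the first situation whenever it sees pointers.

```fortran
program alias_demo
  implicit none
  real, target      :: buf(8)
  real, pointer     :: p(:), q(:)   ! may refer to overlapping storage
  real, allocatable :: a(:), b(:)   ! distinct allocations, never overlap
  integer :: i

  buf = [(real(i), i = 1, 8)]
  p => buf(1:7)
  q => buf(2:8)                     ! q(i) is the same storage as p(i+1)

  ! Real loop-carried dependence: iteration i writes q(i), which
  ! iteration i+1 then reads as p(i+1). Run in order, each element
  ! doubles the previous one.
  do i = 1, 7
    q(i) = 2.0 * p(i)
  end do
  if (buf(8) /= 128.0) error stop "unexpected serial result"

  ! Same loop body with allocatables: iterations are independent,
  ! so any execution order gives the same answer.
  allocate(a(7), b(7))
  a = [(real(i), i = 1, 7)]
  do i = 1, 7
    b(i) = 2.0 * a(i)
  end do
  if (b(7) /= 14.0) error stop "unexpected allocatable result"

  print *, "pointer loop: buf(8) =", buf(8), "  allocatable loop: b(7) =", b(7)
end program alias_demo
```

If the pointer loop were run with all reads happening before the writes, buf(8) would come out as 14 rather than 128, which is the kind of silent difference the compiler is guarding against.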

I've put in a feature request so that OpenACC might allow loop constructs on implicit loops such as these. If it is adopted into the standard, though, it won't be available for some time.


Quote:
(3) Is there any reason to believe that this perceived dependence could be the reason the OpenACC code is yielding different results than the CPU-only code?

Possible but difficult for me to tell for sure without analysis. My thought is that since the array assignment is at the gang level loop, all the vectors will execute the implicit array syntax loop redundantly. This could lead to a read/write race condition on tmp_lay_src since it appears on both the left and right-hand sides of the assignment.


My suggestion for this code is to not expand the arrays but instead make the "icol" loop a gang vector:

Code:
!$ACC PARALLEL PRESENT( pfrac )
!$ACC LOOP GANG VECTOR
      do icol = 1, ncol


The compiler will still complain about the dependencies since it will try and auto-parallelize the inner loops (unless you use the flag "-acc=noautopar"), but it will only parallelize the "icol" loop.

The caveat is that you'll lose some performance, especially if ncol is small, but this should hopefully produce correct answers. You can then work on manually expanding the loops to see if you can exploit more parallelism.
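As a rough, self-contained sketch of that suggestion (sizes, data values and the program name are invented for the example, and the layer index is left out), the array syntax stays untouched and only the "icol" loop is scheduled. Each column is then handled by exactly one vector lane, so no two lanes touch the same row of the work array:

```fortran
program gang_vector_icol
  implicit none
  ! Hypothetical sizes and data, for illustration only
  integer, parameter :: ncol = 8, nband = 2, ngpt = 4
  integer :: band2gpt(2,nband)
  real    :: plnksfc(nband,ncol), pfrac(ngpt,ncol), src(ncol,ngpt)
  integer :: icol, iband

  band2gpt = reshape([1,2, 3,4], [2,nband])
  do iband = 1, nband
    plnksfc(iband,:) = real(iband)   ! band 1 -> 1.0, band 2 -> 2.0
  end do
  pfrac = 2.0

!$ACC PARALLEL COPYIN(band2gpt, plnksfc, pfrac) COPYOUT(src)
!$ACC LOOP GANG VECTOR
  do icol = 1, ncol
    ! The implicit array-syntax loops below run sequentially
    ! within the single lane that owns this column.
    src(icol,:) = 0.0
    do iband = 1, nband
      src(icol,band2gpt(1,iband):band2gpt(2,iband)) = &
        src(icol,band2gpt(1,iband):band2gpt(2,iband)) + plnksfc(iband,icol)
    end do
    src(icol,:) = src(icol,:) * pfrac(:,icol)
  end do
!$ACC END PARALLEL

  ! g-points 1-2 belong to band 1, g-points 3-4 to band 2
  if (any(src(:,1:2) /= 2.0) .or. any(src(:,3:4) /= 4.0)) &
    error stop "unexpected result"
  print *, "gang vector over icol gives the expected values"
end program gang_vector_icol
```

Built without OpenACC the directives are ignored and the same source serves as the CPU reference, which makes it easy to compare the two results directly.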

Hope this helps,
Mat
All times are GMT - 7 Hours