PGI User Forum

Complex loop carried dependence

 
wsawyer



Joined: 19 Jan 2011
Posts: 7

Posted: Fri Nov 16, 2012 3:25 am    Post subject: Complex loop carried dependence

Hi Mat,

Thanks again for your tireless replies to the questions in this forum.

A number of issues are now coming up as I put OpenACC into the full application. I'll try to bring these up as separate topics and simplify each specific issue as far as possible. Here is the first.

The following triply-nested loop is repeated in similar form multiple times in the application.

Several levels above the kernel in the program hierarchy we have:

!$ACC DATA COPY( metrics_exner_ref_mc_d, z_exner_ex_pr_d, lots of other stuff)

:

Then:

c_startidx = GET_STARTIDX_C(rl_start,1)
c_endidx = GET_ENDIDX_C(rl_end, MAX(1,p_patch%n_childdom))

!$ACC KERNELS &
!$ACC PRESENT( metrics_exner_ref_mc_d, z_exner_ex_pr_d )
!$ACC LOOP GANG
DO jb = i_startblk, i_endblk

   IF ( i_startblk == jb ) THEN; i_startidx = c_startidx; ELSE; i_startidx = 1; ENDIF
   IF ( i_endblk == jb )   THEN; i_endidx = c_endidx;     ELSE; i_endidx = nproma; ENDIF

!$ACC LOOP VECTOR COLLAPSE(2) !!!! BUG: COLLAPSE(2) causes 12.10 to crash!
   DO jc = i_startidx, i_endidx
!DIR$ VECTOR
      DO jk = 1, nlev
         z_exner_ex_pr_d(jc,jk,jb) = - metrics_exner_ref_mc_d(jc,jk,jb)
      ENDDO
   ENDDO
ENDDO
!$ACC END KERNELS

First: if "COLLAPSE(2)" is present, the 12.10 compiler crashes with the error:


pgfortran-Fatal-/apps/castor/pgi-12.10/linux86-64/12.10/bin/pgf902 TERMINATED by signal 11
Arguments to /apps/castor/pgi-12.10/linux86-64/12.10/bin/pgf902
/apps/castor/pgi-12.10/linux86-64/12.10/bin/pgf902 /tmp/pgfortranllqfHP_u9pbx.ilm -fn ../../../src/atm_dyn_iconam/mo_solve_nonhydro.f90 -opt 3 -terse 1 -inform warn -x 51 0x20 -x 119 0xa10000 -x 122 0x40 -x 123 0x1000 -x 127 4 -x 127 17 -x 19 0x400000 -x 28 0x40000 -x 120 0x10000000 -x 70 0x8000 -x 122 1 -x 125 0x20000 -quad -vect 56 -y 34 16 -x 34 0x8 -x 32 12582912 -y 19 8 -y 35 0 -x 42 0x30 -x 39 0x40 -x 39 0x80 -x 34 0x400000 -x 149 1 -x 150 1 -x 59 4 -x 59 4 -tp nehalem -x 120 0x1000 -x 124 0x1400 -y 15 2 -x 57 0x3b0000 -x 58 0x48000000 -x 49 0x100 -x 120 0x200 -astype 0 -x 121 1 -x 124 1 -x 9 1 -x 42 0x14200000 -x 72 0x1 -x 136 0x11 -x 80 0x800000 -quad -x 119 0x10000000 -x 129 0x40000000 -x 129 2 -x 164 0x1000 -x 186 0x80000 -x 180 0x400 -x 180 0x4000000 -x 163 0x1 -x 186 0x80000 -x 180 0x400 -x 180 0x4000000 -x 186 2 -accel nvidia -x 176 0x140000 -x 177 0x0202007f -x 0 0x1000000 -x 2 0x100000 -x 0 0x2000000 -x 161 16384 -x 162 16384 -cmdline '+pgfortran ../../../src/atm_dyn_iconam/mo_solve_nonhydro.f90 -I../include -I../../../src/include -I/apps/castor/zlib/1.2.7/gnu_463/install/include -I/apps/castor/mvapich2/1.8.1/mvapich2-pgi/include -O3 -fastsse -fast -Mvect=sse -Mscalarsse -Mcache_align -Mflushz -Mpre -I../module -D__ICON__ -D__LOOP_EXCHANGE -DPGI_COMPILER -DNO_NETCDF -DDSL_INLINE= -acc -ta=nvidia -Minfo=accel -Mpreprocess -c -I/apps/castor/mvapich2/1.8.1/mvapich2-pgi/include -I/apps/castor/mvapich2/1.8.1/mvapich2-pgi/include' -asm /tmp/pgfortrantlqf54gApNPC.sm

OK, the crash is obviously an issue, but it is clear from many other similar loops that the compiler does not want to parallelize the i_startidx, i_endidx loop. If you look at the IF statements just above it, you will see that the loop is almost rectangular, with only indentations for the first and last gang. If there is a way to express this which will help the compiler parallelize this loop, please let me know.
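
One reformulation I have been wondering about (an untested sketch on my side, using the c_startidx/c_endidx values computed above and assuming the per-cell guard is cheap enough) keeps the inner bounds rectangular and masks out the edge cells of the first and last block:

!$ACC KERNELS PRESENT( metrics_exner_ref_mc_d, z_exner_ex_pr_d )
!$ACC LOOP GANG
DO jb = i_startblk, i_endblk
!$ACC LOOP VECTOR COLLAPSE(2)
   DO jc = 1, nproma          ! rectangular bounds for every block
      DO jk = 1, nlev
         ! skip cells outside the first/last block's valid index range
         IF ( ( jb /= i_startblk .OR. jc >= c_startidx ) .AND. &
              ( jb /= i_endblk   .OR. jc <= c_endidx ) ) THEN
            z_exner_ex_pr_d(jc,jk,jb) = - metrics_exner_ref_mc_d(jc,jk,jb)
         ENDIF
      ENDDO
   ENDDO
ENDDO
!$ACC END KERNELS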

For what it is worth, the prototype CUDA Fortran implementation of the same code is:

! loop through all patch cells (and blocks)
!
jb = blockidx%x + ( i_startblk - 1 )
jc = threadidx%x
jk = threadidx%y   ! [1 .. nlev]

IF ( ( i_startblk <  jb .and. jb <  i_endblk   ) .or. &
     ( i_startblk == jb .and. i_startidx <= jc ) .or. &
     ( i_endblk   == jb .and. jc <= i_endidx   ) ) THEN

   ! extrapolated perturbation Exner pressure (used for horizontal gradients only)
   z_exner_ex_pr(jc,jk,jb) = - exner_ref_mc(jc,jk,jb)
ENDIF

And this works great (pity we cannot use CUDA Fortran in the real application).
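
For context, the host-side launch of that prototype is roughly the following (just a sketch; the kernel and argument names here are placeholders, and it assumes nproma*nlev stays within the threads-per-block limit):

! inside the host routine (CUDA Fortran); kernel and argument names are illustrative only
USE cudafor
TYPE(dim3) :: blocks, threads
threads = dim3( nproma, nlev, 1 )                  ! one thread per (cell, level) pair
blocks  = dim3( i_endblk - i_startblk + 1, 1, 1 )  ! one thread block per patch block
CALL solve_nonhydro_kernel<<<blocks, threads>>>( z_exner_ex_pr_d, exner_ref_mc_d, &
                                                 i_startblk, i_endblk, c_startidx, c_endidx )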

If the COLLAPSE(2) is removed, we get the warning:

22, Complex loop carried dependence of 'metrics_exner_ref_mc_d' prevents parallelization
Complex loop carried dependence of 'z_exner_ex_pr_d' prevents parallelization
Inner sequential loop scheduled on accelerator
Loop is parallelizable

The "complex loop carried dependence" (CLCD) does not make sense to me. I've read some of your replies on CLCD when indirect addressing is involved and understand it in those cases, but that would not be the case here.

Of course, the actual calculation inside the loop is much more complicated, and I now have an OpenACC version (albeit with the above CLCD warning) which produces valid results. I now want to remove the CLCD issue.

Thanks, --Will
mkcolg



Joined: 30 Jun 2004
Posts: 6120
Location: The Portland Group Inc.

Posted: Fri Nov 16, 2012 1:57 pm

Hi Will,

If you could send a reproducing example for the pgf902 segv to PGI Customer Service (trs@pgroup.com), we would appreciate it. Obviously it's a major compiler issue that we'd like to get fixed, and a reproducer would also help in determining the source of the complex loop dependence.

My best guess is that the problems stem from the fact that the vector length is uniform for all gangs, but the "jc" loop bound is variable. This may be causing an unexpected code path to be taken in the compiler, though we can't be sure until we can reproduce the error.

As a workaround, you may try interchanging the "jk" and "jc" loops, or pushing "jk" above the IF statements.

Something like:

Code:
!$ACC LOOP VECTOR
DO jk = 1, nlev
   DO jc = i_startidx, i_endidx
      z_exner_ex_pr_d(jc,jk,jb) = - metrics_exner_ref_mc_d(jc,jk,jb)
   ENDDO
ENDDO
ENDDO
!$ACC END KERNELS


or
Code:

!$ACC KERNELS &
!$ACC PRESENT( metrics_exner_ref_mc_d, z_exner_ex_pr_d )
DO jb = i_startblk, i_endblk
   DO jk = 1, nlev

      IF ( i_startblk == jb ) THEN; i_startidx = c_startidx; ELSE; i_startidx = 1; ENDIF
      IF ( i_endblk == jb )   THEN; i_endidx = c_endidx;     ELSE; i_endidx = nproma; ENDIF

      DO jc = i_startidx, i_endidx
         z_exner_ex_pr_d(jc,jk,jb) = - metrics_exner_ref_mc_d(jc,jk,jb)
      ENDDO
   ENDDO
ENDDO
!$ACC END KERNELS
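
Another option, if you are confident the iterations really are independent, is to assert this explicitly with the loop directive's "independent" clause so the compiler skips its own dependence analysis. Something along these lines (untested sketch):

Code:
!$ACC LOOP VECTOR INDEPENDENT
DO jc = i_startidx, i_endidx
   DO jk = 1, nlev
      z_exner_ex_pr_d(jc,jk,jb) = - metrics_exner_ref_mc_d(jc,jk,jb)
   ENDDO
ENDDO

Only use this where you know there is no overlap between iterations, since it overrides the compiler's checks.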


- Mat