PGI User Forum

Unknown 8GB memory getting allocated on GPU
anirbanjana



Joined: 11 Aug 2012
Posts: 28

Posted: Tue Sep 24, 2013 11:34 am    Post subject: Unknown 8GB memory getting allocated on GPU

Hi,
I am getting an out-of-memory error on the GPU just before one of my GPU kernels is launched. To investigate further, I ran the code with PGI_ACC_DEBUG=1 and found that a memory allocation request for 8GB is made on the device, but I am not sure for which variable. The runtime seemed to finish moving an array named neighbours just before the error. Below is an excerpt:
Code:

...
pgi_uacc_dataon(devptr=0x60,hostptr=0x7fa1a11a6860,offset=0,0,stride=1,18486,size=18471x27,extent=18486x27,eltsize=4,lineno=404,name=neighbours,flags=0x2700=create+present+copyin+inexact,threadid=1)
pgi_uacc_dataon( devid=1, threadid=1 ) dindex=1
NO map for host:0x7fa1a11a6860
pgi_uacc_alloc(size=1996488,devid=1,threadid=1)
pgi_uacc_alloc(size=1996488,devid=1,threadid=1) returns 0xb02500000
map    dev:0xb02500000 host:0x7fa1a11a6860 size:1996488 offset:0 data[dev:0xb02500000 host:0x7fa1a11a6860 size:1996488] (line:404 name:neighbours) dims=18486x27
alloc done with devptr at 0xb02500000
pgi_uacc_pin(devptr=0x0,hostptr=0x7fa1a11a6860,offset=0,0,stride=1,18486,size=18471x27,extent=18486x27,eltsize=4,lineno=404,name=neighbours,flags=0x0,threadid=1)
MemHostRegister( 0x7fa1a11a6860, 1996488, 0 )
pgi_uacc_dataupx(devptr=0xb02500000,hostptr=0x7fa1a11a6860,offset=0,0,stride=1,18486,size=18471x27,extent=18486x27,eltsize=4,lineno=404,name=neighbours,flags=0x0,threadid=1)
pgi_uacc_cuda_dataup2(devdst=0xb02500000,hostsrc=0x7fa1a11a6860,offset=0,0,stride=1,18486,size=18471,27,eltsize=4,lineno=404,name=neighbours)
pgi_uacc_datadone( async=-1, devid=1 )
pgi_uacc_cuda_wait(lineno=-1,async=-1,dindex=1)
pgi_uacc_cuda_wait(sync on stream=(nil))
pgi_uacc_cuda_wait done
pgi_uacc_alloc(size=443304,devid=1,threadid=1)
pgi_uacc_alloc(size=443304,devid=1,threadid=1) returns 0xb0246c600
pgi_uacc_alloc(size=443304,devid=1,threadid=1)
pgi_uacc_alloc(size=443304,devid=1,threadid=1) returns 0xb02700000
pgi_uacc_alloc(size=443304,devid=1,threadid=1)
pgi_uacc_alloc(size=443304,devid=1,threadid=1) returns 0xb0276c400
pgi_uacc_alloc(size=443304,devid=1,threadid=1)
pgi_uacc_alloc(size=443304,devid=1,threadid=1) returns 0xb02800000
pgi_uacc_alloc(size=443304,devid=1,threadid=1)
pgi_uacc_alloc(size=443304,devid=1,threadid=1) returns 0xb0286c400
pgi_uacc_alloc(size=443304,devid=1,threadid=1)
pgi_uacc_alloc(size=443304,devid=1,threadid=1) returns 0xb02900000
pgi_uacc_alloc(size=443304,devid=1,threadid=1)
pgi_uacc_alloc(size=443304,devid=1,threadid=1) returns 0xb0296c400
pgi_uacc_alloc(size=443304,devid=1,threadid=1)
pgi_uacc_alloc(size=443304,devid=1,threadid=1) returns 0xb02a00000
pgi_uacc_alloc(size=443304,devid=1,threadid=1)
pgi_uacc_alloc(size=443304,devid=1,threadid=1) returns 0xb02a6c400
pgi_uacc_alloc(size=443304,devid=1,threadid=1)
pgi_uacc_alloc(size=443304,devid=1,threadid=1) returns 0xb02b00000
pgi_uacc_alloc(size=443304,devid=1,threadid=1)
pgi_uacc_alloc(size=443304,devid=1,threadid=1) returns 0xb02b6c400
pgi_uacc_alloc(size=443304,devid=1,threadid=1)
pgi_uacc_alloc(size=443304,devid=1,threadid=1) returns 0xb02c00000
pgi_uacc_alloc(size=443304,devid=1,threadid=1)
pgi_uacc_alloc(size=443304,devid=1,threadid=1) returns 0xb02c6c400
pgi_uacc_alloc(size=443304,devid=1,threadid=1)
pgi_uacc_alloc(size=443304,devid=1,threadid=1) returns 0xb02d00000
pgi_uacc_alloc(size=443304,devid=1,threadid=1)
pgi_uacc_alloc(size=443304,devid=1,threadid=1) returns 0xb02d6c400
pgi_uacc_alloc(size=443304,devid=1,threadid=1)
pgi_uacc_alloc(size=443304,devid=1,threadid=1) returns 0xb02e00000
pgi_uacc_alloc(size=443304,devid=1,threadid=1)
pgi_uacc_alloc(size=443304,devid=1,threadid=1) returns 0xb02e6c400
pgi_uacc_alloc(size=8194917744,devid=1,threadid=1)
call to cuMemAlloc returned error 2: Out of memory
  P0
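
(For reference, the 1,996,488-byte allocation in the log matches neighbours exactly: 18,486 x 27 elements x 4 bytes = 1,996,488 bytes, so that array itself transferred fine; the 8GB request must be for something else.)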


I ran it step by step using cuda-gdb. I got the memory error again, but saw no new messages that shed more light on the cause of the 8GB allocation. Below is an excerpt from cuda-gdb:
Code:

[Launch of CUDA Kernel 1 (calc_force_des_150_gpu<<<(18471,1,1),(256,1,1)>>>) on Device 0]
[Launch of CUDA Kernel 2 (calc_force_des_180_gpu_red<<<(1,1,1),(256,1,1)>>>) on Device 0]
[Termination of CUDA Kernel 1 (calc_force_des_150_gpu<<<(18471,1,1),(256,1,1)>>>) on Device 0]

Breakpoint 1, calc_force_des () at calc_force_des.f:404
404     !$acc parallel
(cuda-gdb) s
__pgi_uacc_dataon (filename=0xae5560 "/lvol/home/anirban/mfix/mfix.0011b/model/./des/calc_force_des.f",
    funcname=0xae55a0 "calc_force_des", pdevptr=0x7fffffffd920, hostptr=0x7fffefd96020, dims=2, desc=0x7fffffffd310, elementsize=8,
    lineno=404, name=0xae59d7 "des_pos_new", flags=9984, async=-1, devid=1) at dataon.c:39
39      dataon.c: No such file or directory.
        in dataon.c
(cuda-gdb) s
43      in dataon.c
(cuda-gdb) s
48      in dataon.c
(cuda-gdb) s
50      in dataon.c
(cuda-gdb) s
52      in dataon.c
(cuda-gdb)
54      in dataon.c
(cuda-gdb)
55      in dataon.c
(cuda-gdb)
56      in dataon.c
(cuda-gdb)
57      in dataon.c
(cuda-gdb)
64      in dataon.c
(cuda-gdb)
66      in dataon.c
(cuda-gdb)
69      in dataon.c
(cuda-gdb)
74      in dataon.c
(cuda-gdb)
77      in dataon.c
(cuda-gdb)
__pgi_uacc_adjust (pdims=0x7fffffffcebc, desc=0x7fffffffd310) at adjust.c:31
31      adjust.c: No such file or directory.
        in adjust.c
(cuda-gdb)
32      in adjust.c
(cuda-gdb)
34      in adjust.c
(cuda-gdb)
36      in adjust.c
(cuda-gdb)
37      in adjust.c
(cuda-gdb)
42      in adjust.c
(cuda-gdb)
43      in adjust.c
(cuda-gdb)
44      in adjust.c
(cuda-gdb)
50      in adjust.c
(cuda-gdb)
56      in adjust.c
(cuda-gdb)
34      in adjust.c
(cuda-gdb)
36      in adjust.c
(cuda-gdb)
37      in adjust.c
(cuda-gdb)
42      in adjust.c
(cuda-gdb)
43      in adjust.c
(cuda-gdb)
44      in adjust.c
(cuda-gdb)
50      in adjust.c
(cuda-gdb)
56      in adjust.c
(cuda-gdb)
34      in adjust.c
(cuda-gdb)
67      in adjust.c
(cuda-gdb)
68      in adjust.c
(cuda-gdb)
78      in adjust.c
(cuda-gdb)
82      in adjust.c
(cuda-gdb)
84      in adjust.c
(cuda-gdb)
85      in adjust.c
(cuda-gdb)
86      in adjust.c
(cuda-gdb)
87      in adjust.c
(cuda-gdb)
67      in adjust.c
(cuda-gdb)
92      in adjust.c
(cuda-gdb)
93      in adjust.c
(cuda-gdb)
94      in adjust.c
(cuda-gdb)
__pgi_uacc_dataon (filename=0xae5560 "/lvol/home/anirban/mfix/mfix.0011b/model/./des/calc_force_des.f",
    funcname=0xae55a0 "calc_force_des", pdevptr=0x7fffffffd920, hostptr=0x7fffefd96020, dims=1, desc=0x7fffffffd310, elementsize=8,
    lineno=404, name=0xae59d7 "des_pos_new", flags=9984, async=-1, devid=1) at dataon.c:78
78      dataon.c: No such file or directory.
        in dataon.c
(cuda-gdb)
87      in dataon.c
(cuda-gdb)
88      in dataon.c
(cuda-gdb)
92      in dataon.c
(cuda-gdb)

Program received signal SIGTRAP, Trace/breakpoint trap.
[Switching to Thread 0x7fffe8943700 (LWP 14598)]
0x00000036fe8de2f3 in select () from /lib64/libc.so.6
(cuda-gdb)
Single stepping until exit from function select,
which has no line number information.

warning: Cuda API error detected: cuMemAlloc_v2 returned (0x2)

call to cuMemAlloc returned error 2: Out of memory
[Thread 0x7fffe8943700 (LWP 14598) exited]

Program exited with code 01.
[Termination of CUDA Kernel 2 (calc_force_des_180_gpu_red<<<(1,1,1),(256,1,1)>>>) on Device 0]
[Termination of CUDA Kernel 0 (desgrid_neigh_build_gpu_507_gpu<<<(145,1,1),(128,1,1)>>>) on Device 0]
(cuda-gdb)
The program is not being run.



Finally, running the code with cuda-memcheck causes a hang. "ps x" shows the following:
Code:

...
14455 pts/0    S+     0:00 cuda-memcheck mfix.exe
14456 pts/0    Rl+   14:17 mfix.exe
...


Any advice on how to proceed to fix this would be great.
Thanks much,
Anirban
mkcolg



Joined: 30 Jun 2004
Posts: 6215
Location: The Portland Group Inc.

Posted: Tue Sep 24, 2013 12:31 pm

Hi Anirban,

Can you post the OpenACC directives you're using at line 150 of calc_force_des.f? Are you privatizing any arrays?

My guess is that you're privatizing a large array, which will create a unique copy for each thread.
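
A hint from your log: each of the repeated 443,304-byte allocations just before the failure is exactly 24 bytes (a 3-element double-precision vector) times 18,471, the trip count of your outer loop, and the failed request of 8,194,917,744 bytes is exactly 443,304 x 18,486 (the extent of your outer array dimension in the log). So it looks like the runtime is replicating private arrays once per outer-loop iteration, and one of them needs roughly 8GB on its own.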

- Mat
anirbanjana



Joined: 11 Aug 2012
Posts: 28

Posted: Wed Sep 25, 2013 10:15 am

Hi Mat,
You are right! I am privatizing some arrays. I have a 3-level nested DO loop: the outer loop is large (it runs over the number of particles), the inner ones are much smaller, and I am asking the compiler to execute the two inner loops sequentially. I have posted the code below; it starts at Line 404. (The loop at Line 150 is in fact fine, since the code works when I comment out the ACC directives around the loop starting at Line 404.)

Code:

!$acc data copy(tmp_ax(dimn))
!$acc parallel
!$acc loop  private(LL, PFT_TMP, DIST, DIST_CL, DIST_CI, OMEGA_SUM, VRELTRANS, V_ROT, NORMAL, TANGENT, VSLIP, &
!$acc&              DTSOLID_TMP, DIST_OLD, VRN_OLD, NORMAL_OLD, sigmat, sigmat_old, FNS1, FNS2, FN, norm_old) &
!$acc&              reduction(+: DIST0_EVENTS, NEG_NORM_VEL_1ST_CONTACT_EVENTS, EXCESSIVE_OVERLAP_EVENTS)
! phase_I, phase_LL

      DO LL=1,MAX_PIP
         IF(.NOT.PEA(LL,1) .OR. PEA(LL,4) ) CYCLE

         PFT_TMP(:) = ZERO
         PARTICLE_SLIDE = .FALSE.

! Check particle LL neighbour contacts
!---------------------------------------------------------------------

!$acc loop seq
            DO II = 2, NEIGHBOURS(LL,1)+1
               I = NEIGHBOURS(LL,II)
               IF(PEA(I,1)) THEN
                  ALREADY_IN_CONTACT =.FALSE.

                  DO NEIGH_L = 2, PN(LL,1)+1
                     IF(I.EQ. PN(LL,NEIGH_L)) THEN
                        ALREADY_IN_CONTACT =.TRUE.
                        NI = NEIGH_L
                        EXIT
                     ENDIF
                  ENDDO

......
......

 300           CONTINUE
            ENDDO            ! DO II = 2, NEIGHBOURS(LL,1)+1

!---------------------------------------------------------------------
! End check particle LL neighbour contacts

      ENDDO   ! end loop over particles LL checking particle-particle contact
!$acc end parallel
!$acc end data


I have also attached the relevant compiler output below.
Code:

pgf90    -O -Mdalign -acc -ta=nvidia,time -Minfo=inline,accel -Munixlogical -c -I. -Mnosave -Mfreeform -Mrecursive -Mreentrant -byteswapio -Minline=name:des_crossprdct_2d,name:des_crossprdct_3d  ./des/calc_force_des.f
calc_force_des:
    149, Generating copy(pn(1:max_pip,1:maxneighbors))
         Generating copy(pv(1:max_pip,1:maxneighbors))
         Generating copy(pfn(1:max_pip,1:maxneighbors,1:dimn))
         Generating copy(pft(1:max_pip,1:maxneighbors,1:dimn))
         Generating copy(pea(1:max_pip,1:4))
    150, Accelerator kernel generated
        152, !$acc loop gang ! blockidx%x
        166, !$acc loop vector(256) ! threadidx%x
        168, !$acc loop vector(256) ! threadidx%x
        177, !$acc loop vector(256) ! threadidx%x
        179, !$acc loop vector(256) ! threadidx%x
        186, !$acc loop vector(256) ! threadidx%x
        188, !$acc loop vector(256) ! threadidx%x
        195, !$acc loop vector(256) ! threadidx%x
    150, Generating present_or_copy(pea(1:max_pip,1:4))
         Generating present_or_copy(pn(1:max_pip,1:maxneighbors))
         Generating present_or_copy(pfn(1:max_pip,1:maxneighbors,1:dimn))
         Generating present_or_copy(pv(1:max_pip,1:maxneighbors))
         Generating present_or_copy(pft(1:max_pip,1:maxneighbors,1:dimn))
         Generating NVIDIA code
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
    152, Scalar last value needed after loop for 'ni' at line 561
         Scalar last value needed after loop for 'ni' at line 644
         Scalar last value needed after loop for 'ni' at line 649
         Scalar last value needed after loop for 'ni' at line 647
         Scalar last value needed after loop for 'ni' at line 515
         Scalar last value needed after loop for 'ni' at line 328
         Scalar last value needed after loop for 'ni' at line 381
         Scalar last value needed after loop for 'ni' at line 387
         Scalar last value needed after loop for 'ni' at line 385
         Scalar last value needed after loop for 'ni' at line 287
    157, Complex loop carried dependence of 'pn' prevents parallelization
         Loop carried dependence due to exposed use of 'pn(i1+1,1:maxneighbors)' prevents parallelization
         Loop carried dependence of 'pn' prevents parallelization
         Loop carried backward dependence of 'pn' prevents vectorization
         Loop carried scalar dependence for 'shift' at line 160
         Loop carried dependence of 'pft' prevents parallelization
         Loop carried backward dependence of 'pft' prevents vectorization
         Loop carried dependence due to exposed use of 'pft(i1+1,1:maxneighbors,:)' prevents parallelization
         Loop carried dependence of 'pfn' prevents parallelization
         Loop carried backward dependence of 'pfn' prevents vectorization
         Loop carried dependence due to exposed use of 'pfn(i1+1,1:maxneighbors,:)' prevents parallelization
    166, Loop is parallelizable
    168, Loop is parallelizable
    176, Loop carried dependence due to exposed use of 'pn(i1+1,1:maxneighbors)' prevents parallelization
    177, Loop is parallelizable
         Loop carried reuse of 'pft' prevents parallelization
    179, Loop is parallelizable
         Loop carried reuse of 'pfn' prevents parallelization
    180, Max reduction generated for neigh_max
    186, Loop is parallelizable
    188, Loop is parallelizable
    195, Loop is parallelizable
    407, Generating copy(tmp_ax(:dimn))
    408, Accelerator kernel generated
        414, !$acc loop gang ! blockidx%x
        417, !$acc loop vector(256) ! threadidx%x
        439, !$acc loop vector(256) ! threadidx%x
        456, !$acc loop vector(256) ! threadidx%x
        468, !$acc loop vector(256) ! threadidx%x
        488, !$acc loop vector(256) ! threadidx%x
        495, !$acc loop vector(256) ! threadidx%x
        501, !$acc loop vector(256) ! threadidx%x
        503, !$acc loop vector(256) ! threadidx%x
        526, !$acc loop vector(256) ! threadidx%x
        528, !$acc loop vector(256) ! threadidx%x
        561, !$acc loop vector(256) ! threadidx%x
        563, !$acc loop vector(256) ! threadidx%x
        577, !$acc loop vector(256) ! threadidx%x
        582, !$acc loop vector(256) ! threadidx%x
        586, !$acc loop vector(256) ! threadidx%x
        593, !$acc loop vector(256) ! threadidx%x
        597, !$acc loop vector(256) ! threadidx%x
        603, !$acc loop vector(256) ! threadidx%x
        607, !$acc loop vector(256) ! threadidx%x
        616, !$acc loop vector(256) ! threadidx%x
        618, !$acc loop vector(256) ! threadidx%x
        622, !$acc loop vector(256) ! threadidx%x
        629, !$acc loop vector(256) ! threadidx%x
        630, !$acc loop vector(256) ! threadidx%x
        641, !$acc loop vector(256) ! threadidx%x
        644, !$acc loop vector(256) ! threadidx%x
        647, !$acc loop vector(256) ! threadidx%x
        649, !$acc loop vector(256) ! threadidx%x
    408, Generating present_or_copyin(des_pos_new(:,:dimn+des_pos_new$sd-1))
         Generating present_or_copyin(pea(:,1:4))
         Generating present_or_copy(pn(1:max_pip,:))
         Generating present_or_copy(fc(1:max_pip,:))
         Generating copyin(fn_tmp(:))
         Generating copyout(fn_tmp(:dimn))
         Generating present_or_copy(tang_old(:3))
         Generating present_or_copy(pft(1:max_pip,:,:))
         Generating present_or_copy(pfn(1:max_pip,:,:))
         Generating present_or_copyin(des_etat(:,:))
         Generating present_or_copyin(des_etan(:,:))
         Generating present_or_copyin(hert_kt(:,:))
         Generating present_or_copyin(hert_kn(:,:))
         Generating present_or_copyin(des_vel_old(:,:dimn+des_vel_old$sd-1))
         Generating present_or_copyin(des_pos_old(:,:dimn+des_pos_old$sd-1))
         Generating present_or_copyin(omega_new(:,:dimn+omega_new$sd-1))
         Generating present_or_copyin(des_radius(:))
         Generating present_or_copyin(des_vel_new(:,:dimn+des_vel_new$sd-1))
         Generating present_or_copy(pv(1:max_pip,:))
         Generating present_or_copyin(pijk(:,5))
         Generating present_or_copy(tmp_ax(:dimn))
         Generating copyin(tang_new(:))
         Generating copyout(tang_new(:3))
         Generating present_or_copy(ft(1:max_pip,:))
         Generating copyin(fts1(:))
         Generating copyout(fts1(:dimn))
         Generating copyin(fts2(:))
         Generating copyout(fts2(:dimn))
         Generating copyin(ft_tmp(:))
         Generating copyout(ft_tmp(:dimn))
         Generating present_or_copy(tow(1:max_pip,:))
         Generating copyin(crossp(:))
         Generating copyout(crossp(:3))
         Generating present_or_copyin(neighbours(1:max_pip,:))
         Generating NVIDIA code
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
    414, Accelerator restriction: scalar variable live-out from loop: particle_slide
         Accelerator restriction: scalar variable live-out from loop: v_rel_trans_tang
         Accelerator restriction: scalar variable live-out from loop: v_rel_trans_norm
         Accelerator restriction: scalar variable live-out from loop: distmod
    417, Loop is parallelizable
    424, Accelerator restriction: scalar variable live-out from loop: particle_slide
         Accelerator restriction: scalar variable live-out from loop: v_rel_trans_tang
         Accelerator restriction: scalar variable live-out from loop: v_rel_trans_norm
         Accelerator restriction: scalar variable live-out from loop: distmod
    431, Accelerator restriction: induction variable live-out from loop: neigh_l
    436, Accelerator restriction: induction variable live-out from loop: neigh_l
    439, Loop is parallelizable
    444, Max reduction generated for overlap_max
    456, Loop is parallelizable
    468, Loop is parallelizable
    480, des_crossprdct_3d inlined, size=8, file ./des/calc_force_des.f (827)
    483, des_crossprdct_2d inlined, size=4, file ./des/calc_force_des.f (805)
    488, Loop is parallelizable
    495, Loop is parallelizable
    501, Loop is parallelizable
    503, Loop is parallelizable
    526, Loop is parallelizable
    528, Loop is parallelizable
    561, Loop is parallelizable
    563, Loop is parallelizable
    574, des_crossprdct_3d inlined, size=8, file ./des/calc_force_des.f (827)
    577, Loop is parallelizable
    579, des_crossprdct_3d inlined, size=8, file ./des/calc_force_des.f (827)
    581, des_crossprdct_3d inlined, size=8, file ./des/calc_force_des.f (827)
    582, Loop is parallelizable
    586, Loop is parallelizable
    593, Loop is parallelizable
    597, Loop is parallelizable
    603, Loop is parallelizable
    607, Loop is parallelizable
    616, Loop is parallelizable
    618, Loop is parallelizable
    622, Loop is parallelizable
    629, Loop is parallelizable
    630, Loop is parallelizable
    635, des_crossprdct_3d inlined, size=8, file ./des/calc_force_des.f (827)
    641, Loop is parallelizable
    644, Loop is parallelizable
    647, Loop is parallelizable
    649, Loop is parallelizable
    692, Accelerator kernel generated
        694, !$acc loop gang ! blockidx%x
        702, !$acc loop vector(256) ! threadidx%x
        718, !$acc loop vector(256) ! threadidx%x
        734, !$acc loop vector(256) ! threadidx%x
        752, !$acc loop vector(256) ! threadidx%x
    692, Generating present_or_copyin(des_pos_new(:,:dimn+des_pos_new$sd-1))
         Generating present_or_copyin(des_radius(:))
         Generating present_or_copyin(pea(:,1:4))
         Generating present_or_copy(fcohesive(1:max_pip,1:dimn))
         Generating present_or_copyin(w_pos_l(1:max_pip,1:nwalls,1:dimn))
         Generating copyin(dist(:))
         Generating copyout(dist(:dimn))
         Generating present_or_copyin(neighbours(1:max_pip,:))
         Generating NVIDIA code
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
    700, Loop is parallelizable
    702, Loop is parallelizable
    718, Loop is parallelizable
    729, Loop is parallelizable
    734, Loop is parallelizable
    752, Loop is parallelizable


Finally, how should I handle Fortran vector assignments, which I think are treated as tiny (3x1) loops, like
Code:

DIST(:) = DES_POS_NEW(I,:) - DES_POS_NEW(LL,:)

Do I have to put
Code:
!$acc loop seq
before each of these, i.e., rewrite each assignment as an explicit sequential loop (sketched below)? I'm also not clear on why the compiler uses vector(256) to parallelize these tiny loops.
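
By an explicit loop I mean something like this (just a sketch to illustrate; K is a scratch loop index I would introduce):
Code:

!$acc loop seq
DO K = 1, DIMN
   ! elementwise form of DIST(:) = DES_POS_NEW(I,:) - DES_POS_NEW(LL,:)
   DIST(K) = DES_POS_NEW(I,K) - DES_POS_NEW(LL,K)
ENDDO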

I can send the full code by any means you suggest.

Thanks much for your advice.
Anirban
anirbanjana



Joined: 11 Aug 2012
Posts: 28

Posted: Wed Sep 25, 2013 10:19 am

Quick comment ... the innermost DO loop is also designated as sequential; I posted a slightly older version of the code above where it isn't:

Code:

!$acc loop seq
                 DO NEIGH_L = 2, PN(LL,1)+1
                     IF(I.EQ. PN(LL,NEIGH_L)) THEN
                        ALREADY_IN_CONTACT =.TRUE.
                        NI = NEIGH_L
                        EXIT
                     ENDIF
                  ENDDO
mkcolg



Joined: 30 Jun 2004
Posts: 6215
Location: The Portland Group Inc.

Posted: Wed Sep 25, 2013 1:07 pm

Quote:
!$acc loop private(LL, PFT_TMP, DIST, DIST_CL, DIST_CI, OMEGA_SUM, VRELTRANS, V_ROT, NORMAL, TANGENT, VSLIP, &
!$acc& DTSOLID_TMP, DIST_OLD, VRN_OLD, NORMAL_OLD, sigmat, sigmat_old, FNS1, FNS2, FN, norm_old) &
How are these variables declared? Are any of them large arrays?

When you privatize a variable, every thread gets its own copy. Since you have over 18,000 threads, if any of these are large arrays then you'll be using up a lot of memory.

If your algorithm requires the use of private, then your other option is to use a fixed gang and vector size (i.e., use "num_gangs" and "vector_length"), as sketched below.
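
For example, something along these lines (a sketch only, with an illustrative gang count and an abbreviated private list; you would keep your full private and reduction clauses):
Code:

!$acc parallel num_gangs(1024) vector_length(256)
!$acc loop gang private(PFT_TMP, DIST, NORMAL, TANGENT)
      DO LL=1,MAX_PIP
!        ... loop body as before ...
      ENDDO
!$acc end parallel

With a fixed num_gangs, the runtime needs one private copy per gang (1,024 here) instead of one per outer-loop iteration, which puts a fixed bound on the private-array memory.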

- Mat