PGI User Forum
Strangely long loop execution time
szczelba



Joined: 29 Jun 2010
Posts: 26

Posted: Wed Jan 26, 2011 5:52 am    Post subject: Strangely long loop execution time

Hello everybody,

I've got a procedure that I want to execute partially on the GPU. I've already successfully ported some loops, taking care of copying the essential arrays.
After porting to the GPU, one loop takes approximately 25 times longer than it does on the CPU. The CPU code looks like this:


Code:
        do kk=1,igfy
          hmatrix(kk,igfy)=zero
          do nbl=1,nblcks
            do n=ijklim(nbl,1),nijkpr(nbl)
              ijk=ijkpr(n)
              hmatrix(kk,igfy)=hmatrix(kk,igfy)+
     &              vvect(ijk,igfyp1)*vvect(ijk,kk)
            enddo
          enddo
          do nbl=1,nblcks
            do n=ijklim(nbl,1),nijkpr(nbl)
              ijk=ijkpr(n)
              vvect(ijk,igfyp1)=vvect(ijk,igfyp1)
     &                    -hmatrix(kk,igfy)*vvect(ijk,kk)
            enddo
          enddo
        enddo


To make it run on the GPU, I changed the code like this:

Code:

(630) !$acc region local(kk)
(631)       do kk=1,igfy
(632)         hmatrix(kk,igfy)=zero
(633)       enddo
(634)
(635)       do nbl=1,nblcks
(636)         do kk=1,igfy
(637)           do ijk=imoj4,imoj5
(638)             hmatrix(kk,igfy)=hmatrix(kk,igfy)+
(639)      &              vvect(ijk,igfyp1)*vvect(ijk,kk)
(640)           enddo
(641)         enddo
(642)       enddo
(643)
(644)       do nbl=1,nblcks
(645)         do ijk=imoj4,imoj5
(646)           do kk=1,igfy
(647)             vvect(ijk,igfyp1)=vvect(ijk,igfyp1)
(648)      &                    -hmatrix(kk,igfy)*vvect(ijk,kk)
(649)           enddo
(650)         enddo
(651)       enddo
(652) !$acc end region


Profile output:

Quote:
630: region entered 3990 times
        time(us): total=74000000
                  kernels=36032074 data=49510
        631: kernel launched 3990 times
            grid: [1]  block: [256]
            time(us): total=28628 max=140 min=4 avg=7
        637: kernel launched 3990 times
            grid: [1]  block: [32]
            time(us): total=35830347 max=11623 min=6040 avg=8980
        646: kernel launched 3990 times
            grid: [72]  block: [256]
            time(us): total=173099 max=131 min=15 avg=43


I've also tried switching lines 636 and 637:

Code:
         do ijk=imoj4,imoj5
           do kk=1,igfy

but with the same results.

Why can the loop on line 637 take ~200 times longer than the loop on line 646? Any ideas?

Thanks!
mkcolg



Joined: 30 Jun 2004
Posts: 6208
Location: The Portland Group Inc.

Posted: Fri Jan 28, 2011 4:51 pm

Hi szczelba,

Notice the actual schedule used for each kernel in the profile output. The loop at line 637 uses only a single block with 32 threads. This is a very poor schedule, since it uses only a very small portion of your GPU. The loop at line 646 uses 72 blocks of 256 threads each. This is much better, and it shows in the performance.

Can you please post the output when you compile with "-Minfo=accel"?

Thanks,
Mat
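
As a rough illustration of the schedule comparison above, the following sketch multiplies the grid and block sizes reported in the profile output to get the number of threads each kernel launches. The names `schedule_mod` and `total_threads` are made up for this example; only the grid and block values come from the profile.

```fortran
! Sketch only: threads launched by a kernel = grid blocks * threads per block.
module schedule_mod
  implicit none
contains
  ! Hypothetical helper, not part of any PGI API
  pure integer function total_threads(grid, block)
    integer, intent(in) :: grid, block
    total_threads = grid * block
  end function total_threads
end module schedule_mod

program compare_schedules
  use schedule_mod
  implicit none
  integer :: t637, t646

  t637 = total_threads(1, 32)     ! line 637: grid [1],  block [32]
  t646 = total_threads(72, 256)   ! line 646: grid [72], block [256]

  print '(a,i0)', 'threads for kernel at line 637: ', t637
  print '(a,i0)', 'threads for kernel at line 646: ', t646
  print '(a,i0)', 'ratio (646 / 637):              ', t646 / t637
end program compare_schedules
```

With the values from the first profile, the kernel at line 646 launches 576 times as many threads as the kernel at line 637.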
szczelba



Joined: 29 Jun 2010
Posts: 26

Posted: Wed Feb 16, 2011 4:53 am

Sorry for the long delay. I'm back on this topic.
I've compiled the program with -Minfo=accel, but no additional accelerator information showed up. So I still get:

on compilation:
Code:
    635, Parallelization would require privatization of array 'hmatrix(1:igfy,igfy)'
    636, Loop carried dependence due to exposed use of 'hmatrix(1:igfy,igfy)' prevents parallelization
    637, Loop is parallelizable
         Accelerator kernel generated
        635, !$acc do seq
             Non-stride-1 accesses for array 'vvect'
        636, !$acc do seq
             Cached references to size [32] block of 'vvect'
        637, !$acc do parallel, vector(32)
             Using register for 'hmatrix'
             CC 1.3 : 18 registers; 276 shared, 188 constant, 0 local memory bytes; 25 occupancy


after execution:
Code:
630: region entered 4080 times
        time(us): total=6000000
                  kernels=1606521 data=49558
        631: kernel launched 4080 times
            grid: [1]  block: [256]
            time(us): total=24870 max=118 min=5 avg=6
        637: kernel launched 4080 times
            grid: [1]  block: [32]
            time(us): total=1548237 max=533 min=275 avg=379
        646: kernel launched 4080 times
            grid: [4]  block: [256]
            time(us): total=33414 max=90 min=4 avg=8


I still don't have any clue how to fix it. Adding "!$acc do parallel, vector(256)" does change the vector size to 256, but the calculation does not speed up at all.

Help, please. I'm under time pressure.
mkcolg



Joined: 30 Jun 2004
Posts: 6208
Location: The Portland Group Inc.

Posted: Wed Feb 16, 2011 11:30 am

Hi szczelba,

The informational messages indicate that the compiler reordered the loops so that 637 is the outermost and 635 and 636 are executed sequentially within the generated kernel. It's a reasonable strategy: it allows the code to be parallelized and takes advantage of shared memory. The caveat would be if 637's trip count is small.

What are the values for 'igfy', 'nblcks', 'imoj4', and 'imoj5'?

- Mat
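
The reordering described above can be sketched on the CPU as follows. This is only an illustration under assumptions: the array sizes are small made-up values, `h` stands in for the column `hmatrix(1:igfy,igfy)`, `igfyp1` is taken to be `igfy + 1`, and the names `loop_order_mod`, `sample_vvect`, `sums_source_order`, and `sums_reordered` are hypothetical. It runs sequentially and does not model how the generated kernel executes; it only shows the two loop orders and checks that they compute the same sums.

```fortran
module loop_order_mod
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains

  ! Fill an n-by-ncol array with small whole numbers so that all sums are exact.
  function sample_vvect(n, ncol) result(v)
    integer, intent(in) :: n, ncol
    real(dp) :: v(n, ncol)
    integer :: ijk, kk
    do kk = 1, ncol
      do ijk = 1, n
        v(ijk, kk) = real(mod(ijk + kk, 5), dp)
      end do
    end do
  end function sample_vvect

  ! Loop order as written at lines 635-637: nbl, then kk, then ijk.
  function sums_source_order(v, igfy, nblcks) result(h)
    real(dp), intent(in) :: v(:, :)
    integer, intent(in) :: igfy, nblcks
    real(dp) :: h(igfy)
    integer :: nbl, kk, ijk
    h = 0.0_dp
    do nbl = 1, nblcks
      do kk = 1, igfy
        do ijk = 1, size(v, 1)
          h(kk) = h(kk) + v(ijk, igfy + 1) * v(ijk, kk)  ! assumes igfyp1 = igfy + 1
        end do
      end do
    end do
  end function sums_source_order

  ! Loop order reported by -Minfo=accel: ijk outermost (the parallel loop),
  ! with the nbl and kk loops running sequentially inside it.
  ! Note that every ijk iteration adds into the same igfy elements of h.
  function sums_reordered(v, igfy, nblcks) result(h)
    real(dp), intent(in) :: v(:, :)
    integer, intent(in) :: igfy, nblcks
    real(dp) :: h(igfy)
    integer :: nbl, kk, ijk
    h = 0.0_dp
    do ijk = 1, size(v, 1)
      do nbl = 1, nblcks
        do kk = 1, igfy
          h(kk) = h(kk) + v(ijk, igfy + 1) * v(ijk, kk)
        end do
      end do
    end do
  end function sums_reordered

end module loop_order_mod

program loop_order_sketch
  use loop_order_mod
  implicit none
  real(dp) :: v(8, 4)

  v = sample_vvect(8, 4)   ! 8 rows, igfy = 3 columns plus column igfy + 1
  if (any(sums_source_order(v, 3, 1) /= sums_reordered(v, 3, 1))) then
    error stop 'loop orders disagree'
  end if
  print '(a)', 'both loop orders give the same sums'
end program loop_order_sketch
```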
szczelba



Joined: 29 Jun 2010
Posts: 26

Posted: Fri Feb 18, 2011 2:28 am

Hi,

igfy = 10
nblcks = 1
imoj4=3000
imoj5=19000

OK, so the compiler treated the loop on line 637 as the outermost? That loop was the outermost before, but then it was impossible to parallelize, so I had to switch the loops and put the 637 loop inside.
What about the loop on line 644? It seems similar, but it works fine.
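
For the values posted above, the trip counts can be worked out with a short sketch like the one below. The names `trip_count_mod`, `trip_count`, and `blocks_needed` are hypothetical. The block counts assume one thread per `ijk` iteration, which is an assumption made only for illustration; the compiler may schedule the loop differently.

```fortran
module trip_count_mod
  implicit none
contains

  ! Number of iterations of "do i = lo, hi" (unit stride)
  pure integer function trip_count(lo, hi)
    integer, intent(in) :: lo, hi
    trip_count = max(hi - lo + 1, 0)
  end function trip_count

  ! Blocks needed to give each of n iterations its own thread (ceiling division)
  pure integer function blocks_needed(n, block_size)
    integer, intent(in) :: n, block_size
    blocks_needed = (n + block_size - 1) / block_size
  end function blocks_needed

end module trip_count_mod

program trip_count_sketch
  use trip_count_mod
  implicit none
  ! Values posted in the thread
  integer, parameter :: igfy = 10, nblcks = 1
  integer, parameter :: imoj4 = 3000, imoj5 = 19000
  integer :: n_ijk

  n_ijk = trip_count(imoj4, imoj5)
  print '(a,i0)', 'nbl trip count: ', trip_count(1, nblcks)
  print '(a,i0)', 'kk  trip count: ', trip_count(1, igfy)
  print '(a,i0)', 'ijk trip count: ', n_ijk
  print '(a,i0)', 'blocks of 32 threads:  ', blocks_needed(n_ijk, 32)
  print '(a,i0)', 'blocks of 256 threads: ', blocks_needed(n_ijk, 256)
end program trip_count_sketch
```

With these values the `ijk` loop has 16001 iterations, so its trip count is not small, while the `kk` loop has only 10.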
All times are GMT - 7 Hours
Page 1 of 2