PGI User Forum

No array assignment replaced by call to pgf90_mcopy4 in 10.2
Forum Index -> Accelerator Programming
xray



Joined: 21 Jan 2010
Posts: 85

Posted: Tue Mar 09, 2010 2:23 am    Post subject: No array assignment replaced by call to pgf90_mcopy4 in 10.2

Hello,

I compiled a small accelerator application with PGI 10.2 (Fortran, on Linux, NVIDIA GeForce GT 220) and found that it runs slower than when built with PGI 10.1.

I figured out that this is because of a whole-array assignment (in an accelerator compute region) which PGI 10.1 replaced by an internal routine ("Memory copy idiom, array assignment replaced by call to pgf90_mcopy4"), whereas PGI 10.2 generates an extra kernel with a loop for it ("Loop is parallelizable, Accelerator kernel generated, 70, !$acc do parallel, vector(16)"). This extra kernel takes 20% of my application's total GPU time (the internal routine used only 3-5%).
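The pattern in question is just a whole-array assignment inside a compute region, schematically like this (a and b stand in for the real arrays):

```fortran
!$acc region
      ! 10.1: "Memory copy idiom, array assignment replaced by
      !        call to pgf90_mcopy4"
      ! 10.2: generates a separate GPU kernel for this line instead
      a = b
      ! ... rest of the compute region ...
!$acc end region
```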

I also skimmed the list of bug fixes in PGI 10.3, but couldn't find anything about this issue.

Does anyone know anything about this?
mkcolg



Joined: 30 Jun 2004
Posts: 6208
Location: The Portland Group Inc.

Posted: Tue Mar 09, 2010 8:52 am

Hi Xray,

While I don't know specifics, I can guess as to what's going on.

One complaint we have had about the Accelerator model is that it was too restrictive. If the programmer puts a region of code within the directives, the compiler should make its best attempt to offload that code to the GPU. (Previously it would exclude sections of code that might not benefit from acceleration.) In cases where idiom recognition inhibits acceleration, the idiom recognition should be disabled. While I'm not positive, this sounds like a change that would have occurred around the 10.2 release.

What I would want to know is why this extra kernel is taking longer. Does the code need to perform an extra copy? If so, can you use data regions to keep the data on the GPU? Should this section of code be left on the host? Can you use the "host" clause to tell the compiler to keep it on the host, or add an "!$acc end region"/"!$acc region" pair before and after this section?
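For example, the end region/region split would look roughly like this (a sketch only, with a = b standing in for the offending assignment):

```fortran
!$acc region
      ! ... first part of the accelerated code ...
!$acc end region

      ! the whole-array copy now executes on the host
      a = b

!$acc region
      ! ... remainder of the accelerated code ...
!$acc end region
```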

Granted, it could just be a bug. Feel free to send in a report to PGI Customer Service (trs@pgroup.com) and include sample code.

Hope this helps,
Mat
bbierbaum



Joined: 19 Jan 2010
Posts: 3

Posted: Wed Mar 10, 2010 6:23 am

Hi,

I'm working with Xray on this problem. The code is a simple Jacobi solver example. We built versions with 10.1 and 10.2 and used NVIDIA's CUDA profiler to see what's going on. It seems the main compute kernel (jacobi_72_gpu) takes much longer with 10.2, for reasons we don't know, and 10.2 also produces the additional kernel and additional memcpy calls. Here are the GPU times as reported by the profiler:

10.1:

Method             #calls   GPU usecs
jacobi_72_gpu          20   716924
jacobi_72_gpu_red      20   398.529
memcpyHtoD             62   80201.9
memcpyDtoH             21   39726.9

10.2:

Method             #calls   GPU usecs
jacobi_72_gpu          20   1.34378e+06
jacobi_66_gpu          20   258345
jacobi_72_gpu_red      20   398.432
memcpyHtoD             82   81118.1
memcpyDtoH             21   39487.6

This is the code:


Code:

!$acc data region local(uold) copyin(afF) copy(afU)
            do while (iIterCount < iIterMax .and. residual > fTolerance)
                residual = 0.0d0

!$acc region       

                ! Copy new solution into old
                uold = afU

!$acc do parallel private(j)
                  ! Compute stencil, residual, & update
                   do j = 1, iRows - 2
!$acc do private (i,fLRes) vector(256)
                       do i = 1, iCols - 2
                           ! Evaluate residual
                           fLRes = (ax * (uold(i-1, j) + uold(i+1, j)) &
                                  + ay * (uold(i, j-1) + uold(i, j+1)) &
                                  + b * uold(i, j) - afF(i, j)) / b
                   
                           ! Update solution
                           afU(i, j) = uold(i, j) - fRelax * fLRes
                   
                           ! Accumulate residual error
                           residual = residual + fLRes * fLRes
                       end do
                   end do
!$acc end region       

                 ! Error check
                 iIterCount = iIterCount + 1     
                 residual = SQRT(residual) / REAL(iCols * iRows)
             
            ! End iteration loop
            end do
!$acc end data region


So we are losing performance here when going from 10.1 to 10.2. Are we doing anything too strange for the compiler?

Thanks for your help!
Boris
mkcolg



Joined: 30 Jun 2004
Posts: 6208
Location: The Portland Group Inc.

Posted: Wed Mar 10, 2010 11:39 am

Hi Boris,

The difference between the code generated by 10.1 and 10.2 is that in 10.1 the copy statement "uold = afU" was performed on the host, while in 10.2 it has been moved to the GPU. Also, to avoid wrong answers, the 10.1 compiler needed to copy uold and afU between host and device for each iteration of the loop.

In other words, 10.2 is correctly matching what you have written. It just happens that the 10.1 method of performing the mcopy on the host was better for your code (I'm assuming the arrays are fairly small). To recreate the 10.1 behavior, try removing the "data region" and moving the copy out of the "acc region":
Code:

!acc data region local(uold) copyin(afF) copy(afU)
            do while (iIterCount < iIterMax .and. residual > fTolerance)
                residual = 0.0d0

                ! Copy new solution into old
                uold = afU

!$acc region       
                  ! Compute stencil, residual, & update
                   do j = 1, iRows - 2
!$acc do vector(256)
                       do i = 1, iCols - 2
                           ! Evaluate residual
                           fLRes = (ax * (uold(i-1, j) + uold(i+1, j)) &
                                  + ay * (uold(i, j-1) + uold(i, j+1)) &
                                  + b * uold(i, j) - afF(i, j)) / b
                   
                           ! Update solution
                           afU(i, j) = uold(i, j) - fRelax * fLRes
                   
                           ! Accumulate residual error
                           residual = residual + fLRes * fLRes
                       end do
                   end do
!$acc end region       

                 ! Error check
                 iIterCount = iIterCount + 1     
                 residual = SQRT(residual) / REAL(iCols * iRows)
             
            ! End iteration loop
            end do
!acc end data region


Side note: scalar variables are private by default, so you don't need the private clauses. They don't hurt, but they aren't necessary.

Hope this helps,
Mat
bbierbaum



Joined: 19 Jan 2010
Posts: 3

Posted: Thu Mar 11, 2010 2:18 am

Hi Mat,

Thanks for your help. I think I now understand what's going on here, but I still can't get the same performance with 10.2 that I had with 10.1. The matrices are 5000x5000 single precision, which makes them ~95 MiB each. I tried out your suggestions:

The original code reaches 3500 MFlops with 10.2 (4200 MFlops with 10.1). Removing the data region lowers performance to 1300 MFlops. The profiler clearly shows the reason: the program now spends almost 70% of its time copying data between host and device memory, which is expected, because it now has to copy the matrices in each loop iteration. This is why we put the data region around the outer do-while loop in the first place.

Additionally moving the copy out of the compute region, giving the code you posted, raises performance slightly to ~1400 MFlops. This can be attributed to the removal of the extra copy kernel and to fewer data copies between host and device (why is that?).

Leaving the data region in but moving the copy out of the compute region gives us around 3000 MFlops. As far as I understand, this should make 10.2 do the same thing 10.1 did with the original code, but we still get significantly less performance.

Comparing the profile of this version under 10.2 with the original version under 10.1, the graphical "GPU time height" plot looks similar, but under 10.2 the main compute kernel (not the one created for the reduction) takes much longer than the data movement operations, whereas under 10.1 it is the other way around. With 10.1 the kernel runs in around 720000 usecs, while it needs about twice as long when compiled with 10.2. I don't understand why; the compiler messages look identical: same loop schedules, same size of cached references. Do you have an explanation for this?
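For reference, this best-performing 10.2 variant keeps the data region and only hoists the copy (abbreviated; the loop nest is unchanged from the code posted above):

```fortran
!$acc data region local(uold) copyin(afF) copy(afU)
            do while (iIterCount < iIterMax .and. residual > fTolerance)
                residual = 0.0d0

                ! copy hoisted out of the compute region
                uold = afU

!$acc region
                ! ... stencil, residual, and update loops as before ...
!$acc end region

                iIterCount = iIterCount + 1
                residual = SQRT(residual) / REAL(iCols * iRows)
            end do
!$acc end data region
```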

Thanks again!
Boris
Page 1 of 2

 
