PGI User Forum

Strange results of profiling OpenACC code by VISUAL profiler
 
IrinaDDD



Joined: 24 Jan 2012
Posts: 12

Posted: Wed Dec 19, 2012 2:34 am    Post subject: Strange results of profiling OpenACC code by VISUAL profiler

Hello,

I am comparing CUDA and OpenACC versions of my code and have profiled both with the CUDA Visual Profiler.
I have tried to make the two versions as close as possible, but I still get different profiling results.

Here is the profiling result for the CUDA code:
[img=http://s11.postimage.org/b9o9ltizz/Screen_Shot_2012_12_19_at_6_11_43_PM.jpg]

And here is the result for the OpenACC code:
[img=http://s11.postimage.org/f8qj1rdyn/Screen_Shot_2012_12_19_at_6_03_24_PM.jpg]

Could you please explain where these small data copy calls (thin blue lines before and after each kernel) come from?

My OpenACC code looks like this:

Code:


    !$acc data create (hvx, hvy, hvz, grdx, grdy, grdz), &
    !$acc      copyin (vx, vy, vz, h), &
    !$acc      copyout (dh, dvx, dvy, dvz), &
    !$acc      create (scl, omega)

    ! first kernel

    !$acc kernels loop gang vector(4) create (depth), present (CNST_EGRAV, GRD_zs, ADM_VNONE)
    do l = 1, ADM_lall
       !$acc loop gang vector(128)
       do n = 1, ADM_gall
          scl(n,k,l) = &
               -( CNST_EGRAV*(h(n,k,l))       &
               +0.5D0*( vx(n,k,l)*vx(n,k,l)   &
               +vy(n,k,l)*vy(n,k,l)           &
               +vz(n,k,l)*vz(n,k,l) ) )
          depth = h(n,k,l) - GRD_zs(n,k,l,ADM_VNONE)
          hvx(n,k,l) = depth*vx(n,k,l)
          hvy(n,k,l) = depth*vy(n,k,l)
          hvz(n,k,l) = depth*vz(n,k,l)
       end do
    end do
    !$acc end kernels
    !$acc update host(scl)

    ! Other kernels
    !$acc end data

Thank you,

Irina.
mkcolg



Joined: 30 Jun 2004
Posts: 6211
Location: The Portland Group Inc.

Posted: Wed Dec 19, 2012 10:17 am

Hi Irina,

Sorry, but I'm not familiar with the CUDA Visual Profiler, so I don't know what the different colors correspond to.

Quote:
Could you please explain where these small data copy calls (thin blue lines before and after each kernel) come from?
Do you mean the thin green lines? I only see one thin blue line around the 144000 mark.

Before the kernel is launched, there will be some overhead in looking up the addresses of the variables in the "present" clause as well as in creating the global memory for "depth". Also, the compiler may be copying the arguments as a separate struct in order to work around CUDA's argument size limit.

Note that it is unnecessary to copy scalar variables, and in some cases it can be detrimental. For example, by putting "depth" in a create clause, you have made it a global variable. Besides the performance hit of not using a register, all threads will share the same "depth" variable, which will most likely give you wrong answers.

How does your profile change after removing scalar variables from the various copy, create, and present clauses?

- Mat
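
Mat's point about scalars can be illustrated with a minimal sketch of the first kernel in which "depth" appears in no data clause and is instead marked private on the inner loop, so each iteration gets its own copy. This is not the poster's actual routine: the module name flux_sketch, the subroutine compute_fluxes, the argument list, the surface-height array zs (standing in for GRD_zs(:,k,:,ADM_VNONE)) and the dropped k dimension are all assumptions made to keep the example self-contained.

```fortran
module flux_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Hypothetical stand-in for the first kernel discussed in the thread.
  ! Names and the 2-D array shapes are assumptions, not the poster's interface.
  subroutine compute_fluxes(gall, lall, egrav, h, zs, vx, vy, vz, &
                            scl, hvx, hvy, hvz)
    integer,  intent(in)  :: gall, lall
    real(dp), intent(in)  :: egrav            ! scalar: needs no data clause
    real(dp), intent(in)  :: h(gall, lall), zs(gall, lall)
    real(dp), intent(in)  :: vx(gall, lall), vy(gall, lall), vz(gall, lall)
    real(dp), intent(out) :: scl(gall, lall)
    real(dp), intent(out) :: hvx(gall, lall), hvy(gall, lall), hvz(gall, lall)
    integer  :: n, l
    real(dp) :: depth   ! scalar temporary, deliberately absent from data clauses

    ! Only arrays are listed in the data region.
    !$acc data copyin(h, zs, vx, vy, vz) copyout(scl, hvx, hvy, hvz)
    !$acc kernels loop
    do l = 1, lall
       ! private(depth) gives every iteration its own copy of the scalar
       !$acc loop private(depth)
       do n = 1, gall
          scl(n,l) = -( egrav*h(n,l)                 &
                      + 0.5_dp*( vx(n,l)*vx(n,l)     &
                               + vy(n,l)*vy(n,l)     &
                               + vz(n,l)*vz(n,l) ) )
          depth    = h(n,l) - zs(n,l)
          hvx(n,l) = depth*vx(n,l)
          hvy(n,l) = depth*vy(n,l)
          hvz(n,l) = depth*vz(n,l)
       end do
    end do
    !$acc end data
  end subroutine compute_fluxes
end module flux_sketch
```

Without OpenACC enabled the directives are plain comments, so the same source can be compiled and checked on the host first. Whether the explicit private clause is needed depends on the compiler; it is written out here only to make the intent visible.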
IrinaDDD




Posted: Wed Dec 19, 2012 8:39 pm

Dear Mat,
Thank you for the explanation.

I am sorry for not describing the traces in detail.
Here is the trace with some of my comments. There are thin green lines before each kernel (for example, 6 green lines around point 140150), which are what I was asking about.

[img=http://s10.postimage.org/qkid0vjsl/image.jpg]

Following your advice, I deleted all data regions and all copy, create, and present clauses, and created a new trace for just one kernel:

[img=http://s8.postimage.org/jihrh8w0x/image.jpg]

This trace also has thin green lines (for example, at point 3758), which I am trying to understand.

Code:

    !$acc kernels loop
    do l=1,ADM_lall
       do n =1, ADM_gall
          scl(n,k,l)=&
               -( CNST_EGRAV*(h(n,k,l))          &
               +0.5D0*( vx(n,k,l)*vx(n,k,l)    &
               +vy(n,k,l)*vy(n,k,l)    &
               +vz(n,k,l)*vz(n,k,l) ) )
          depth=h(n,k,l)-GRD_zs(n,k,l,ADM_VNONE)
          hvx(n,k,l)=depth*vx(n,k,l)
          hvy(n,k,l)=depth*vy(n,k,l)
          hvz(n,k,l)=depth*vz(n,k,l)
       end do
    end do


OpenACC compiler output:

Code:

    406, Generating copyin(vz(:adm_gall,:1,:adm_lall))
         Generating copyin(vy(:adm_gall,:1,:adm_lall))
         Generating copyin(vx(:adm_gall,:1,:adm_lall))
         Generating copyin(h(:adm_gall,:1,:adm_lall))
         Generating copyout(scl(1:adm_gall,1,1:adm_lall))
         Generating copyin(grd_zs(1:adm_gall,1,1:adm_lall,1))
         Generating copyout(hvx(1:adm_gall,1,1:adm_lall))
         Generating copyout(hvy(1:adm_gall,1,1:adm_lall))
         Generating copyout(hvz(1:adm_gall,1,1:adm_lall))
    407, Loop is parallelizable
    408, Loop is parallelizable
         Accelerator kernel generated
        407, !$acc loop gang, vector(8) ! blockidx%y threadidx%y
        408, !$acc loop gang, vector(8) ! blockidx%x threadidx%x


Thank you,

Irina
IrinaDDD




Posted: Thu Dec 20, 2012 2:27 am

In the previous trace (without a data region) there were 5 thin green lines in total.
Then, when I added a data region to the code, the number of thin green lines became 6 (the lines before point 3276).

[img=http://s10.postimage.org/byo68joxh/Screen_Shot_2012_12_20_at_6_17_27_PM.jpg]


Code:

    !$acc data copyin (vx, vy, vz, GRD_zs), copyout (hvx, hvy, hvz)
    !$acc kernels loop
    do l=1,ADM_lall
       do n =1, ADM_gall
          scl(n,k,l)=&
               -( CNST_EGRAV*(h(n,k,l))          &
               +0.5D0*( vx(n,k,l)*vx(n,k,l)    &
               +vy(n,k,l)*vy(n,k,l)    &
               +vz(n,k,l)*vz(n,k,l) ) )
          depth=h(n,k,l)-GRD_zs(n,k,l,ADM_VNONE)
          hvx(n,k,l)=depth*vx(n,k,l)
          hvy(n,k,l)=depth*vy(n,k,l)
          hvz(n,k,l)=depth*vz(n,k,l)
       end do
    end do
    !$acc end data


Could this be because the compiler is copying some additional information about the arrays I copy to the GPU?


Thank you,

Best regards,

Irina
mkcolg




Posted: Thu Dec 20, 2012 12:04 pm

Hi Irina,

My best guess is that these are the F90 array descriptors. We currently send this information separately from the data, though we are looking at consolidating it as well as making these copies asynchronous.

- Mat
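
To make the idea of an array descriptor more concrete, here is a rough sketch of the kind of per-array metadata (rank, bounds, extents, strides) that a Fortran runtime has to know in order to index an array. The type array_info and the function describe3 are purely illustrative names. This is not the PGI runtime's actual descriptor layout, which the thread does not describe, and the strides are computed on the assumption that the array is stored contiguously.

```fortran
module descriptor_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: max_rank = 7   ! large enough for this sketch

  ! Hypothetical descriptor record; NOT the real PGI layout.
  type :: array_info
     integer :: rank             = 0
     integer :: lower(max_rank)  = 0   ! lower bound per dimension
     integer :: extent(max_rank) = 0   ! number of elements per dimension
     integer :: stride(max_rank) = 0   ! element stride per dimension
  end type array_info
contains
  ! Collect the metadata of a rank-3 array using standard inquiry intrinsics.
  function describe3(a) result(info)
    real(dp), intent(in) :: a(:,:,:)
    type(array_info)     :: info
    integer :: d

    info%rank = 3
    do d = 1, 3
       info%lower(d)  = lbound(a, d)
       info%extent(d) = size(a, d)
    end do

    ! Column-major strides, assuming contiguous storage.
    info%stride(1) = 1
    do d = 2, 3
       info%stride(d) = info%stride(d-1) * info%extent(d-1)
    end do
  end function describe3
end module descriptor_sketch
```

A record like this is only a handful of integers per array, far smaller than the array data itself. That would be consistent with the very short transfers in the traces, one per array, if Mat's guess is right.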
All times are GMT - 7 Hours
Page 1 of 2

 