IrinaDDD
Joined: 24 Jan 2012 Posts: 12
|
Posted: Wed Dec 19, 2012 2:34 am Post subject: Strange results of profiling OpenACC code by VISUAL profiler |
|
|
Hello,
I am comparing the CUDA and OpenACC versions of my code and have tried to profile both with the CUDA Visual Profiler.
I have tried to make the two codes as close as possible, but I am still getting different profiling results.
Here is the profiling result for the CUDA code:
[img=http://s11.postimage.org/b9o9ltizz/Screen_Shot_2012_12_19_at_6_11_43_PM.jpg]
And here is the one for the OpenACC code:
[img=http://s11.postimage.org/f8qj1rdyn/Screen_Shot_2012_12_19_at_6_03_24_PM.jpg]
Could you please explain where these small data copy calls (thin blue lines before and after each kernel) come from?
My OpenACC code looks like this:
| Code: |
!$acc data create( hvx, hvy, hvz, grdx, grdy, grdz), &
!$acc copyin (vx,vy, vz, h) , &
!$acc copyout (dh,dvx, dvy, dvz), &
!$acc create (scl, omega)
! first kernel
!$acc kernels loop gang vector(4) create (depth), present (CNST_EGRAV, GRD_zs, ADM_VNONE)
do l=1,ADM_lall
  !$acc loop gang vector(128)
  do n =1, ADM_gall
    scl(n,k,l)=&
      -( CNST_EGRAV*(h(n,k,l)) &
      +0.5D0*( vx(n,k,l)*vx(n,k,l) &
      +vy(n,k,l)*vy(n,k,l) &
      +vz(n,k,l)*vz(n,k,l) ) )
    depth=h(n,k,l)-GRD_zs(n,k,l,ADM_VNONE)
    hvx(n,k,l)=depth*vx(n,k,l)
    hvy(n,k,l)=depth*vy(n,k,l)
    hvz(n,k,l)=depth*vz(n,k,l)
  end do
  !$acc end kernels
  !$acc update host(scl)
end do
!Other kernels
!$acc end data
|
Thank you,
Irina. |
|
mkcolg
Joined: 30 Jun 2004 Posts: 4996 Location: The Portland Group Inc.
|
Posted: Wed Dec 19, 2012 10:17 am Post subject: |
|
|
Hi Irina,
Sorry, but I'm not familiar with the CUDA Visual Profiler, so I don't know what the different colors correspond to.
| Quote: |
Could you please explain where these small data copy calls (thin blue lines before and after each kernel) come from?
|
Do you mean the thin green lines? I only see one thin blue line around the 144000 mark.
Before the kernel is launched, there will be some overhead in looking up the addresses of the variables in the "present" clause, as well as in creating the global memory for "depth". Also, the compiler may be copying the arguments as a separate struct in order to work around CUDA's argument size limit.
Note that it is unnecessary to copy scalar variables, and in some cases it can be detrimental. For example, by putting "depth" in a create clause, you have made it a global variable. Besides the performance hit of not using a register, all threads will be sharing the same "depth" variable, which will most likely give you wrong answers.
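As a minimal sketch of the point about "depth", the reduced program below keeps the scalar out of every data clause and marks it private on the inner loop, so each thread gets its own copy. The array sizes, the initial values, and the names ngall, nlall, egrav, and zs are assumptions made only for this illustration; they stand in for ADM_gall, ADM_lall, CNST_EGRAV, and GRD_zs and are not taken from the original code.

```fortran
program private_scalar_sketch
  implicit none
  ! Hypothetical sizes and constants, chosen only for illustration.
  integer, parameter :: ngall = 256, nlall = 4, k = 1
  double precision, parameter :: egrav = 9.80616d0   ! stand-in for CNST_EGRAV
  double precision :: h(ngall,1,nlall), zs(ngall,1,nlall)
  double precision :: vx(ngall,1,nlall), hvx(ngall,1,nlall), scl(ngall,1,nlall)
  double precision :: depth
  integer :: n, l

  h = 10.0d0
  zs = 1.0d0
  vx = 2.0d0

  ! Only arrays appear in data clauses; the scalar "depth" does not.
  !$acc data copyin(h, zs, vx) copyout(hvx, scl)
  !$acc kernels loop
  do l = 1, nlall
     ! private(depth) gives every thread its own copy of the scalar.
     !$acc loop private(depth)
     do n = 1, ngall
        scl(n,k,l) = -( egrav*h(n,k,l) + 0.5d0*vx(n,k,l)*vx(n,k,l) )
        depth = h(n,k,l) - zs(n,k,l)
        hvx(n,k,l) = depth*vx(n,k,l)
     end do
  end do
  !$acc end data

  ! With the values above, depth is 9 and hvx is 18 everywhere.
  print *, 'hvx(1,1,1) =', hvx(1,k,1)
end program private_scalar_sketch
```

Without OpenACC enabled the directives are treated as comments, so the same source also runs as plain Fortran, which makes it easy to compare host and accelerator results.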
How does your profile change after removing scalar variables from the various copy, create, and present clauses?
- Mat |
|
IrinaDDD
Joined: 24 Jan 2012 Posts: 12
|
Posted: Wed Dec 19, 2012 8:39 pm Post subject: |
|
|
Dear Mat,
Thank you for the explanation.
I am sorry for not describing the traces in detail.
Here is the trace with some of my comments. There are thin green lines before each kernel (for example, 6 green lines around point 140150), which are what I was asking about.
[img=http://s10.postimage.org/qkid0vjsl/image.jpg]
Following your advice, I removed all data regions and all copy, create, and present clauses, and created a new trace for just 1 kernel:
[img=http://s8.postimage.org/jihrh8w0x/image.jpg]
This trace also has thin green lines (for example at point 3758), which I am trying to understand.
code:
| Code: |
!$acc kernels loop
do l=1,ADM_lall
  do n =1, ADM_gall
    scl(n,k,l)=&
      -( CNST_EGRAV*(h(n,k,l)) &
      +0.5D0*( vx(n,k,l)*vx(n,k,l) &
      +vy(n,k,l)*vy(n,k,l) &
      +vz(n,k,l)*vz(n,k,l) ) )
    depth=h(n,k,l)-GRD_zs(n,k,l,ADM_VNONE)
    hvx(n,k,l)=depth*vx(n,k,l)
    hvy(n,k,l)=depth*vy(n,k,l)
    hvz(n,k,l)=depth*vz(n,k,l)
  end do
end do
|
OpenACC compiler output:
| Code: |
406, Generating copyin(vz(:adm_gall,:1,:adm_lall))
     Generating copyin(vy(:adm_gall,:1,:adm_lall))
     Generating copyin(vx(:adm_gall,:1,:adm_lall))
     Generating copyin(h(:adm_gall,:1,:adm_lall))
     Generating copyout(scl(1:adm_gall,1,1:adm_lall))
     Generating copyin(grd_zs(1:adm_gall,1,1:adm_lall,1))
     Generating copyout(hvx(1:adm_gall,1,1:adm_lall))
     Generating copyout(hvy(1:adm_gall,1,1:adm_lall))
     Generating copyout(hvz(1:adm_gall,1,1:adm_lall))
407, Loop is parallelizable
408, Loop is parallelizable
     Accelerator kernel generated
     407, !$acc loop gang, vector(8) ! blockidx%y threadidx%y
     408, !$acc loop gang, vector(8) ! blockidx%x threadidx%x
|
Thank you,
Irina |
|
IrinaDDD
Joined: 24 Jan 2012 Posts: 12
|
Posted: Thu Dec 20, 2012 2:27 am Post subject: |
|
|
In the previous trace (without a data region) there were 5 thin green lines in total.
When I added a data region to the code, the number of thin green lines became 6 (the lines before point 3276):
[img=http://s10.postimage.org/byo68joxh/Screen_Shot_2012_12_20_at_6_17_27_PM.jpg]
| Code: |
!$acc data copyin (vx, vy, vz, GRD_zs), copyout (hvx, hvy, hvz)
!$acc kernels loop
do l=1,ADM_lall
  do n =1, ADM_gall
    scl(n,k,l)=&
      -( CNST_EGRAV*(h(n,k,l)) &
      +0.5D0*( vx(n,k,l)*vx(n,k,l) &
      +vy(n,k,l)*vy(n,k,l) &
      +vz(n,k,l)*vz(n,k,l) ) )
    depth=h(n,k,l)-GRD_zs(n,k,l,ADM_VNONE)
    hvx(n,k,l)=depth*vx(n,k,l)
    hvy(n,k,l)=depth*vy(n,k,l)
    hvz(n,k,l)=depth*vz(n,k,l)
  end do
end do
!$acc end data
|
Could this be because the compiler is copying some additional information about the arrays I copy to the GPU?
Thank you,
Best regards,
Irina |
|
mkcolg
Joined: 30 Jun 2004 Posts: 4996 Location: The Portland Group Inc.
|
Posted: Thu Dec 20, 2012 12:04 pm Post subject: |
|
|
Hi Irina,
My best guess is that these are the F90 array descriptors. We currently send this information separately from the data, though we are looking at consolidating this as well as making these copies asynchronous.
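For readers unfamiliar with the term, the small sketch below shows the kind of per-array metadata (shape and bounds) that a Fortran array descriptor has to carry alongside the data itself. It uses only standard intrinsics; the program and subroutine names are hypothetical, and the actual descriptor layout is compiler-internal, so this says nothing about how the PGI compiler packs or transfers it.

```fortran
program descriptor_sketch
  implicit none
  double precision :: a(0:9, 3)

  a = 0.0d0
  ! Pass a strided section; the callee still sees a well-formed 2-D array.
  call show(a(2:8:2, :))

contains

  subroutine show(x)
    ! Assumed-shape dummy: shape and bounds arrive with the argument,
    ! not as separate explicit parameters.
    double precision, intent(in) :: x(:,:)
    print *, 'shape  =', shape(x)    ! extents per dimension: 4 and 3
    print *, 'lbound =', lbound(x)   ! lower bounds seen in the callee: 1 and 1
    print *, 'ubound =', ubound(x)   ! upper bounds seen in the callee: 4 and 3
    print *, 'size   =', size(x)     ! total element count: 12
  end subroutine show

end program descriptor_sketch
```

This metadata is tiny compared with the array data, which would be consistent with very short transfers appearing in a timeline next to the larger array copies.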
- Mat |
|