PGI User Forum


OpenACC: Problem with present directive and module array
 
PGI User Forum Forum Index -> Accelerator Programming
mkcolg



Joined: 30 Jun 2004
Posts: 6141
Location: The Portland Group Inc.

Posted: Fri May 11, 2012 9:15 am

Hi Xavier,

Quote:
What are these memcpyHtoD calls corresponding to? Are they also related to section descriptors, as for OpenACC? Or could they be related to scalar parameters required in the kernel?
I doubt that it's section descriptors, since those are due to how "present" is currently implemented in the OpenACC API. Also, kernel parameters are copied as a struct argument; however, it's my understanding that the time it takes to copy the parameters is included in the kernel time and not broken out as "memcpyHtoD" in the CUDA profile.

My best guess is that there are some extra arrays being copied that are not part of a data region.

Quote:
Is there a way to get a list of what is being sent to the device? (-Minfo says nothing, apparently.)
Try setting the environment variable "NVDEBUG=1". This shows all calls made to the CUDA runtime, including memory copies. It will also show the names of the variables being copied. You can tell it's a section descriptor if the variable name is followed by "$sd".
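For instance, you might run the binary with the variable set and capture the trace from stderr (the application name and log file here are hypothetical):

```shell
# Hypothetical binary and log file; NVDEBUG=1 enables the PGI
# accelerator runtime trace, which is written to stderr.
NVDEBUG=1 ./myapp 2> nvdebug.log

# Section-descriptor copies show up as variable names ending in "$sd".
grep '\$sd' nvdebug.log
```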

- Mat
xlapillonne



Joined: 16 Feb 2011
Posts: 69

Posted: Mon May 14, 2012 2:47 am

Hi Mat,

Thanks for the tip. I tried running with NVDEBUG=1. Here is the output between two kernels where I don't expect any data transfer.

Code:

__pgi_cu_init( file=/project/s83/lapixa/COSMO_ICON_4.18_GPU_dev/src/soil_multilay.f90, function=terra_multlay, line=2164, startline=942, endline=5165 )
__pgi_cu_module3( lineno=2164 )
__pgi_cu_module3 module loaded at 0x7425770
__pgi_cu_module_function( name=0x117e4d0=terra_multlay_2165_gpu, lineno=2165, argname=0x0=, argsize=84, SWcachesize=0 )
Function handle is 0x74264a0
__pgi_cu_launch_a(func=0x74264a0, grid=19x1x1, block=256x1x1, lineno=2165)
__pgi_cu_launch_a(func=0x74264a0, params=0x7ffffb8e0150, bytes=84, sharedbytes=0)
First arguments are:
                          [ ... ]
__pgi_cu_close()
__pgi_cu_init( file=/project/s83/lapixa/COSMO_ICON_4.18_GPU_dev/src/soil_multilay.f90, function=terra_multlay, line=2183, startline=942, endline=5165 )
__pgi_cu_module3( lineno=2183 )
__pgi_cu_module3 module loaded at 0x7427690
__pgi_cu_module_function( name=0x11829c0=terra_multlay_2184_gpu, lineno=2184, argname=0x11829d7=a12, argsize=736, SWcachesize=0 )
Function handle is 0x7439740
__pgi_cu_uploadc( "a12", size=736, offset=0, lineno=2184 )
constant data a12 at address 0x20202200 devsize=736, size=736, offset=0
First arguments are:

                   [ ... ]   

  __pgi_cu_launch_a(func=0x7439740, grid=19x1x1, block=256x1x1, lineno=2184)
__pgi_cu_launch_a(func=0x7439740, params=0x7ffffb8e0150, bytes=0, sharedbytes=0)
__pgi_cu_close()



The NVIDIA profiler at this stage shows:
Code:

time stamp     Method                   GPU time   CPU time (us)
1.52021e+06    terra_multlay_2165_gpu    9.28      27
1.52062e+06    memcpyHtoD                1.184     45
1.5207e+06     terra_multlay_2184_gpu   25.152     33.152


I suppose the unexpected memcpyHtoD here is related to the "__pgi_cu_uploadc( "a12", size=736, offset=0, lineno=2184 )" call.
This a12 variable is apparently generated by the compiler; I suppose it may be related to some parameters. Would there be a way to place this in a data region?

Of course these times are not very high (45 us), but that is comparable to the kernel execution time. I have quite a few of them, and since some are called inside a loop, the total sum is not negligible in the end.

Thanks,

Xavier
mkcolg



Joined: 30 Jun 2004
Posts: 6141
Location: The Portland Group Inc.

Posted: Mon May 14, 2012 9:55 am

Hi Xavier,

I think I was mistaken before. In this case, it may actually be the parameter list to the routine. CUDA limits the argument list to 256 bytes, so for routines with larger argument lists, we create a struct with the arguments, copy the struct to the device, and then launch the kernel with a pointer to the struct as its argument.

Try keeping the generated GPU code (-ta=nvidia,keepgpu) and look at the top of the file for a struct named "a12". If it's there and it's the argument to your routine, then this is what's going on.

- Mat
xlapillonne



Joined: 16 Feb 2011
Posts: 69

Posted: Tue May 15, 2012 9:27 am

Hi Mat,

I looked in the .gpu file, but there is no a12 variable in the argument list of the kernel.

Most variables correspond to variables in my code (as I can see from the comments), apart from three integers.

Anyway, what would be interesting would be to know whether I have any way to suppress these data transfers, for example by putting the scalars in a data region.

Xavier
xlapillonne



Joined: 16 Feb 2011
Posts: 69

Posted: Mon Jul 23, 2012 2:07 am

Hi,

I am coming back to the first issue discussed in this post, concerning the overhead associated with the present directive in OpenACC:

Quote:
In this case what's happening is that the section descriptors for these arrays are being copied over to the GPU each time the present clause is encountered. Michael is aware of the issue and will have his team increase the priority of removing the need to copy over the section descriptor. This work had been scheduled for later this year.


Do you know whether this will be better in 12.6? Or could you give some timeline on when we could expect some improvement in this area?

With the partial support of the parallel construct, we were able to compile some of our OpenACC code with PGI. For some parts, however, where small kernels are called in a loop, this descriptor copy leads to code that is 10x slower than what we are getting with CCE.

Best regards,

Xavier
Page 2 of 3