PGI User Forum
OpenACC: Problem with present directive and module array
mkcolg



Joined: 30 Jun 2004
Posts: 6213
Location: The Portland Group Inc.

Posted: Wed Jul 25, 2012 1:01 pm

Hi Xavier,

Quote:
Do you know whether this will be better in 12.6?

Nothing has changed w.r.t. how arguments are passed in 12.6, and Michael did not give me a timeline on when they'll be able to revisit this.

Michael did mention you can try setting the environment variable "PGI_ACC_SEQDATA" to '1'. One of the various things that does is to use async data movement from pinned memory for argument buffers. It may help here.

- Mat
xlapillonne



Joined: 16 Feb 2011
Posts: 69

Posted: Tue Jul 31, 2012 6:00 am

Hi Mat,

Thanks for the update.

Using PGI_ACC_SEQDATA did not help much in our case: in the part of our code where we noticed this problem, the kernels are quite small and the timing seems to be dominated by these metadata updates.


Xavier
Michael Wolfe



Joined: 19 Jan 2010
Posts: 42

Posted: Thu Aug 02, 2012 9:48 am

Xavier: You say you see cuMemcpyHtoD calls. These should be cuMemcpyHtoDAsync calls; the updates should be getting done asynchronously, at least with the latest release (12.6). Now that I think on this, I'm not sure those updates were asynchronous before 12.5. There's really no workaround for sending large argument lists to a kernel except to use a memcopy. We put those argument structs into constant memory on the device side.
xlapillonne



Joined: 16 Feb 2011
Posts: 69

Posted: Mon Aug 06, 2012 2:12 am

Hi Michael,


Sorry, it was probably not clear in my previous post: looking at the profiler, when I go from 12.5 to 12.6 I see those unexpected cuMemcpyHtoD calls changing to cuMemcpyHtoDAsync. However, in the part of the code where we noticed this issue, the kernel times are comparable to the cuMemcpyHtoDAsync times, so having the copies asynchronous does not help very much.

The problem is that the kernels are called in a loop, so these copies are done very often.

Quote:

There's really no workaround for sending large argument lists to a kernel except to use a memcopy


Note that the cuMemcpyHtoDAsync calls I am referring to are those associated with the present directive when using OpenACC with PGI. What I don't fully understand is the fundamental difference between data create + present in OpenACC and mirror + reflected using the old PGI directives. With "mirror + reflected" I didn't see these additional memcopies at each kernel call, so I assume in this case the array information was sent to the device at the beginning of the implicit data region associated with the mirror directive. Is something similar not possible with OpenACC?
Note also that we are not seeing additional copies when using the Cray compiler on the same code.
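
For readers following the thread, here is a minimal sketch of the "data create + present" pattern with small kernels called in a loop. It is written in C with OpenACC pragmas for compactness; the code under discussion is not shown in the thread and may well be Fortran. The names `init_field`, `add_one` and `run_steps` are hypothetical. The sketch only shows the structure; whether a given compiler re-sends array metadata at every kernel launch is implementation-specific and cannot be seen from the source.

```c
#include <assert.h>
#include <stdlib.h>

/* Fill the device copy of a[] with zeros. present() states that a[] is
   already on the device, so no array data should be transferred here. */
static void init_field(float *a, int n)
{
    #pragma acc parallel loop present(a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] = 0.0f;
}

/* Deliberately small kernel, standing in for the short kernels
   described in the post. */
static void add_one(float *a, int n)
{
    #pragma acc parallel loop present(a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] += 1.0f;
}

/* Hypothetical driver: allocate on the device once, call the small kernel
   nsteps times, then fetch the result. Returns the last element,
   or -1 if the host allocation fails. */
float run_steps(int n, int nsteps)
{
    assert(n > 0 && nsteps >= 0);
    float *a = malloc((size_t)n * sizeof *a);
    if (a == NULL)
        return -1.0f;
    float last;

    #pragma acc data create(a[0:n])
    {
        init_field(a, n);
        for (int step = 0; step < nsteps; ++step)
            add_one(a, n);          /* one kernel launch per step */
        #pragma acc update host(a[0:n])
    }
    last = a[n - 1];
    free(a);
    return last;
}
```

Built without OpenACC support the pragmas are ignored and the loops run on the host, which makes it easy to check the logic before profiling the accelerated build.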

Xavier
xlapillonne



Joined: 16 Feb 2011
Posts: 69

Posted: Tue Aug 14, 2012 2:32 am

Hi, here is an update to my previous post:

- Following advice from Mat in another post, we removed all the "private" statements. We found out that this was in fact causing most of the performance issue; the code is now about 10 times faster. My guess is that the private directive was generating some additional cudaMalloc calls, and since our kernels are called in a loop this was very bad (or do you have any other idea/experience?).

- The time spent in the cuMemcpyHtoDAsync calls associated with the array metadata is still of the order of the kernel time, so I am assuming that it cannot be fully overlapped, and it still leads to some performance penalty. In fact we found out that setting PGI_ACC_SEQDATA=1 (making the transfer sequential) is faster in our case. I'm not sure why, but my guess is that there is a trade-off between making the transfer asynchronous and the larger time required to allocate pinned memory.
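
To illustrate the first point, here is a rough sketch of what removing a "private" clause can look like, again in C with OpenACC pragmas and with hypothetical names (`smooth_private`, `smooth_scalars`); the actual code is not shown in the thread. The first version uses a private scratch array; the second replaces it with scalars declared inside the loop body, which are local to each iteration and need no private clause. That the private array leads to extra device allocations is only the guess stated above, not something this sketch demonstrates.

```c
#include <assert.h>

/* Version with a private scratch array: each iteration gets its own tmp[]. */
void smooth_private(const float *in, float *out, int n)
{
    float tmp[3];
    assert(n >= 2);
    #pragma acc parallel loop private(tmp) copyin(in[0:n]) copyout(out[0:n])
    for (int i = 0; i < n; ++i) {
        tmp[0] = in[i > 0 ? i - 1 : i];       /* clamp at the left edge  */
        tmp[1] = in[i];
        tmp[2] = in[i < n - 1 ? i + 1 : i];   /* clamp at the right edge */
        out[i] = (tmp[0] + tmp[1] + tmp[2]) / 3.0f;
    }
}

/* Same computation without a private clause: the scratch values are
   scalars declared in the loop body, so they are per-iteration already. */
void smooth_scalars(const float *in, float *out, int n)
{
    assert(n >= 2);
    #pragma acc parallel loop copyin(in[0:n]) copyout(out[0:n])
    for (int i = 0; i < n; ++i) {
        float left  = in[i > 0 ? i - 1 : i];
        float mid   = in[i];
        float right = in[i < n - 1 ? i + 1 : i];
        out[i] = (left + mid + right) / 3.0f;
    }
}
```

Both functions compute the same three-point average, so the rewrite can be validated on the host before comparing profiles of the two accelerated builds.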

Best regards,

Xavier
All times are GMT - 7 Hours