PGI User Forum


Number data copyin and copyout unexpected

 
cps



Joined: 22 Oct 2008
Posts: 2

Posted: Wed Apr 23, 2014 12:00 pm    Post subject: Number data copyin and copyout unexpected

Hello,

I'm porting an existing CFD code to OpenACC and am trying to optimize the data transfers a bit.

I noticed that with PGI_ACC_TIME=1 enabled, I get the following breakdown:

27: data region reached 11 times
27: data copyin reached 440 times
device time(us): total=1,216,222 max=2,815 min=1,992 avg=2,764
84: data copyout reached 77 times
device time(us): total=186,841 max=2,701 min=1,467 avg=2,426

My data region statement is:

*$acc data pcopyin(x, u) pcopyout(xmu)

where x, u, and xmu are multidimensional (5-D) arrays in Fortran. I expected the copyin to occur 22 times (perhaps 33 to allow for the alloc) and the copyout to occur 11 times, since I'm calling the routine 11 times for the benchmark.

Any hints as to why the number of copies doesn't match this expectation?

I also tried replacing the data region with explicit acc_copyin(), acc_create(), acc_delete(), and acc_copyout() calls (along with a present clause on the kernel), but that isn't working properly even though the number of transfers is 3. That's for another post, though.

Thanks for any guidance here.
mkcolg



Joined: 30 Jun 2004
Posts: 5952
Location: The Portland Group Inc.

Posted: Wed Apr 23, 2014 1:21 pm

Hi cps,

To maximize the performance of large data transfers, the runtime breaks each transfer into multiple smaller transfers. While one pinned memory buffer is being sent to the device, other buffers are filled from virtual memory. The copy from virtual to pinned memory takes time, so this overlap helps speed things up. The copyin and copyout counts in the profile are these individual buffer transfers, which is why they are larger than the number of arrays times the number of calls.
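As a rough sanity check, the counts from the original post can be divided back out to see how many buffer transfers each array needs per call. This is only an average: it assumes x and u are the same size and that every call transfers the same amount, neither of which is stated in the thread.

```shell
# Counts taken from the PGI_ACC_TIME output in the question.
calls=11            # "data region reached 11 times"
copyin_total=440    # "data copyin reached 440 times"
copyout_total=77    # "data copyout reached 77 times"

# Arrays per clause: pcopyin(x, u) and pcopyout(xmu).
copyin_arrays=2
copyout_arrays=1

# Average number of buffer transfers per array per call
# (assumes equal-sized arrays and identical calls).
copyin_per_array=$(( copyin_total / (calls * copyin_arrays) ))
copyout_per_array=$(( copyout_total / (calls * copyout_arrays) ))

echo "copyin transfers per array per call:  $copyin_per_array"
echo "copyout transfers per array per call: $copyout_per_array"
```

Both divisions come out exact (20 and 7), which is consistent with each array being moved as a fixed number of buffer-sized pieces on every call.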

We've debated internally what the correct number to print here is. It was decided that it should match what you would see if you had profiled the run with nvprof or nvvp. However, we acknowledge that this can be confusing. I'll bring it up again.

Note, you can change the buffer size and thus reduce the number of transfers by setting the environment variable "PGI_ACC_BUFFERSIZE".
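A minimal sketch of setting the variable before a run. The value below is a placeholder, and it assumes the setting is interpreted as a size in bytes; check the PGI documentation for your release for the actual units and default. The executable name is hypothetical.

```shell
# Placeholder buffer size (assumed to be in bytes): 16 MiB.
export PGI_ACC_BUFFERSIZE=16777216

# Keep the timing summary on so the copyin/copyout counts can be compared.
export PGI_ACC_TIME=1

# Run the benchmark (hypothetical executable name):
# ./cfd_benchmark
```

If the buffer is larger, each array should need fewer buffer transfers, so the reported copyin and copyout counts should drop.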

- Mat
All times are GMT - 7 Hours