PGI User Forum

Data Transfer between Accelerator and Host

 
THX 1138



Joined: 30 Jun 2011
Posts: 103

Posted: Wed Sep 21, 2011 12:23 pm    Post subject: Data Transfer between Accelerator and Host

In the article:

The PGI Accelerator Programming Model on NVIDIA GPUs Part 2 Performance Tuning

In the part of the article on data communication between the host and the device (accelerator), messages such as

Generating copyout(b[1:n-2][1:m-2])

are generated when compiling with the -Minfo option.

However, the programmer can be smarter than the compiler and write his own clauses here.

This topic is covered in the section Host/Accelerator Data Movement.

The clauses differ depending on the case, but the partial matrix ranges are the same.

For instance, matrix b from above is now

local(b[1:n-2][1:m-2])

since matrix b is only needed on the accelerator, not the host. However, the range of elements in matrix b is exactly the same in both cases. This is very convenient, since the partial matrix is already defined for you. Is this by accident in this special case, or is this the way it usually occurs?

This may be a naive question, and it is not addressed in any of the literature that I have read, but it seems very clear in the examples in the PGroup literature.

THX 1138
jtull



Joined: 30 Jun 2004
Posts: 631

Posted: Wed Sep 21, 2011 4:05 pm    Post subject: Data Transfer between Accelerator and Host

This is the explanation in the article.
============================================================
The second thing we notice is the data traffic for the a array includes a noncontiguous region. The copyin generated is for the whole matrix, but the copyout, from the GPU back to the host, only moves the modified elements, which are the interior of the array. This minimizes the data traffic, but moving noncontiguous regions is more costly than moving one large contiguous section. We can tune this by adding another clause to the region directive:

#pragma acc region local(b[1:n-2][1:m-2]) copy(a[0:n-1][0:m-1])

This tells the compiler to move the whole a array both over to the GPU and back again; it moves more data, but the moves are more efficient. The messages from the compiler are now:
==============================================================

The ranges 1:n-2 and 1:m-2 are the indices of the matrix minus the first and last rows and columns, that is, the interior of the array.
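To make the origin of those ranges concrete, here is a minimal sketch in plain C. It assumes a 4-point stencil loop of the kind the article tunes; the exact loop body, the flat row-major layout, and the names stencil_written_ranges, note_index and range_t are illustrative assumptions, not code from the article. The sketch records which rows and columns of b are actually stored to, so you can see that the written range follows directly from the loop bounds.

```c
#include <assert.h>

/* Inclusive index range, written lower:upper as in the -Minfo messages. */
typedef struct { int lo; int hi; } range_t;

/* Hypothetical helper: widen the range so that it includes index k. */
static void note_index(range_t *r, int k)
{
    if (k < r->lo) r->lo = k;
    if (k > r->hi) r->hi = k;
}

/* Assumed stencil: reads a, writes only the interior of b.
   Both arrays are n x m, stored row-major in flat buffers.
   On a PGI compiler this loop nest would sit inside a region such as
     #pragma acc region local(b[1:n-2][1:m-2]) copy(a[0:n-1][0:m-1])
   Here it runs on the host and reports the rows/columns of b it touched. */
void stencil_written_ranges(int n, int m, const float *a, float *b,
                            range_t *rows, range_t *cols)
{
    rows->lo = n;  rows->hi = -1;   /* empty range until a store happens */
    cols->lo = m;  cols->hi = -1;

    for (int i = 1; i <= n - 2; ++i) {
        for (int j = 1; j <= m - 2; ++j) {
            b[i * m + j] = 0.25f * (a[(i - 1) * m + j] + a[(i + 1) * m + j]
                                  + a[i * m + j - 1] + a[i * m + j + 1]);
            note_index(rows, i);
            note_index(cols, j);
        }
    }
}
```

For any n and m of at least 3, the recorded ranges come out as rows 1:n-2 and columns 1:m-2. If the loop bounds are what determine the stored elements, as in this sketch, then the same range would be expected in both the copyout message and the local clause.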

The copy(a[0:n-1][0:m-1]) clause is used because it would be faster to move the whole array than the non-contiguous parts in the interior.
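As a rough illustration of the contiguous versus non-contiguous trade-off, the sketch below counts how many separate contiguous blocks a subarray occupies in a row-major C array, and how many elements it holds. The names subarray_transfer and transfer_t are hypothetical, and the count only describes memory layout; how the PGI runtime actually schedules the transfers is not something this sketch models.

```c
#include <assert.h>

/* Layout summary for a rectangular subarray of a row-major array. */
typedef struct { long chunks; long elements; } transfer_t;

/* Hypothetical helper: m is the full row length of the array; the
   row and column ranges are inclusive lower:upper bounds. */
transfer_t subarray_transfer(int m, int row_lo, int row_hi,
                             int col_lo, int col_hi)
{
    transfer_t t = { 0, 0 };
    long rows = (long)row_hi - row_lo + 1;
    long cols = (long)col_hi - col_lo + 1;

    if (rows <= 0 || cols <= 0)
        return t;                       /* empty subarray */

    t.elements = rows * cols;
    /* Full-width rows sit back to back in memory, so they form a single
       block; partial rows leave gaps, giving one block per row. */
    t.chunks = (cols == m) ? 1 : rows;
    return t;
}
```

With n = 1000 and m = 1000, the interior a[1:n-2][1:m-2] is 998 separate blocks of 998 elements each, while the whole array a[0:n-1][0:m-1] is a single block of 1,000,000 elements: slightly more data, but one contiguous move.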
All times are GMT - 7 Hours