PGI User Forum

Array copy optimize
Senya



Joined: 20 Jun 2011
Posts: 58

Posted: Wed Dec 11, 2013 5:52 am    Post subject: Array copy optimize

Hello. I'm trying to optimize an array copy in my program. I assumed that a sequential memory copy would be faster than an element-by-element copy in a loop. However, my measurements show the opposite.

definitions:
Code:

integer, parameter :: KOL_MOMENT_MAX=2000
integer, parameter :: NN1=514, NN2=280
real, allocatable :: E0_NAKOP(:,:,:),C0_NAKOP(:,:,:),E0(:,:),C0(:,:)
allocate(E0_NAKOP(KOL_MOMENT_MAX,NN1,NN2))
allocate(C0_NAKOP(KOL_MOMENT_MAX,NN1,NN2))
allocate(E0(NN1,NN2))
allocate(C0(NN1,NN2))

first variant
Code:

do i=1,NN2
do j=1,NN1
E0_NAKOP(KOL_NAKOP,j,i)=E0(j,i)
C0_NAKOP(KOL_NAKOP,j,i)=C0(j,i)
enddo
enddo

second variant
Code:

E0_NAKOP(KOL_NAKOP,:,:)=E0(:,:)
C0_NAKOP(KOL_NAKOP,:,:)=C0(:,:)

third variant
Code:

E0_NAKOP(KOL_NAKOP,:,:)=E0
C0_NAKOP(KOL_NAKOP,:,:)=C0


The run time of the function that does this copy changes as follows:
1. 334.7
2. 418.3
3. 538.1

As I understand it, the compiler generates some odd code: variants 2 and 3 are not actually a sequential copy but a variant of the same loop with even more overhead.
How can I force the compiler to use something like C memcpy, which should be the most efficient way to copy a sequential (contiguous) array?
Senya



Joined: 20 Jun 2011
Posts: 58

Posted: Wed Dec 11, 2013 6:13 am    Post subject:

I suspect my problem is that the sequentially accessible (contiguous) subscripts are the leftmost ones, not the rightmost ones. Am I right?
Senya



Joined: 20 Jun 2011
Posts: 58

Posted: Wed Dec 11, 2013 12:46 pm    Post subject:

I tried moving KOL_NAKOP to the last dimension:
Code:

      E0_NAKOP(:,:,KOL_NAKOP)=E0(:,:)
      C0_NAKOP(:,:,KOL_NAKOP)=C0(:,:)

but this also takes more time than the first variant.
I looked at the generated assembly and saw a lot of overhead code for each array, even when a fast sequential copy is possible. That explains why the first variant is the fastest. But I would still like to know: is it possible to make this copy as fast as C memcpy? memcpy is said to be faster for sequential data than a simple element-by-element loop.
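For anyone who wants to reproduce the comparison, here is a minimal timing sketch. It is not the original program: the array names nakop_first and nakop_last are hypothetical, KOL_MOMENT_MAX is reduced to 100 to keep memory use modest, and the copy is repeated for every index so that the timings are measurable. Results will depend on the compiler and flags used.

```fortran
program copy_timing_sketch
  use iso_fortran_env, only: int64
  implicit none
  ! Assumption: KOL_MOMENT_MAX reduced from 2000 to limit memory use.
  integer, parameter :: KOL_MOMENT_MAX = 100
  integer, parameter :: NN1 = 514, NN2 = 280
  real, allocatable :: E0(:,:), nakop_first(:,:,:), nakop_last(:,:,:)
  integer(int64) :: t0, t1, rate
  integer :: i, j, k

  allocate(E0(NN1,NN2))
  allocate(nakop_first(KOL_MOMENT_MAX,NN1,NN2))  ! index in the first dimension
  allocate(nakop_last(NN1,NN2,KOL_MOMENT_MAX))   ! index in the last dimension
  call random_number(E0)
  ! Touch the destination arrays once so page allocation is not timed.
  nakop_first = 0.0
  nakop_last = 0.0

  ! Variant 1: explicit loops, index in the first dimension.
  call system_clock(t0, rate)
  do k = 1, KOL_MOMENT_MAX
    do i = 1, NN2
      do j = 1, NN1
        nakop_first(k,j,i) = E0(j,i)
      end do
    end do
  end do
  call system_clock(t1)
  print *, 'explicit loops, index first  :', real(t1 - t0) / real(rate), 's'

  ! Variant 2: array assignment, index in the first dimension (strided).
  call system_clock(t0)
  do k = 1, KOL_MOMENT_MAX
    nakop_first(k,:,:) = E0(:,:)
  end do
  call system_clock(t1)
  print *, 'array assignment, index first:', real(t1 - t0) / real(rate), 's'

  ! Variant 3: array assignment, index in the last dimension (contiguous).
  call system_clock(t0)
  do k = 1, KOL_MOMENT_MAX
    nakop_last(:,:,k) = E0(:,:)
  end do
  call system_clock(t1)
  print *, 'array assignment, index last :', real(t1 - t0) / real(rate), 's'

  ! Print checksums so the copies cannot be optimized away.
  print *, 'checksums:', sum(nakop_first(1,:,:)), sum(nakop_last(:,:,1))
end program copy_timing_sketch
```

Building it once without optimization and once with -fast should show whether the flags change the ranking of the variants.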
mkcolg



Joined: 30 Jun 2004
Posts: 5952
Location: The Portland Group Inc.

Posted: Thu Dec 12, 2013 10:22 am    Post subject:

Hi Senya,

What optimization flags are you using?

We do perform idiom recognition and will replace an array assignment with mcopy or mset when the shapes of the arrays are the same and the data is contiguous.

In your first examples, where KOL_NAKOP is in the first dimension, the data is not contiguous. However, the second and third variants should be faster if you use "-fast".

In the second example, where KOL_NAKOP is in the third dimension, the data is contiguous, so memcopy will be used (again with -fast).
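The contiguity difference can be checked with a small sketch. It assumes a compiler that supports the Fortran 2008 intrinsic is_contiguous (which may not be true of older releases), and the array names and tiny extents are hypothetical, chosen only to keep the output readable. Fortran stores arrays in column-major order, so the leftmost subscript varies fastest in memory.

```fortran
program contiguity_sketch
  implicit none
  ! Hypothetical small extents; the real arrays are much larger.
  integer, parameter :: KMAX = 4, N1 = 3, N2 = 2
  real, allocatable :: first_dim(:,:,:), last_dim(:,:,:)
  integer :: k

  allocate(first_dim(KMAX, N1, N2))   ! layout from the original post
  allocate(last_dim(N1, N2, KMAX))    ! layout with the index moved last
  k = 2

  ! Fixing the first subscript selects every KMAX-th element: strided.
  print *, 'first_dim(k,:,:) contiguous? ', is_contiguous(first_dim(k,:,:))
  ! Fixing the last subscript selects one unbroken block of N1*N2 elements.
  print *, 'last_dim(:,:,k)  contiguous? ', is_contiguous(last_dim(:,:,k))

  ! Zero-based linear offsets of two consecutive elements of each section.
  print *, 'offsets, index first:', offset(k,1,1, KMAX,N1), offset(k,2,1, KMAX,N1)
  print *, 'offsets, index last :', offset(1,1,k, N1,N2), offset(2,1,k, N1,N2)

contains

  ! Column-major offset of element (i1,i2,i3) in an array whose first
  ! two extents are d1 and d2.
  pure integer function offset(i1, i2, i3, d1, d2)
    integer, intent(in) :: i1, i2, i3, d1, d2
    offset = (i1 - 1) + d1 * ((i2 - 1) + d2 * (i3 - 1))
  end function offset

end program contiguity_sketch
```

With the index first, neighbouring elements of the section are KMAX elements apart in memory; with the index last, they are adjacent, which is the case a block copy can handle.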

I wrote up examples for each. From the "-Minfo", I see memcopy being used in the second example:


Code:
! first example E0_NAKOP(KOL_NAKOP,:,:)=E0(:,:)
% pgf90 senya.f90 -Minfo -fast -V13.10
foo:
     16, Memory set idiom, array assignment replaced by call to pgf90_mset4
     17, Loop not fused: function call before adjacent loop
     18, Generated vector sse code for the loop
     22, Memory copy idiom, array assignment replaced by call to pgf90_mcopy4
     24, Loop interchange produces reordered loop nest: 25,26,24
         Generated an alternate version of the loop
         Generated vector sse code for the loop
     25, Loop not fused: function call before adjacent loop
         5 loops fused
     26, Loop not fused: dependence chain to sibling loop
     33, Loop distributed: 2 new loops
         Loop interchange produces reordered loop nest: 34,34,33
         Loop interchange produces reordered loop nest: 35,35,33
         2 loops fused
         Generated an alternate version of the loop
         Generated vector sse code for the loop
     34, Loop not fused: dependence chain to sibling loop
         2 loops fused
     38, Loop distributed: 2 new loops
         Loop interchange produces reordered loop nest: 39,39,38
         Loop interchange produces reordered loop nest: 40,40,38
         2 loops fused
         Generated an alternate version of the loop
         Generated vector sse code for the loop
     39, 2 loops fused

! second example E0_NAKOP(:,:,KOL_NAKOP)=E0(:,:)
% pgf90 senya2.f90 -Minfo -fast -V13.10
foo:
     16, Memory set idiom, array assignment replaced by call to pgf90_mset4
     17, Loop not fused: function call before adjacent loop
     18, Generated vector sse code for the loop
     22, Memory copy idiom, array assignment replaced by call to pgf90_mcopy4
     24, Loop interchange produces reordered loop nest: 25,24,26
     26, Generated an alternate version of the loop
         Generated vector sse code for the loop
         Generated 2 prefetch instructions for the loop
     34, Memory copy idiom, loop replaced by call to __c_mcopy4
     35, Memory copy idiom, loop replaced by call to __c_mcopy4
     39, Memory copy idiom, loop replaced by call to __c_mcopy4
     40, Memory copy idiom, loop replaced by call to __c_mcopy4


- Mat
Senya



Joined: 20 Jun 2011
Posts: 58

Posted: Fri Dec 13, 2013 11:18 am    Post subject:

OK, thank you. That was what I needed.
Just to share my thoughts:
The only thing I dislike is that -fast implies loop vectorization, which makes debugging impractical. So if you want to debug, you have to use suboptimal loops instead of idiom replacement. It would be good to be able either to enable idiom replacement separately or to tell the compiler explicitly to use it in particular cases.
However, I think idiom replacement works well enough to be the default compiler behavior (even with optimization turned off), with an option to disable it. That is how I expected the compiler to behave.
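One possible way to request a block copy explicitly, independent of optimization flags, is to call the C library memcpy through the standard iso_c_binding module. This is only a sketch under assumptions, not a method confirmed in this thread: it requires the index to be in the last dimension so that the destination slab is contiguous, it requires the arrays to have the target attribute, and the interface name c_memcpy is my own.

```fortran
program memcpy_sketch
  use iso_c_binding, only: c_ptr, c_size_t, c_float, c_loc, c_sizeof
  implicit none

  interface
    ! Binding to the C library routine: void *memcpy(void *, const void *, size_t).
    ! The Fortran-side name c_memcpy is hypothetical.
    function c_memcpy(dest, src, n) bind(C, name='memcpy') result(res)
      import :: c_ptr, c_size_t
      type(c_ptr), value :: dest, src
      integer(c_size_t), value :: n
      type(c_ptr) :: res
    end function c_memcpy
  end interface

  ! Assumption: a small accumulation depth, just for the demonstration.
  integer, parameter :: NN1 = 514, NN2 = 280, KOL_MOMENT_MAX = 4
  real(c_float), allocatable, target :: E0(:,:), E0_NAKOP(:,:,:)
  type(c_ptr) :: ignored
  integer(c_size_t) :: nbytes
  integer :: KOL_NAKOP

  allocate(E0(NN1,NN2))
  allocate(E0_NAKOP(NN1,NN2,KOL_MOMENT_MAX))   ! index in the LAST dimension
  call random_number(E0)
  E0_NAKOP = 0.0
  KOL_NAKOP = 3

  ! Size in bytes of one NN1 x NN2 slab.
  nbytes = int(NN1, c_size_t) * int(NN2, c_size_t) * c_sizeof(E0(1,1))

  ! Copy E0 into the contiguous slab E0_NAKOP(:,:,KOL_NAKOP).
  ignored = c_memcpy(c_loc(E0_NAKOP(1,1,KOL_NAKOP)), c_loc(E0(1,1)), nbytes)

  ! Verify the copy and check that a neighbouring slab was not touched.
  if (any(E0_NAKOP(:,:,KOL_NAKOP) /= E0)) error stop 'copy mismatch'
  if (any(E0_NAKOP(:,:,KOL_NAKOP-1) /= 0.0)) error stop 'neighbour slab modified'
  print *, 'memcpy copy verified'
end program memcpy_sketch
```

This does not help with the original layout, where KOL_NAKOP is the first subscript: that section is strided, so no single block copy can fill it.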
All times are GMT - 7 Hours