PGI User Forum
Transposing 2dim Array

 
elephant



Joined: 24 Feb 2011
Posts: 22

Posted: Fri May 06, 2011 2:48 am    Post subject: Transposing 2dim Array

Hello

To get stride-1 access I have to transpose some 2-dimensional arrays in my code. Let's say I have an array A(6,2000000) and I want to transpose it. Is it better to do this on the host, or is there a way to make it fast on the device using the accelerator model?
If I do it like this:
Code:

!$acc region
     do i=1,6
        do j=1,knend
            A_transposed(j,i)=A(i,j)
        end do
     end do
!$acc end region


I get very poor performance.
Is there an efficient way of transposing arrays on the device using the PGI Accelerator model?

Thank you!
mkcolg



Joined: 30 Jun 2004
Posts: 6138
Location: The Portland Group Inc.

Posted: Fri May 06, 2011 9:50 am    Post subject: Re: Transposing 2dim Array

Hi elephant,

My guess is that the poor performance you are seeing is due to extra host-to-device copies. Are you using data regions? Try something like the following, where A is copied to and from the device only once and A_transposed is local to the device, so it is never copied.

Code:
% cat transpose.f90

program trans
implicit none
real, allocatable, dimension(:,:) :: A, A_transposed
integer :: i, j, knend
knend=200000
allocate(A(6,knend), A_transposed(knend,6))
A=1.0

!$acc data region copy(A), local(A_transposed)
! Copy A to the device and create A_transposed locally on the device

!$acc region
     do i=1,6
        do j=1,knend
            A_transposed(j,i)=A(i,j)
        end do
     end do
!$acc end region

!$acc region
     do i=1,6
        do j=1,knend
            A_transposed(j,i)=A_transposed(j,i)*6.0
        end do
     end do
!$acc end region

!$acc region
     do i=1,6
        do j=1,knend
            A(i,j)=A_transposed(j,i)
        end do
     end do
!$acc end region
!$acc end data region 
! A is copied back to the host here

print *, A(3,100)
deallocate(A, A_transposed)

end program trans
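
To answer the "host or device" part of the question, it can help to time a plain host version of the same transpose, scale, transpose-back sequence and compare it against the accelerator build. The following is only a minimal sketch using the standard transpose and system_clock intrinsics. The program name trans_host_check and the timing variables t0, t1 and rate are hypothetical names introduced for illustration; they are not part of the example above.

```fortran
program trans_host_check
implicit none
real, allocatable, dimension(:,:) :: A, A_transposed
integer :: knend
integer :: t0, t1, rate   ! hypothetical names: clock counts and ticks per second

knend=200000
allocate(A(6,knend), A_transposed(knend,6))
A=1.0

call system_clock(t0, rate)

! Host-only reference: same three steps as the accelerator version
A_transposed = transpose(A)          ! result has shape (knend,6)
A_transposed = A_transposed*6.0      ! same scaling as on the device
A = transpose(A_transposed)          ! back to shape (6,knend)

call system_clock(t1)

print *, 'host time in seconds: ', real(t1-t0)/real(rate)
print *, A(3,100)

! Since A starts as 1.0 and is scaled by 6.0, every element should be 6.0
if (any(A /= 6.0)) print *, 'unexpected value in A'

deallocate(A, A_transposed)

end program trans_host_check
```

If the accelerator version prints the same value for A(3,100), the two runs agree on the result, and the reported host time gives a baseline for judging whether the device transpose is worth the data transfer.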


Hope this helps,
Mat