PGI User Forum
 SearchSearch   MemberlistMemberlist     RegisterRegister   ProfileProfile    Log inLog in 

CUDA-x86.

Is there a way to vectorize this routine?
Goto page 1, 2  Next
 
Post new topic   Reply to topic    PGI User Forum Forum Index -> Performance and Benchmarking
View previous topic :: View next topic  
Author Message
tangoman



Joined: 23 Aug 2007
Posts: 6

PostPosted: Wed Sep 19, 2007 12:30 am    Post subject: Is there a way to vectorize this routine? Reply with quote

Hi all,

I am tring to optimize the fowllowing routine:
Code:
      PROGRAM TEST
      IMPLICIT NONE

      integer nx,ny,nz,nu
      parameter(nx=128)
      parameter(ny=128)
      parameter(nz=128)
      parameter(nu=68)

      real*4 a,b,c
      integer d
      dimension a(nz,ny,nx)
      dimension b(nz,ny,nx)
      dimension c(nu)
      dimension d(nz,ny,nx)

      integer i,j,k
      integer index

      do i=1,nx
         do j=1,ny
            do k=1,nz
               index = d(k,j,i)
               a(k,j,i) = a(k,j,i)+c(index)*b(k,j,i)
            enddo
         enddo
      enddo

      return
      end


I tried -Mvect=sse option. The optimization inforamtion says:
% pgf95 -Mvect=sse -Minfo test.f
test:
22, Unrolled inner loop 8 times
Generated 3 prefetch instructions for this loop

Is there a way to vectorize it? Thanks!
Back to top
View user's profile
MattD



Joined: 20 Aug 2007
Posts: 6
Location: Greenville, TX

PostPosted: Wed Sep 19, 2007 8:20 am    Post subject: Reply with quote

tangoman,

Yes and no. I'll tell you two ways to vectorize it, but I can't guarantee that either will be practical. I actually spent some time trying to vectorize something similar, and the best I did was break-even. (On an AMD. I haven't tried it on an Intel. And I'm not convinced they're equal for SSE stuff.)

That being said, if someone else knows how to do it more effectively, I'd love to find out.

The problem, as I see it, is that `c' is out-of-order relative to `a' and `b'. Stuff only vectorizes well if it's in order (so you can interact the 4x32-bit registers usefully). Two ways to vectorize it (both of which, I suspect, will contribute too much overhead):

1.) Invert `d' to create a map from `nu' to `nz,ny,nx'. Then re-order `a' and `b' so that they are arranged to correspond with increasing `nu'. If you have to explicitly re-order `a' and `d' after they're filled, this is probably hoplessly expensive. However, if you can re-order the initial filling of `a' and `d' this can actually work.

2.) ``Explode'' `c' into a nz*ny*nx array, replicating the values as required (`nu' << nz*ny*nx). You're actually doing that on-the-fly with c(index). But if you spend the extra resources pre-constructing it, you can vectorize trivially because `a' and `b' and `c' will all be the same size and the same order. That might even vectorize automatically in the compiler if you did the pre-work.

Option (1) will, quite-likely, require you to use intrinsics or inline assembly. Intrinsics are not too bad, but they're only included in the C/C++ compilers, so you will have to call a C function to implement them. It's not as bad as it sounds. I've never used inline assembly, so I can't say much about it, except that I never want to use it.

Good luck,

-Matt
Back to top
View user's profile
brentl



Joined: 20 Jul 2004
Posts: 108

PostPosted: Fri Sep 21, 2007 4:48 pm    Post subject: Reply with quote

We will vectorize loops with indirect addressing in PGI 7.1. If you can wait a few weeks :) YOu can write it as
a(k,j,i) = c(d(k,j,i))*b(k,j,i)
and it will vectorize. (Actually it will vectorize the way you have it there...)
Back to top
View user's profile
tangoman



Joined: 23 Aug 2007
Posts: 6

PostPosted: Sun Sep 23, 2007 7:27 pm    Post subject: Reply with quote

Thank MattD and Brentl for your kind reply. Looking forward to this feature coming soon. :-)

Another question. Is there a way to vectorize reduction operation? Somehing like this:
[/code]
do i=1, 1024
sum = sum+a(i)
enddo
[code]

Thanks!
Back to top
View user's profile
MattD



Joined: 20 Aug 2007
Posts: 6
Location: Greenville, TX

PostPosted: Mon Sep 24, 2007 10:44 am    Post subject: Reply with quote

tangoman,

The current version of pgf90 seems to vectorize that operation automatically. If your version doesn't vectorize that operation yet, you can also try the SUM(array) function for some improvement (and cleaner code).

Please see the following test code and outputs.

Code:

program test

 integer :: ii
 real :: a(1024), sum1, sum2
 double precision :: start, finish, t1, t2

 do ii = 1 , 1024
  a(ii) = sin(2*3.1416*real(ii)/1024.0)
 enddo

 t1 = 0
 do jj = 1 , 1000000
  sum1 = 0

  call cpu_time(start)
  do ii = 1 , 1024
   sum1 = sum1 + a(ii)
  enddo
  call cpu_time(finish)

  t1 = t1 + (finish-start)
 enddo

 write(*,*) 'Loop took ',t1,' microseconds'
 write(*,*) 'Error = ',sum1

 t2 = 0
 do jj = 1 , 1000000
  sum2 = 0

  call cpu_time(start)
  sum2 = sum2 + sum(a)
  call cpu_time(finish)

  t2 = t2 + (finish-start)
 enddo

 write(*,*) 'SUM function took ',t2,' microseconds'
 write(*,*) 'Error = ',sum2

end program


Compilation and output:
[delaquil@head TEMP]$ pgf90 -V

pgf90 7.0-7 64-bit target on x86-64 Linux
Copyright 1989-2000, The Portland Group, Inc. All Rights Reserved.
Copyright 2000-2007, STMicroelectronics, Inc. All Rights Reserved.
[delaquil@head TEMP]$ pgf90 -Mfree -fastsse -Mvect=nosse -Minfo test.f
test:
32, sum reduction inlined
test:
16, Loop unrolled 8 times
32, Loop unrolled 8 times
[delaquil@head TEMP]$ ./a.out
Loop took 4.492819070816040 microseconds
Error = 2.5787554E-04
SUM function took 3.693236351013184 microseconds
Error = 2.5787554E-04
[delaquil@head TEMP]$ pgf90 -Mfree -fastsse -Minfo test.f
test:
32, sum reduction inlined
test:
16, Generated vector sse code for inner loop
Generated 1 prefetch instructions for this loop
32, Generated vector sse code for inner loop
Generated 1 prefetch instructions for this loop
[delaquil@head TEMP]$ ./a.out
Loop took 2.365142345428467 microseconds
Error = 5.6102628E-05
SUM function took 2.222937107086182 microseconds
Error = 5.6102628E-05
[delaquil@head TEMP]$

-Matt
Back to top
View user's profile
Display posts from previous:   
Post new topic   Reply to topic    PGI User Forum Forum Index -> Performance and Benchmarking All times are GMT - 7 Hours
Goto page 1, 2  Next
Page 1 of 2

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © phpBB Group