tangoman
Joined: 23 Aug 2007 Posts: 6
|
Posted: Thu Aug 23, 2007 2:33 am Post subject: How can I make -Mvect=sse and -mp work together? |
|
|
Hi all,
I am trying to tune a numerical computing program with OpenMP on a multi-core AMD machine. I found that the program compiled with the -mp option is much slower than the one compiled without -mp, even when it runs with one thread. A simple test case follows:
| Code: |
!$OMP PARALLEL
!$OMP DO PRIVATE(i,j,k)
      do i=1,nx
        do j=1,ny
          do k=1,nz
            tmp = c0*(a(k-4,j,i)+a(k+4,j,i))
     &          + c1*(a(k-3,j,i)+a(k+3,j,i))
     &          + c2*(a(k-2,j,i)+a(k+2,j,i))
     &          + c3*(a(k-1,j,i)+a(k+1,j,i))
     &          + c4*a(k,j,i)
            b(k,j,i) = b(k,j,i)+c5*tmp
          enddo
        enddo
      enddo
!$OMP END PARALLEL
I use the -Minfo option to display compile-time optimization listings. It seems that the -Mvect=sse option conflicts with -mp. The difference is shown below:
pgf90 -tp k8-64 -fastsse -Minfo -Mneginfo -c -o test.o test.f
my_test:
19, Generated 3 alternate loops for the inner loop
Generated vector sse code for inner loop
Generated 2 prefetch instructions for this loop
Generated vector sse code for inner loop
Generated 2 prefetch instructions for this loop
Generated vector sse code for inner loop
Generated 2 prefetch instructions for this loop
Generated vector sse code for inner loop
Generated 2 prefetch instructions for this loop
pgf90 -tp k8-64 -fastsse -mp -Minfo -Mneginfo -c -o test.o test.f
my_test:
15, Parallel region activated
17, Parallel loop activated; static block iteration allocation
19, Unrolled inner loop 8 times
Generated 2 prefetch instructions for this loop
29, Barrier
Parallel region terminated
How can I make them work together? Any suggestions are welcome.
Thanks!
|
|
|
brentl
Joined: 20 Jul 2004 Posts: 107
|
Posted: Fri Aug 24, 2007 5:44 pm Post subject: |
|
|
| You might need to declare tmp to be private. |
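For readers following along, here is a minimal sketch of that change in the same fixed-form style as the posted test. The only difference in the directive is that `tmp` is added to the `PRIVATE` list, so each thread gets its own copy instead of all threads writing to one shared scalar (a data race). The program name, array sizes, and coefficient values are hypothetical and chosen only so the example compiles on its own; they are not from the original post.

```fortran
      program stencil_private
      implicit none
      integer nx, ny, nz
!     Hypothetical sizes, for illustration only
      parameter (nx=8, ny=8, nz=16)
!     a carries a halo of 4 points on each side of the k index
      real a(-3:nz+4, ny, nx), b(nz, ny, nx)
      real c0, c1, c2, c3, c4, c5, tmp
      integer i, j, k

!     Hypothetical coefficients and data
      c0 = 0.1
      c1 = 0.2
      c2 = 0.3
      c3 = 0.4
      c4 = 0.5
      c5 = 2.0
      a = 1.0
      b = 0.0

!$OMP PARALLEL
!     tmp is now private: one copy per thread
!$OMP DO PRIVATE(i,j,k,tmp)
      do i=1,nx
        do j=1,ny
          do k=1,nz
            tmp = c0*(a(k-4,j,i)+a(k+4,j,i))
     &          + c1*(a(k-3,j,i)+a(k+3,j,i))
     &          + c2*(a(k-2,j,i)+a(k+2,j,i))
     &          + c3*(a(k-1,j,i)+a(k+1,j,i))
     &          + c4*a(k,j,i)
            b(k,j,i) = b(k,j,i)+c5*tmp
          enddo
        enddo
      enddo
!     END DO is optional here; it is written out for clarity
!$OMP END DO
!$OMP END PARALLEL

      print *, b(1,1,1)
      end
```

Saved as a fixed-form file such as test.f, this should build with the same flags used in the thread (`-fastsse -mp -Minfo`), and the -Minfo listing can then be compared against the build without -mp.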
|
tangoman
Joined: 23 Aug 2007 Posts: 6
|
Posted: Sun Aug 26, 2007 10:56 pm Post subject: |
|
|
Yes, I made a mistake here. Thanks, brentl.
After I declared tmp as private, the optimization information is still a little different from the output without the -mp flag:
15, Parallel region activated
17, Parallel loop activated; static block iteration allocation
19, Generated an alternate loop for the inner loop
Generated vector sse code for inner loop
Generated 2 prefetch instructions for this loop
Generated vector sse code for inner loop
Generated 2 prefetch instructions for this loop
29, Barrier
Parallel region terminated
Any suggestions? |
|
brentl
Joined: 20 Jul 2004 Posts: 107
|
Posted: Thu Sep 13, 2007 11:02 am Post subject: |
|
|
| Our altcode generator makes decisions based on a number of factors, and being inside a parallel region is one of them. That is why the listings differ. If you find that it makes a big performance difference, please let us know. Since the code now vectorizes in both cases, it should be running fairly well. |
|
tangoman
Joined: 23 Aug 2007 Posts: 6
|
Posted: Wed Sep 19, 2007 12:32 am Post subject: |
|
|
| Thanks, these two versions run at almost the same speed. |
|