|
| View previous topic :: View next topic |
| Author |
Message |
MattD
Joined: 20 Aug 2007 Posts: 6 Location: Greenville, TX
|
Posted: Mon Aug 27, 2007 8:52 am Post subject: Performance problem with SSE intrinsics |
|
|
I'm running some simple code using SSE intrinsics to get the hang of the things. And I'm having some performance problems. For a simple calculation: \vector{r} = \sqrt( \vector{x}.^2 + \vector{y}.^2 ) the version with SSE intrinsics is actually running ~2x _slower_.
I'm pretty sure it's not a simple coding issue (althought it may be a simple versioning issue?). Running the same source with gcc, the SSE version runs ~40% faster (about what I was expecting).
Is there something different that I need to do with pgcc to get the expected behavior? Any thoughts appreciated.
FYI, I'm using: pgcc 7.0-7 64-bit target on x86-64 Linux
Code snippets, compilation commands, and outputs below:
| Code: |
[delaquil@head Cforpost]$ tail -14 Cmain.c
start = clock();
ComputeArrayC( pArray1, pArray2, pResult, nSize );
end = clock();
cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;
printf ("Regular C elapsed time: %f seconds.\n", cpu_time );
start = clock();
ComputeArrayCSSE( pArray1SSE, pArray2SSE, pResultSSE, nSize );
end = clock();
cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;
printf ("C with SSE elapsed time: %f seconds.\n", cpu_time );
}
[delaquil@head Cforpost]$ /bin/cat Ctest.c
#include <math.h>
void ComputeArrayC(
float* pArray1,
float* pArray2,
float* pResult,
int nSize)
{
int i;
float* pSource1 = pArray1;
float* pSource2 = pArray2;
float* pDest = pResult;
for ( i = 0; i < nSize; i++ )
{
*pDest = (float)sqrt((*pSource1) * (*pSource1) +
(*pSource2) * (*pSource2)) + 0.5f;
pSource1++;
pSource2++;
pDest++;
}
}
[delaquil@head Cforpost]$ /bin/cat CtestSSE.c
#include<xmmintrin.h>
void ComputeArrayCSSE(
float* pArray1,
float* pArray2,
float* pResult,
int nSize)
{
int i;
int nLoop = nSize / 4;
__m128 m1, m2, m3, m4;
__m128* pSrc1 = (__m128*) pArray1;
__m128* pSrc2 = (__m128*) pArray2;
__m128* pDest = (__m128*) pResult;
__m128 m0_5 = _mm_set_ps1(0.5f);
for ( i = 0; i < nLoop; i++ )
{
m1 = _mm_mul_ps(*pSrc1, *pSrc1);
m2 = _mm_mul_ps(*pSrc2, *pSrc2);
m3 = _mm_add_ps(m1, m2);
m4 = _mm_sqrt_ps(m3);
*pDest = _mm_add_ps(m4, m0_5);
pSrc1++;
pSrc2++;
pDest++;
}
}
[delaquil@head Cforpost]$ pgcc -O4 -Mvect=sse Cmain.c Ctest.c CtestSSE.c aligned_malloc.c
Cmain.c:
Ctest.c:
CtestSSE.c:
aligned_malloc.c:
[delaquil@head Cforpost]$ ./a.out
Regular C elapsed time: 0.140000 seconds.
C with SSE elapsed time: 0.310000 seconds.
[delaquil@head Cforpost]$ gcc -lm -O4 -msse -msse2 Cmain.c Ctest.c CtestSSE.c aligned_malloc.c
[delaquil@head Cforpost]$ ./a.out
Regular C elapsed time: 0.100000 seconds.
C with SSE elapsed time: 0.060000 seconds.
[delaquil@head Cforpost]$
| [/code] |
|
| Back to top |
|
 |
mkcolg
Joined: 30 Jun 2004 Posts: 4996 Location: The Portland Group Inc.
|
Posted: Tue Aug 28, 2007 11:33 am Post subject: |
|
|
Hi Matt,
I passed this on to the engineers who develop our intrinsic libraries since they are very interest in any performance issues. However, when they tried some arbitrary size arrays with random data, they did not see the performance issue. Would it be possible to obtain the full source? If large, please send to trs@pgroup.com and have customer service forward it to me.
Thanks,
Mat |
|
| Back to top |
|
 |
mkcolg
Joined: 30 Jun 2004 Posts: 4996 Location: The Portland Group Inc.
|
Posted: Tue Aug 28, 2007 2:59 pm Post subject: |
|
|
Hi Matt,
I forwarded your code to engineering and they have been able to reproduce the performance problem. I have created a technical problem report, TPR#4285, and hopefully will be able to address the issue shortly.
Thank you for bringing this to our attention!
Mat |
|
| Back to top |
|
 |
|
|
You cannot post new topics in this forum You cannot reply to topics in this forum You cannot edit your posts in this forum You cannot delete your posts in this forum You cannot vote in polls in this forum
|
Powered by phpBB © 2001, 2002 phpBB Group
|