PGI User Forum


CUDA and Exponentiation

 
TheMatt



Joined: 06 Jul 2009
Posts: 322
Location: Greenbelt, MD

Posted: Tue Aug 17, 2010 5:13 am    Post subject: CUDA and Exponentiation

I think I asked something about this long ago, but I thought it might be useful to ask now that CUDA Fortran is here.

Namely, what is the best way to handle exponentiation in CUDA Fortran? I'm currently trying to transform an old F77 code into CUDA Fortran, and so I'm encountering many of those tricks you see in hand-optimized code. For example, in my code, a variable is raised to the 58th power by:
Code:
yy  = x*x              ! x**2
yy1 = yy*yy            ! x**4
yy1 = yy1*yy1          ! x**8
yy2 = yy1*yy1          ! x**16
yy3 = yy2*yy2          ! x**32
y   = yy*yy1*yy2*yy3   ! x**(2+8+16+32) = x**58
which uses 8 multiplies instead of a pow call, and is probably faster than
Code:
y=x**58
when you are using a CPU. But I began to wonder if, with CUDA, I'd be better off using the ** operator. First, perhaps the __powf call that might be used (if fastmath maps ** onto that; I'm not sure) executes in fewer cycles? Or perhaps the reduced register pressure would help in a sprawling code like mine (I use maxregcount, so anything extra spills out to local memory)?
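For when the code does compile, this is roughly the benchmark I have in mind (just a sketch: the module and kernel names are mine, and I'm assuming the usual CUDA Fortran kernel boilerplate):
Code:
module pow_bench
  use cudafor
contains

  ! Explicit multiply chain: x**58 in 8 multiplies
  attributes(global) subroutine pow_mult(x, y, n)
    real :: x(*), y(*)
    integer, value :: n
    integer :: i
    real :: yy, yy1, yy2, yy3
    i = (blockIdx%x - 1)*blockDim%x + threadIdx%x
    if (i <= n) then
       yy  = x(i)*x(i)        ! x**2
       yy1 = yy*yy            ! x**4
       yy1 = yy1*yy1          ! x**8
       yy2 = yy1*yy1          ! x**16
       yy3 = yy2*yy2          ! x**32
       y(i) = yy*yy1*yy2*yy3  ! x**58
    end if
  end subroutine pow_mult

  ! ** operator: may become powf (or __powf under -Mcuda=fastmath)
  attributes(global) subroutine pow_op(x, y, n)
    real :: x(*), y(*)
    integer, value :: n
    integer :: i
    i = (blockIdx%x - 1)*blockDim%x + threadIdx%x
    if (i <= n) y(i) = x(i)**58
  end subroutine pow_op

end module pow_bench
Each kernel would get launched on the same input, e.g. call pow_mult<<<grid,block>>>(x_d, y_d, n), and timed.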

I ask mainly because there are many hand-coded **8, **6, and even a **58 and a **21 in this program. True, the best answer is to try both and benchmark, but at the moment this code won't compile with -Mcuda due to too much local memory use (I'm working on cutting that down) and some weird issue with constant memory that might need a bug report (the __constant__ struct has hundreds, even thousands, of entries... which isn't right).

Matt

PS: Actually, I slightly lied. The **21 I mention above? It looks like it must be done via the multiply route. It turns out the compiler gives equal results whether I use **58 or the multiplies above, but **21 is compiled differently from the multiply method in the current code, leading to floating-point differences.
mkcolg



Joined: 30 Jun 2004
Posts: 6208
Location: The Portland Group Inc.

Posted: Tue Aug 17, 2010 4:08 pm

Hi Matt,

Sorry, I have not looked at the performance of pow on a GPU, so I don't know which is better.

What I do know is that the compiler will generate code that uses the multiplication method for x**y where y <= 10. For values of y greater than 10, a 'powf(x,y)' call (or 'pow' if x is double precision) is generated. If using "-Mcuda=fastmath", 'powf' is replaced with the faster but less accurate '__powf' function.
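Just to illustrate the cutoff (a sketch; I haven't checked the generated code for every case):
Code:
y1 = x**8    ! exponent <= 10: expanded inline into multiplies
y2 = x**21   ! exponent > 10: becomes a powf call
! To force the multiply path for a big exponent, split it by hand:
t  = x**7    ! 7 <= 10, so this is expanded into multiplies
y3 = t*t*t   ! x**21 computed as (x**7)**3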

Quote:
Actually, I slightly lied. The **21 I mention above? It looks like it must be done via the multiply route. It turns out the compiler gives equal results whether I use **58 or the multiplies above, but **21 is compiled differently from the multiply method in the current code, leading to floating-point differences.
With **21 the compiler is calling the 'powf' routine, which is accurate to 5 ulps (not very accurate). I don't know the accuracy of '__powf', but I would assume it's worse.

- Mat
TheMatt



Joined: 06 Jul 2009
Posts: 322
Location: Greenbelt, MD

Posted: Wed Aug 18, 2010 6:11 am

mkcolg wrote:
Hi Matt,

Sorry, I have not looked at the performance of pow on a GPU, so I don't know which is better.

What I do know is that the compiler will generate code that uses the multiplication method for x**y where y <= 10. For values of y greater than 10, a 'powf(x,y)' call (or 'pow' if x is double precision) is generated. If using "-Mcuda=fastmath", 'powf' is replaced with the faster but less accurate '__powf' function.
So, for x**8, even the GPU code will use the multiply method? Or is that only for the CPU?
Quote:
Quote:
Actually, I slightly lied. The **21 I mention above? It looks like it must be done via the multiply route. It turns out the compiler gives equal results whether I use **58 or the multiplies above, but **21 is compiled differently from the multiply method in the current code, leading to floating-point differences.
With **21 the compiler is calling the 'powf' routine, which is accurate to 5 ulps (not very accurate). I don't know the accuracy of '__powf', but I would assume it's worse.

Huh. I guess I'm surprised, then, that **58 is the same either way (that is, x**58 gives the same answer as the multiply route on a CPU). Perhaps powf isn't actually being used there, or, with such a large exponent, the result is butting up against the single-precision limits near 0 or 1? Hmm.
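For what it's worth, here's the kind of throwaway CPU-side check I've been using to see the difference (the program name and test value are just mine):
Code:
program check_pow
  implicit none
  real :: x, a, b
  x = 1.1
  a = x**21                   ! whatever the compiler emits for **21
  b = (x**7)*(x**7)*(x**7)    ! hand-rolled multiply route
  print *, a, b, a - b        ! a nonzero difference means different code paths
end program check_pow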

Of course, all this will be answered if and when I can get the GPU code to compile. Though I'm guessing many a forum post and support email will occur before then!

Thanks,
Matt
mkcolg



Joined: 30 Jun 2004
Posts: 6208
Location: The Portland Group Inc.

Posted: Wed Aug 18, 2010 9:19 am

Hi Matt,

Quote:
So, for x**8, even the GPU code will use the multiply method? Or is that only for the CPU?
Both, actually.

Quote:
Of course, all this will be answered if and when I can get the GPU code to compile. Though I'm guessing many a forum post and support email will occur before then!

Thankfully we don't charge on a per-question basis, so keep 'em coming!

- Mat