PGI User Forum


accelerator parallelization issues
 
PGI User Forum Forum Index -> Accelerator Programming
mkcolg



Joined: 30 Jun 2004
Posts: 5871
Location: The Portland Group Inc.

PostPosted: Wed Apr 07, 2010 11:41 am

Hi Jerry,

There is some overhead in launching kernels. The exact amount per kernel call varies but is usually small (<100 µs). In your case, though, since the overall run time is very small and the number of kernel calls is large (15616), this overhead has a much greater impact.

Your next step is to try to reduce the number of kernel calls. Can you add the "ialf" loop back to the accelerator region?
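Something like this (a sketch only; the array names and bounds here are invented, not from your code):

```fortran
! Sketch only: names and bounds are made up.
! With the ialf loop inside the accelerator region, one kernel is
! launched for the whole nest rather than one kernel per ialf value.
!$acc region
      do ialf = 1, nalf
         do i = 1, n
            f(i,ialf) = a(i,ialf)*x(i) + b(i,ialf)
         end do
      end do
!$acc end region
```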

- Mat
Jerry Orosz



Joined: 02 Jan 2008
Posts: 16
Location: San Diego

PostPosted: Wed Apr 07, 2010 9:42 pm

Hi Mat,

mkcolg wrote:


There is some overhead in launching kernels. The exact amount per kernel call varies but is usually small (<100 µs). In your case, though, since the overall run time is very small and the number of kernel calls is large (15616), this overhead has a much greater impact.

Your next step is to try to reduce the number of kernel calls. Can you add the "ialf" loop back to the accelerator region?



When I move the ialf loop back into the region, the performance drops. Basically, the kernel time is always similar to the time the loop takes to run in serial mode. There are fewer kernel calls, but I think the operation count in the inner Newton-Raphson loop is always high.
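The structure looks roughly like this (a simplified sketch; the residual function here is invented, not my actual code):

```fortran
! Simplified sketch; the residual x**2 - c(i) stands in for the
! real function. Each outer iteration runs the inner Newton-Raphson
! loop to convergence sequentially, so the per-element operation
! count stays high no matter how the kernels are grouped.
!$acc region
      do i = 1, n
         x = x0(i)
         do iter = 1, maxit
            fx  = x*x - c(i)      ! residual f(x)
            dfx = 2.0d0*x         ! derivative f'(x)
            x   = x - fx/dfx      ! Newton-Raphson update
         end do
         root(i) = x
      end do
!$acc end region
```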

Although this subroutine was a poor candidate for parallelization, I learned a lot from trying. Other loops will require more function inlining, the conversion of 1-D arrays back into 2-D arrays, and so on, and as such will be more involved.

I don't know how hard this is to do, but it would be very convenient to be able to call subroutines inside accelerator regions (I suppose this is not that easy to do, given that it is currently not a feature). If this is not an option, it would be nearly as neat if the same easy compiler directives could be used to make code parallel on a multicore CPU.

Jerry
mkcolg



Joined: 30 Jun 2004
Posts: 5871
Location: The Portland Group Inc.

PostPosted: Thu Apr 08, 2010 11:07 am

Hi Jerry,

Quote:
I don't know how hard this is to do, but it would be very convenient to have the ability to call subroutines in the accelerator region
Calling subroutines isn't supported on GPUs since there is no linker. However, you can try using "-Minline" to have the compiler inline your subroutines.
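For example (a generic sketch, not your routine; "evalf" is a made-up name), a small routine like this can be inlined so the accelerator region ends up containing no calls:

```fortran
! Generic sketch: "evalf" is a made-up routine name.
! Compile with something like:
!   pgfortran -ta=nvidia -Minline=evalf main.f90
      subroutine evalf(x, y)
         real, intent(in)  :: x
         real, intent(out) :: y
         y = x*x + 1.0
      end subroutine evalf

! ... then in the caller, the call disappears after inlining:
!$acc region
      do i = 1, n
         call evalf(a(i), b(i))   ! inlined by -Minline
      end do
!$acc end region
```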
Quote:

If this is not an option, it would be nearly as neat if the same easy compiler directives could be used to make code parallel on a multicore CPU.
While the PGI Accelerator model currently only supports NVIDIA, our next target will be x86 multi-core systems.

- Mat
mkcolg



Joined: 30 Jun 2004
Posts: 5871
Location: The Portland Group Inc.

PostPosted: Fri Apr 09, 2010 11:15 am

Hi Jerry,

Can you try your code again with PGI 10.4? "-ta=nvidia,fastmath" now uses a less precise but much faster divide. Another code, WRF, sees a 3x speed-up of the accelerator computation. Given the number of divides in your code, it will most likely help your code as well. Check your answers, though.
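For a divide-heavy loop like the ones in your code (sketch only; the array names are invented), you can build it both ways and compare the output:

```fortran
! Sketch of a divide-heavy loop; array names are invented.
! Build twice and diff the results to check fastmath accuracy:
!   pgfortran -ta=nvidia          -o ref  src.f90
!   pgfortran -ta=nvidia,fastmath -o fast src.f90
!$acc region
      do i = 1, n
         r(i) = a(i)/b(i) + c(i)/d(i)
      end do
!$acc end region
```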

- Mat
Jerry Orosz



Joined: 02 Jan 2008
Posts: 16
Location: San Diego

PostPosted: Sun Apr 11, 2010 8:40 pm

Hi Mat,

mkcolg wrote:


Can you try your code again with PGI 10.4? "-ta=nvidia,fastmath" now uses a less precise but much faster divide. Another code, WRF, sees a 3x speed-up of the accelerator computation. Given the number of divides in your code, it will most likely help your code as well. Check your answers, though.



Thanks for the pointer. A factor of 3 is nothing to sneeze at. I am working with a mathematician on improved routines to interpolate values out of a large table and to perform numerical integration, so we will definitely pay attention to the accuracy. However, using 10.3, the fastmath option did not change the numerical results, so I am hopeful.

By the way, have you gotten your hands on a new NVIDIA Fermi card? I have my eye on the C2050, which looks like it has just been released. I would like to see some benchmarks comparing the current C1060 cards to the new ones using PGI on FORTRAN codes, both large and small, in single and double precision.

Cheers,

Jerry
Page 3 of 4

 


Powered by phpBB © phpBB Group