PGI User Forum


How to parallelize this loop...
PaulPa
Joined: 02 Aug 2012
Posts: 35

Posted: Mon Dec 17, 2012 1:00 pm

Hi Mat,

This kernel is part of a Conjugate Gradient (CG) method and gets called several times.
I'm only measuring the time for the CG method to complete (i.e. I call acc_init() before starting the timer). Moreover, a data region keeps the data on the device and avoids unnecessary transfers between host and device.
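The data-region pattern described here might look roughly like the following sketch. This is plain C with OpenACC directives; the CSR array names (rowptr/colind/val) and the loop structure are my assumptions for illustration, not Paul's actual code:

```c
/* Hypothetical sketch: one data region spans every iteration, so each
 * matvec reuses data already resident on the device instead of
 * re-transferring it. Array names are illustrative only. */
void run_matvecs(int n, int nnz, const int *rowptr, const int *colind,
                 const double *val, const double *x, double *y, int iters)
{
    #pragma acc data copyin(rowptr[0:n+1], colind[0:nnz], val[0:nnz], x[0:n]) \
                     copyout(y[0:n])
    {
        for (int it = 0; it < iters; ++it) {
            /* matvec: y = A*x, the kernel timed in the profile below */
            #pragma acc parallel loop
            for (int i = 0; i < n; ++i) {
                double sum = 0.0;
                for (int j = rowptr[i]; j < rowptr[i+1]; ++j)
                    sum += val[j] * x[colind[j]];
                y[i] = sum;
            }
            /* ... a real CG step would also do dot products, axpy
             * updates, and a convergence test here ... */
        }
    }
}
```

Without an OpenACC compiler the pragmas are ignored and the code runs serially, which makes the sketch easy to check on the host.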

Here is the profiling feedback for the OpenACC version:
Code:

  matvec
    53: region entered 1631 times
        time(us): total=1,533,762 init=109 region=1,533,653
                  kernels=1,508,950
        w/o init: total=1,533,653 max=1,060 min=937 avg=940
        53: kernel launched 1631 times
            grid: [16614]  block: [32]
            time(us): total=1,508,950 max=954 min=922 avg=925


Profiling the CUDA and OpenACC versions with NVIDIA's Visual Profiler shows that the CUDA implementation of the sparse matrix-vector multiplication (SpMV) takes roughly 1/3 of the time of its OpenACC counterpart.

An additional note: My CUDA version is more or less a trivial translation of the OpenACC directives/clauses to CUDA.
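For reference, the schedule both versions implement (one gang/threadblock per row, 32 vector lanes reducing over that row's nonzeros) could be expressed in OpenACC roughly like this; the CSR array names are assumptions, and a data region is assumed to be in effect outside the routine:

```c
/* Sketch of the matvec schedule described above: one gang per row
 * (grid [n]) and 32 vector lanes per gang (block [32]), matching the
 * grid:[16614] block:[32] launch in the profile. CSR names assumed. */
void spmv_csr(int n, const int *rowptr, const int *colind,
              const double *val, const double *x, double *y)
{
    #pragma acc parallel loop gang vector_length(32)
    for (int i = 0; i < n; ++i) {
        double sum = 0.0;
        /* 32-wide reduction over this row's nonzeros */
        #pragma acc loop vector reduction(+:sum)
        for (int j = rowptr[i]; j < rowptr[i+1]; ++j)
            sum += val[j] * x[colind[j]];
        y[i] = sum;
    }
}
```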

Best,
Paul
mkcolg
Joined: 30 Jun 2004
Posts: 6211
Location: The Portland Group Inc.

Posted: Mon Dec 17, 2012 4:48 pm

Thanks Paul, so the difference is just in the kernel. Things to look for are how memory is being accessed, how the inner loop reduction is being performed, and whether caching is being used (and whether it's being used effectively).

Some quick ideas would be to:

1) Disable auto-caching (-ta=nvidia,nocache).
2) Use a longer vector length. 32 may be too small; try 128 or 256.
3) Accelerate just the outer loop and don't perform the inner loop reduction.

My best guess would be that the overhead of the reduction code, especially with a small vector length, is hurting your performance.
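Idea (3) might amount to something like the following sketch (again with assumed CSR array names), where each thread handles a whole row sequentially so no reduction code is generated:

```c
/* Sketch of idea (3): parallelize only the outer loop and run the
 * per-row reduction sequentially, avoiding vector-reduction overhead.
 * CSR array names (rowptr/colind/val) are assumptions. */
void spmv_outer_only(int n, const int *rowptr, const int *colind,
                     const double *val, const double *x, double *y)
{
    #pragma acc parallel loop
    for (int i = 0; i < n; ++i) {
        double sum = 0.0;
        #pragma acc loop seq
        for (int j = rowptr[i]; j < rowptr[i+1]; ++j)
            sum += val[j] * x[colind[j]];
        y[i] = sum;
    }
}
```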

If none of these seem to help, it's probably best for you to send me both codes so I can determine the difference.

- Mat
PaulPa
Joined: 02 Aug 2012
Posts: 35

Posted: Mon Dec 17, 2012 5:00 pm

Hi Mat,

Thanks for your advice. I'll have a closer look at it tomorrow.

Some quick remarks:
(1) I had the same idea, because the Visual Profiler showed that OpenACC uses slightly more memory per block. However, disabling caching did not affect performance at all.

(2) The CUDA version uses exactly the same schedule (i.e. one threadblock for each row and 32 threads per threadblock). Increasing this value to, say, 64 decreases performance because most rows do not even have 64 non-zeros.

(3) I'll try that tomorrow, but it would differ from the CUDA schedule (i.e. it might result in poor performance).

Best,
Paul
PaulPa
Joined: 02 Aug 2012
Posts: 35

Posted: Tue Dec 18, 2012 12:37 pm

Hi again,

Here are some final remarks:

(1) The nocache option has no effect on the runtime.

(2) The compiler's default choice is 256, and its performance is far worse than the manual choice of 32; the same goes for 128.

(3) Only parallelizing the outer loop (i.e. #pragma acc loop seq for the inner loop, leaving the schedule of the outer loop to the compiler) results in poor performance as well.

I'm very curious why this is the case. Can you tell me how you would profile this so that I can give it a try?

In any case, if you want I can send you the source code.

UPDATE: I've prepared the .tar archive containing the source code and the input data (36 MB). How do you want me to send it, just via mail?

Best,
Paul
mkcolg
Joined: 30 Jun 2004
Posts: 6211
Location: The Portland Group Inc.

Posted: Tue Dec 18, 2012 3:45 pm

Hi Paul,
Quote:

I've prepared the .tar archive containing the source code and the input data (36 MB). How do you want me to send it, just via mail?
That's probably too big to send via email. Can you FTP it (https://www.pgroup.com/support/ftp_access.php)? Please let me know when it's there so I can bug our webmaster to get it for me.

Quote:
I'm very curious why this is the case. Can you tell me how you would profile this so that I can give it a try?
The next step would be to use the CUDA profiler and hardware counters to get a better idea of where the performance differences occur. I'll also compare the CUDA code we're generating with your hand-tuned CUDA.

Thanks,
Mat
Page 3 of 3

 