PGI User Forum

Limits on vector width for large loop

 
TheMatt



Joined: 06 Jul 2009
Posts: 317
Location: Greenbelt, MD

Posted: Tue Feb 09, 2010 10:26 am    Post subject: Limits on vector width for large loop

This is just a general question, but one I figure I should know the answer to. I have a very large piece of code that I'm using on the GPUs and I've been experimenting a bit with the scheduling of the big outer loop. I'm fairly certain that the best I can do is "parallel, vector(32)" on the outer loop. Being logical and all, I then tried "parallel, vector(64)" after seeing the success of 32 and I saw in the compiler output that this was reduced to just "parallel". I guess I'm wondering, what determines this "endpoint" of vector width? I assume it has to do with the physical limits of the accelerator card (registers available would be my guess), but I'd rather know from those in the know.
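For readers unfamiliar with the schedule clauses, here is a minimal sketch of what "parallel, vector(32)" requests. It assumes the usual mapping in which vector(width) sets the number of threads in a block and parallel spreads the blocks across the grid. It is plain C that emulates the schedule sequentially, not real accelerator code, and the names `blocks_needed` and `add_one_strip_mined` are hypothetical, not taken from the code under discussion.

```c
/* Number of thread blocks needed to cover n iterations when each block
   holds `width` threads, as a vector(width) clause would request. */
int blocks_needed(int n, int width)
{
    return (n + width - 1) / width;   /* ceiling division */
}

/* Sequential emulation of a strip-mined "parallel, vector(width)" loop.
   The block loop plays the role of the grid, the lane loop the role of
   the threads inside one block. Adds 1.0f to every element of x. */
void add_one_strip_mined(float *x, int n, int width)
{
    int nblocks = blocks_needed(n, width);
    for (int block = 0; block < nblocks; ++block) {
        for (int lane = 0; lane < width; ++lane) {
            int i = block * width + lane;
            if (i < n)                /* guard the partial last block */
                x[i] += 1.0f;
        }
    }
}
```

Under this assumed mapping, a wider vector clause means more threads per block, so each block needs more of the per-multiprocessor resources at once.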

Also, as an aside, whatever I do *inside* this loop never seems to matter and/or is never referred to by the compiler. That is, I can see this:
Code:
    593, Loop is parallelizable
    602, Loop is parallelizable
    656, Loop is parallelizable

on compiling. If I go into the code and add "!$acc do parallel" around those inner loops, I then see:
Code:
    594, Loop is parallelizable
    604, Loop is parallelizable
    658, Loop is parallelizable

but no explicit output saying it did parallelize it. Should I then assume that it just can't, and as such while it might be parallelizable, it's still run sequentially?

Matt
Michael Wolfe



Joined: 19 Jan 2010
Posts: 42

Posted: Wed Feb 10, 2010 10:23 am    Post subject: Limits on vector width for large loop

Hi Matt. I'm a little puzzled, but I'll try to explain what might be happening.
I'm guessing that the inner loops are not tightly nested in the big outer loop. The way the compiler works now is to follow the CUDA / OpenCL kernel model pretty closely, so only a tightly nested loop nest can be parallelized. If you have an outer loop with one or more inner loops, those inner loops can't be parallelized; each parallel loop or parallel loop nest has to be turned into a single kernel. We're looking at ways to extend the model, but there are serious limits on what the hardware can support.
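To make the distinction concrete, here is a minimal sketch of the two loop shapes. It is not Matt's code: the function names, the array sizes N and M, and the arithmetic are hypothetical. It uses the C spelling of the directives (`#pragma acc region`, `#pragma acc for`), the counterpart of the Fortran `!$acc do` form; a compiler without accelerator support should simply ignore the pragmas and run the loops on the host.

```c
#define N 4
#define M 3

/* Tightly nested: the body of the i loop is exactly one j loop,
   so the whole nest can be turned into a single kernel. */
void scale_tight(float out[N][M], float in[N][M], float k)
{
#pragma acc region
    {
#pragma acc for parallel
        for (int i = 0; i < N; ++i) {
#pragma acc for vector(32)
            for (int j = 0; j < M; ++j) {
                out[i][j] = k * in[i][j];
            }
        }
    }
}

/* Not tightly nested: the i loop holds a scalar statement and two
   separate j loops. Only the i loop is a candidate for the parallel
   schedule; each inner loop runs sequentially within one i iteration. */
void normalize_rows(float out[N][M], float in[N][M])
{
#pragma acc region
    {
#pragma acc for parallel, vector(32)
        for (int i = 0; i < N; ++i) {
            float sum = 0.0f;              /* statement between the loops */
            for (int j = 0; j < M; ++j)    /* inner loop 1 */
                sum += in[i][j];
            for (int j = 0; j < M; ++j)    /* inner loop 2 */
                out[i][j] = in[i][j] / sum;
        }
    }
}
```

In the second shape the compiler may still report the inner loops as parallelizable, since that message describes the dependence analysis rather than the schedule that was actually generated.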
Your primary question is about the vector(32) or vector(64). That's quite a bit more puzzling. I've tried to reproduce it with examples here, and was unable to do so. It shouldn't work that way, so there must be something wrong in the logic of the compiler. Your example program would really help here, but we'll keep trying to find the problem.
Thanks for the feedback.
TheMatt



Joined: 06 Jul 2009
Posts: 317
Location: Greenbelt, MD

Posted: Wed Feb 10, 2010 12:39 pm    Post subject: Limits on vector width for large loop

Dr Wolfe,

Thanks for the reply. I'm going to send a sample tarball that demonstrates this problem to trs@pgroup.com with a note to forward it on to you (I'm not sure I can make this code public yet). When you see the code, you'll see that I fused the original code into one big loop. I'm wondering whether that was the wrong choice for best use on accelerators. The loop fusion did lower the memory needs by a lot (it removed a dimension that can be of order 1000 or 10000 at times), but I can't really reloop and fuse the second dimension due to some unavoidable loop dependencies. (Well, unavoidable as far as I can tell, but I can usually only spot the simple, obvious ones that can be changed.)

It is also possible that the schedule I asked for (parallel, vector(32)) kills the math, as I don't seem to get very accurate results compared to the original. But "do parallel" alone leads to the same answers, and this code, at least, should be embarrassingly parallel across the outer loop.

As for the vector(32) and vector(64), this example should demonstrate it. At this point, I yield to your expertise!
TheMatt



Joined: 06 Jul 2009
Posts: 317
Location: Greenbelt, MD

Posted: Fri Feb 12, 2010 11:04 am    Post subject: Limits on vector width for large loop

Ah! Dr Wolfe, I might have an answer to your confusion about the vector width business. My previous attempts had all used PGI 10.1. However, with the snow abating here in DC, I was able to get 10.2 installed this morning. Upon doing so, I am now able to specify other widths for the vector statement and those are used by the compiler. (Even to the point of idiocy on my part: call to cuLaunchGrid returned error 701: Launch out of resources.)
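Error 701 is the CUDA driver's "launch out of resources" code, which typically means the launch asked for more threads than the kernel's register usage allows. Here is a rough sketch of that arithmetic in C. The `device_limits` struct and `block_fits` function are hypothetical, and the limits shown are those NVIDIA documents for compute capability 1.3; the card actually used in this thread is not stated.

```c
#include <stdbool.h>

/* Hypothetical per-device limits. The values below are the documented
   ones for compute capability 1.3 and are assumed, not taken from the
   thread. */
struct device_limits {
    int max_threads_per_block;
    int registers_per_multiprocessor;
};

static const struct device_limits CC13_LIMITS = { 512, 16384 };

/* Returns true if one thread block fits within the limits. This ignores
   register allocation granularity and shared memory, which real hardware
   also enforces. */
bool block_fits(struct device_limits lim, int threads_per_block,
                int regs_per_thread)
{
    if (threads_per_block > lim.max_threads_per_block)
        return false;
    return (long)threads_per_block * regs_per_thread
           <= lim.registers_per_multiprocessor;
}
```

With these assumed limits, 32 threads at 100 registers each fit (3,200 registers), while 256 threads at the same register count would need 25,600 and would not.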

This problem might be related to an issue I had off-forum with Mat, wherein "!$acc do kernel" did not work for me (in 10.1, I think). That was fixed in the development kernel at that time, and the same fix might have resolved this issue as well.
All times are GMT - 7 Hours

 