PGI User Forum


Six Loops iteration and reduction
 
herohzy



Joined: 14 May 2010
Posts: 7

Posted: Tue Mar 13, 2012 12:29 am    Post subject: Six Loops iteration and reduction

I'm a beginner with PVF. To accelerate my large program on the GPU, I wrote an analogous but shorter code as a test. It is essentially a three-level (or six-loop) nest, where each level iterates and computes a reduction that is used by the next outer level. I rewrote it like this:
Code:

     ffkk=0.0

     !$acc region
     do 30 ik=1,nm
     do 30 iky=1,nm
      
      ffqq=0.0
      do 201 ip=1,nm
      do 201 ipy=1,nm
         
         ffqq1=0.0
         do 10 iq=1,nm
         do 10 iqy=1,nm
            ffq=1.0/nm/nm
            ffqq1=ffqq1+ffq
            ffqq(ip,ipy)=ffqq1/2.0
10         continue
201      continue

      ffpp=0.0
      do 20 ip=1,nm
      do 20 ipy=1,nm
         ffp=ffqq(ip,ipy)/nm/nm
         ffpp=ffpp+ffp
20      continue

      ffk=ffpp/nm/nm
      ffkk=ffkk+ffk
30     continue
     !$acc end region


The result should be correct, but I don't think the parallelization is very good, judging from the many 'Loop is parallelizable' messages in the compiler feedback below:
Code:

prog:
     31, Generating copyout(ffqq(1:20,1:20))
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     32, Parallelization would require privatization of array 'ffqq(1:20,i3+1)'
     33, Parallelization would require privatization of array 'ffqq(1:20,i3+1)'
         Accelerator kernel generated
         32, !$acc do seq
         33, !$acc do seq
             CC 1.0 : 11 registers; 152 shared, 20 constant, 0 local memory bytes; 33% occupancy
             CC 2.0 : 20 registers; 128 shared, 48 constant, 0 local memory bytes; 16% occupancy
         56, Sum reduction generated for ffkk
     35, Loop is parallelizable
     36, Loop is parallelizable
     37, Loop is parallelizable
     40, Loop carried scalar dependence for 'ffqq1' at line 43
         Loop carried reuse of 'ffqq' prevents parallelization
     41, Loop carried scalar dependence for 'ffqq1' at line 43
         Loop carried reuse of 'ffqq' prevents parallelization
         Inner sequential loop scheduled on accelerator
     49, Loop is parallelizable
     50, Loop is parallelizable

I am looking forward to your helpful directives.
sslgamess



Joined: 23 Nov 2009
Posts: 35

Posted: Tue Mar 13, 2012 1:27 am

You have some issues that need to be addressed in order for these loops to actually be parallelized on the GPU:

40, Loop carried scalar dependence for 'ffqq1' at line 43
Loop carried reuse of 'ffqq' prevents parallelization
41, Loop carried scalar dependence for 'ffqq1' at line 43
Loop carried reuse of 'ffqq' prevents parallelization
Inner sequential loop scheduled on accelerator
mkcolg



Joined: 30 Jun 2004
Posts: 6218
Location: The Portland Group Inc.

Posted: Tue Mar 13, 2012 8:24 am

Hi herohzy,

Sarom is correct in that ffqq is preventing parallelization and your code is running sequentially on the device. While you know that ffqq is just a temporary array used to hold intermediate values, the compiler cannot assume this. To fix it, you need to use the "private" clause to tell the compiler that each thread has its own private copy of the array.

For example:
Code:

     ffkk=0.0

     !$acc region
!$acc do parallel, vector(16)
     do 30 ik=1,nm
!$acc do kernel, parallel, vector(16), private(ffqq)
     do 30 iky=1,nm
     
      ffqq=0.0
      do 201 ip=1,nm
      do 201 ipy=1,nm
         
         ffqq1=0.0
         do 10 iq=1,nm
         do 10 iqy=1,nm
            ffq=1.0/nm/nm
            ffqq1=ffqq1+ffq
            ffqq(ip,ipy)=ffqq1/2.0
10         continue
201      continue

      ffpp=0.0
      do 20 ip=1,nm
      do 20 ipy=1,nm
         ffp=ffqq(ip,ipy)/nm/nm
         ffpp=ffpp+ffp
20      continue

      ffk=ffpp/nm/nm
      ffkk=ffkk+ffk
30     continue
     !$acc end region


Note that I also added the "kernel" clause to tell the compiler to only parallelize the "30" loops.

- Mat
herohzy



Joined: 14 May 2010
Posts: 7

Posted: Tue Mar 13, 2012 11:31 pm

Hi Mat and Sarom,
Thank you both for the helpful reminders, and Mat for the detailed modification.

I rebuilt my program with your suggestion, but the statement
Code:

ffqq1=ffqq1+ffq

produced the warning
Code:

Vector was used where scalar was requested

So I modified my code like this:
Code:

     ffkk=0.0

     !$acc region
!$acc do parallel, vector(16)
     do 30 ik=1,nm
!$acc do kernel, parallel, vector(16),private(ffqq),private(ffq)
     do 30 iky=1,nm
     
      ffqq=0.0
      do 201 ip=1,nm
      do 201 ipy=1,nm
         
         ffq=0.0
         do 10 iq=1,nm
         do 10 iqy=1,nm
            ffq(iq,iqy)=1.0/nm/nm
            !ffqq1=ffqq1+ffq
            !ffqq(ip,ipy)=ffqq1/2.0
10         continue
         ffqq(ip,ipy)=sum(ffq)/2.0
201      continue

      ffpp=0.0
      do 20 ip=1,nm
      do 20 ipy=1,nm
         ffp=ffqq(ip,ipy)/nm/nm
         ffpp=ffpp+ffp
20      continue

      ffk=ffpp/nm/nm
      ffkk=ffkk+ffk
30     continue
     !$acc end region


But it still doesn't work very well:

Code:

prog:
     31, Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     33, Loop is parallelizable
     35, Loop is parallelizable
         Accelerator kernel generated
         33, !$acc do parallel, vector(16) ! blockidx%y threadidx%y
         35, !$acc do parallel, vector(16) ! blockidx%x threadidx%x
             CC 1.0 : 32 registers; 1120 shared, 48 constant, 0 local memory bytes; 33% occupancy
             CC 2.0 : 32 registers; 1032 shared, 108 constant, 0 local memory bytes; 66% occupancy
         61, Sum reduction generated for ffkk
     37, Loop is parallelizable
     38, Loop carried reuse of 'ffq' prevents parallelization
     39, Loop carried reuse of 'ffq' prevents parallelization
     41, Loop is parallelizable
     42, Loop is parallelizable
     43, Loop is parallelizable
     48, Loop is parallelizable
     54, Loop is parallelizable
     55, Loop is parallelizable

I'm confused by parallel, vector, private, local, and so on in the loop directives. I wonder whether the problem is that there are so many nested loops that the compiler fails to parallelize the code.
Thanks a lot for any more help!
mkcolg



Joined: 30 Jun 2004
Posts: 6218
Location: The Portland Group Inc.

Posted: Wed Mar 14, 2012 9:52 am

Quote:
Vector was used where scalar was requested
This looks like a neginfo message from host code generation, warning that vector instructions were used in a scalar context. It's not relevant to your GPU version. To see only the accelerator feedback messages, use "-Minfo=accel".
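For example, a compile line along these lines (the file name is hypothetical) limits the feedback to the accelerator messages:
Code:

pgfortran -ta=nvidia -Minfo=accel prog.f90
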

Quote:
But still it doesn't work very well
Doesn't work well how? Performance, validation?

Your -Minfo messages look fine. The outer two loops are being parallelized and your schedule (parallel, vector) is good. The occupancy is good at 66%: not great, but good. Finally, the compiler is recognizing the sum reduction. Note that you can ignore the "loop carried reuse" messages, since they only prevent the middle loops from parallelizing.

If you don't think you are getting good performance, what is the profile information (-ta=nvidia,time) telling you? Do you not have enough parallelism (see the number of grids and blocks)? Is data movement causing the issue? Is your device initialization time high?

Quote:
I'm confused by the parallel, vector, private, local and so on in the loops. I wonder whether the reason is the loops is so many that the compiler fails parallelizing the code.
Take a look at Michael Wolfe's Understanding the CUDA Data Parallel Threading Model: A Primer. It is very helpful for understanding what's going on on the device. "parallel" corresponds to a CUDA block, and "vector" corresponds to a CUDA thread. So in your case:
Quote:
33, !$acc do parallel, vector(16) ! blockidx%y threadidx%y
35, !$acc do parallel, vector(16) ! blockidx%x threadidx%x
This has the compiler creating a two-dimensional grid of two-dimensional 16x16 thread blocks. The number of blocks is set dynamically at runtime based on the loop bounds.
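For instance, with nm=20 (the bound shown in your earlier copyout message) and vector(16), each grid dimension needs ceiling(20/16) = 2 blocks, so the kernel launches a 2x2 grid of 16x16 thread blocks; the threads that fall outside the loop bounds simply do no work.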

"private" means that each thread will get it's own private copy of the variable. By default all scalars are private, while arrays are assumed to be shared by all threads. In your case, "ffq" is being used to hold intermediate values from the inner loop. If it were shared, all your threads would overwrite each other. Note that you could manually privatise it by adding two extra dimensions for the ik and iky loops. Note that private data is destroy at the end of the the compute region.

"local" means to allocate space for the variable but don't do any data movement. Unlike "private", the data is shared by all threads. So if you were to manually privatise "ffq", then you would want to make it "local" to avoid copying it back to the host.

Quote:
I wonder whether the reason is the loops is so many that the compiler fails parallelizing the code.
The PGI Accelerator Model currently supports parallelization only across tightly nested loops, so the dependency on "ffq" will prevent the middle two loops from parallelizing. One of the things we are looking at is how to extend the model to accommodate non-tightly nested loops, but this is still in its early stages.
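To illustrate what "tightly nested" means (the names here are just placeholders, not from your code):
Code:

!    Placeholder example only.
     integer, parameter :: n=16
     real :: a(n,n), b(n,n), c(n), s
     integer :: i, j

!    Tightly nested: nothing between the two do statements, so the
!    compiler can schedule the pair together.
     do i=1,n
        do j=1,n
           a(i,j)=0.0
        end do
     end do

!    Not tightly nested: the work between the do statements (and after
!    the inner loop) means only one loop level can be scheduled.
     do i=1,n
        s=0.0
        do j=1,n
           s=s+b(i,j)
        end do
        c(i)=s
     end do
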

- Mat