PGI User Forum

best strategy / HowTo for nested loops

 
PGI User Forum Forum Index -> Accelerator Programming
jockel



Joined: 15 Dec 2010
Posts: 5

Posted: Wed Jun 15, 2011 1:04 am    Post subject: best strategy / HowTo for nested loops

Hi,

we want to accelerate a CFD code with the PGI Accelerator model.
Unfortunately, I couldn't find a HowTo, nor could I extract a strategy for
our code from the various documents. Hopefully someone can help reduce
my confusion ;-).

So, here are the facts:

- we have loops over the 3 dimensions of our grid, I, J, K, each with a
length of about 4 to 100 (the loop lengths are not known at compile time)

- We use Fortran with the PGI Accelerator compiler (11.5)

- We have four C2050 cards


a code without dependencies on neighbouring cells would probably look like

Code:

      do K = 1, Kend
        do J = 1, Jend
          do I = 1, Iend
            rhoV2(I,J,K) = ( urho(I,J,K,MM)* urho(I,J,K,MM)
     &           +           vrho(I,J,K,MM)* vrho(I,J,K,MM)
     &           +           wrho(I,J,K,MM)* wrho(I,J,K,MM) )
     &           /         rho(I,J,K,MM)
          end do
        end do
      end do

      do is = 1, ns
        do K = 1, Kend
          do J = 1, Jend
            do I = 1, Iend
              bet1(I,J,K)
     &             = bet1(I,J,K)
     &             + xi(I,J,K,is,MM)*Rho(I,J,K,MM) * Cvtr1(is)
            end do
          end do
        end do
      end do


a code with dependencies on neighbouring cells would probably look like

Code:

      do is = 1, nh           
       do K = 2, Kre2(mm)-1
        do J = 2, Jre2(mm)-1
         do I = 2, Ire2(mm)-1

          dxdt(i,j,k,is) = dxdt(i,j,k,is)
     &      +  XI_flux_rho_l(I,J,K,MM)
     &      *  ( xi(i_lft(I,J,K,MM),J,K,is,MM)  )
     &      +  XI_flux_rho_r(I,J,K,MM)
     &      *  ( xi(i_rght(I,J,K,MM),J,K,is,MM)  )
     &      +  ET_flux_rho_l(I,J,K,MM)
     &      *  ( xi(I,j_lft(I,J,K,MM),K,is,MM)  )
     &      +  ET_flux_rho_r(I,J,K,MM)
     &      *  ( xi(I,j_rght(I,J,K,MM),K,is,MM)  )
     &      +  ZE_flux_rho_l(I,J,K,MM)
     &      *  ( xi(I,J,k_lft(I,J,K,MM),is,MM) )
     &      +  ZE_flux_rho_r(I,J,K,MM)
     &      *  ( xi(I,J,k_rght(I,J,K,MM),is,MM)  )

         enddo
        enddo
       enddo
      enddo


where i_lft etc. is either the index of the left neighbour or of the cell itself, depending on the fluxes.

Now, here are the questions:


1. Is there a tutorial / HowTo / ... that explains the best strategy for
parallelization with several nested loops?

2. does the first index of the array or the last have to correspond to the innermost loop?

3. what is a good strategy for parallelization of the I,J,K loops (the
computational grid) with respect to multiprocessors and thread processors?

4. is the "vector" length the length of the chunk of the (I)-loop that is computed sequentially on one core? And do Iend/length cores then run the I-loop in parallel?

5. The grid block size: does it make sense to use small grid blocks that
fit in the cache, or is it better to have big blocks in all 3 dimensions? Or
is it better to have long slices with I large compared to J and K, i.e. a long
vector?

6. Does it make sense to have as much code as possible within the innermost loop, or is it better to split the code into separate loops over I,J,K with only one or two statements in each innermost loop?

OK, I know those are a lot of questions, so: many thanks in advance for any hint!
jtull



Joined: 30 Jun 2004
Posts: 445

Posted: Thu Jun 16, 2011 4:02 pm    Post subject: Accelerator

jockel,

To the best of our knowledge, there is no tutorial or HowTo that explains the best strategy.

We have found that performance is usually better when the "I" loop is made the
"vector" loop. Give the thread blocks dimensions like (256,1,1), corresponding
to the "vector" loop. Then make the "J" and "K" loops the parallel loops that
form your grid dimensions.
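Applied to the first loop nest in your post, that scheme might look like the sketch below (PGI Accelerator model directives; the parallel and vector(256) clauses are only a suggestion that the compiler is free to override, and vector(256) assumes Iend is reasonably large):

Code:

!$acc region
!$acc do parallel
      do K = 1, Kend
!$acc do parallel
        do J = 1, Jend
!$acc do vector(256)
          do I = 1, Iend
            rhoV2(I,J,K) = ( urho(I,J,K,MM)* urho(I,J,K,MM)
     &           +           vrho(I,J,K,MM)* vrho(I,J,K,MM)
     &           +           wrho(I,J,K,MM)* wrho(I,J,K,MM) )
     &           /         rho(I,J,K,MM)
          end do
        end do
      end do
!$acc end region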

regards,
dave
mkcolg



Joined: 30 Jun 2004
Posts: 6208
Location: The Portland Group Inc.

Posted: Tue Jun 21, 2011 4:10 pm

Hi Jockel,

Quote:
1. Is there a tutorial / HowTo / ... that explains the best strategy for
parallelization with several nested loops?

What might be useful for you is to take a look at our paper and source code for the Himeno benchmark which you can find in the PGI Accelerator Files Page (http://www.pgroup.com/resources/accel_files/index.htm) near the bottom. Himeno is a 3D Poisson solver and has a similar structure to your code.

Quote:

2. does the first index of the array or the last have to correspond to the innermost loop?
Short answer: the first.

The first index should correspond to the "vector" loop in order to get stride-1 memory accesses across threads. Typically, the innermost loop does correspond to at least one dimension of the vector.
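As an illustration (hypothetical arrays a and b): Fortran is column-major, so a(I,J) and a(I+1,J) are adjacent in memory. When consecutive threads of the vector loop run consecutive I values, their loads are contiguous (coalesced); running the vector loop over the second index instead would make neighbouring threads touch addresses a whole column apart:

Code:

c     good: vector loop over the FIRST index -> stride-1 accesses
!$acc do vector
      do I = 1, N
        b(I,J) = 2.0 * a(I,J)
      end do

c     bad: vector loop over the SECOND index -> stride-N accesses
!$acc do vector
      do J = 1, N
        b(I,J) = 2.0 * a(I,J)
      end do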

Quote:
3. what is a good strategy for parallelization of the I,J,K loops (the
computational grid) with respect to Multiprozessors and thread processors?
I would first let the compiler determine the schedule and then make adjustments from there. Though, in my experience the compiler usually does find the optimal schedule, and I end up removing my directives. Most likely it will try to create a 16x16 vector (thread block) from the I and J loops and a 2-D parallel grid (of blocks) from K and J.
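One way to check what the compiler actually chose (flag spellings per the PGI compiler documentation) is to compile with -ta=nvidia -Minfo=accel and read the reported schedule. If you then want to pin a schedule down explicitly, the 16x16 scheme described above could be requested with loop directives roughly like this sketch (the "..." stands for your loop body):

Code:

!$acc region
!$acc do parallel
      do K = 1, Kend
!$acc do parallel, vector(16)
        do J = 1, Jend
!$acc do vector(16)
          do I = 1, Iend
             ...
          end do
        end do
      end do
!$acc end region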

Note that for the second loop I would move the "is" loop to the innermost position and run it sequentially inside your kernel. As it is now, it's not parallel: every "is" iteration accumulates into the same bet1(I,J,K). Something like:
Code:
     
      do K = 1, Kend
        do J = 1, Jend
!$acc do kernel
          do I = 1, Iend
            do is = 1, ns
              bet1(I,J,K)
     &             = bet1(I,J,K)
     &             + xi(I,J,K,is,MM)*Rho(I,J,K,MM) * Cvtr1(is)
            end do
          end do
        end do
      end do


Note that for the third example loop you do not want to move "is" to the innermost position: there dxdt carries "is" as a fourth dimension, so the "is" loop is itself parallel.

Quote:
4. is the "vector" length the length of the chunk of the (I)-loop that is computed sequentially on one core? And do Iend/length cores then run the I-loop in parallel?
No. On an NVIDIA GPU, "vector" corresponds to the number of threads in a thread block. I recommend reading this article on the CUDA Data Parallel Model (http://www.pgroup.com/lit/articles/insider/v2n1a5.htm).
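To make the mapping concrete (hypothetical numbers): with vector(256) on the I loop and Iend = 1000, each thread block holds 256 threads, one per I value, and (1000+255)/256 = 4 blocks are launched. Each thread executes the loop body once; nothing runs sequentially per core the way a CPU strip-mined vector loop would:

Code:

c     vector(256) with Iend = 1000:
c       threads per block : 256   (one thread per I value)
c       thread blocks     : (1000+255)/256 = 4
!$acc do vector(256)
      do I = 1, Iend
         ...
      end do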

Quote:
5. The grid block size: does it make sense to use small grid blocks that
fit in the cache, or is it better to have big blocks in all 3 dimensions? Or
is it better to have long slices with I large compared to J and K, i.e. a long
vector?
6. Does it make sense to have as much code as possible within the innermost loop, or is it better to split the code into separate loops over I,J,K with only one or two statements in each innermost loop?
I'm going to lump these two questions together. The question of large kernel versus small kernel is not easily answered since it is very dependent upon your algorithm.

Ideally, you want to achieve 100% "occupancy" (http://forums.nvidia.com/index.php?showtopic=31279), which is the ratio of active warps to the maximum number supported by your GPU. The limiting factors are the amount of shared memory available, the number of registers used, and the number of threads per block. Smaller kernels tend to use fewer resources, so you can increase the number of threads per block and with it the occupancy. However, smaller kernels also mean that you need to run more of them in order to cover your algorithm.

With larger kernels you tend to use more resources, so you cannot run as many threads per block, which reduces occupancy. However, doing more work per kernel means fewer kernel launches and less overhead.

Hope this helps,
Mat