PGI User Forum

Six Loops iteration and reduction
 
mkcolg



Joined: 30 Jun 2004
Posts: 6129
Location: The Portland Group Inc.

Posted: Mon Mar 19, 2012 10:27 am

Hi herohzy,

I took a look at your code but there weren't any accelerator directives in it so I was not able to reproduce your error. In looking at the loop that you want to accelerate, I see a number of issues.

First, unlike the sample code, your loops are triangular, i.e. the inner loop bound is determined by the outer loop index. The GPU uses a rectangular grid to launch kernels, so the compiler will have to translate your code to be rectangular and then put an if statement in to ignore the upper triangle. Something like:
Code:

      do 20 ikx=0,n2m
 !       do 20 iky=0,ikx
        do 20 iky=0,n2m
       if (iky .le. ikx) then
c---------------------
      ffqq1=0.d0
...


Also, "n2m" is fairly small, 40, so you don't have a lot of parallelism. Ideally, you want loop sizes in the thousands.
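
To make the triangular-to-rectangular translation described above concrete, here is a minimal host-side sketch. It is not the original code: the program name and the counters are hypothetical, and n2m=40 is simply the value mentioned in this post. It only checks that the masked rectangular loop visits the same (ikx,iky) pairs as the triangular one.

```fortran
! Sketch only: counters stand in for the real loop body.
program triangular_vs_rectangular
  implicit none
  integer, parameter :: n2m = 40
  integer :: ikx, iky
  integer :: count_tri, count_rect

  ! Original triangular form: the inner bound depends on the outer index.
  count_tri = 0
  do ikx = 0, n2m
    do iky = 0, ikx
      count_tri = count_tri + 1
    end do
  end do

  ! Rectangular form: full iky range, with an if test masking the upper triangle.
  count_rect = 0
  do ikx = 0, n2m
    do iky = 0, n2m
      if (iky .le. ikx) then
        count_rect = count_rect + 1
      end if
    end do
  end do

  ! Both forms must do the same amount of useful work.
  if (count_tri /= count_rect) error stop 'forms disagree'
  print *, 'useful iterations  :', count_rect
  print *, 'launched iterations:', (n2m + 1)**2
end program triangular_vs_rectangular
```

With n2m=40 only about half of the launched iterations do useful work, which is the cost of mapping a triangular loop onto a rectangular grid.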

While I've only looked at the code for a few minutes, my initial assessment is that it's better left as an OpenMP code rather than accelerating it. Not that it can't be put on a GPU, but in its current form only the outermost loop is parallelizable, so you don't have enough parallelism to see a speed-up.

Having said that, it may be possible to rearrange the algorithm to make it better suited for a GPU. Though I will leave that to you.

- Mat
herohzy



Joined: 14 May 2010
Posts: 7

Posted: Thu Mar 22, 2012 10:25 am

Hello, Mat,
Thanks a lot for your kind help! As discussed in the previous several posts, the code runs smoothly after allocating ffq. But there's still a problem: when I set nm=50 or greater, it fails with errors. Is this related to memory?

Moreover, I have sent you, again forwarded through PGI Customer Service, some programs, including the acc program I sent last time and a CUDA Fortran program I wrote over the last few days. The problems are described in the e-mail.
mkcolg



Joined: 30 Jun 2004
Posts: 6129
Location: The Portland Group Inc.

Posted: Thu Mar 22, 2012 1:55 pm

Hi herohzy,

When I increase nm to 50, the code ran without error on my system, albeit very slowly. Though, my device has 6GB of memory, so yes it does appear that you simply don't have enough memory to run this program. How much memory does your device have? (See the output from the 'pgaccelinfo' utility).

Code:
c$acc region
c$acc do parallel,vector(16)
      do 19 ikx=0,n2m
c$acc do parallel,vector(16),private(ffpp),private(ffp)
        do 19 iky=0,n2m


One thing to keep in mind: when you privatize an array, every thread gets its own copy. In the above code, the two private arrays are 401x401. At a minimum, there will be n2m times n2m threads (there will most likely be more since there are 16x16 threads in a block). So when nm=40, n2m=20. This means that 512 threads will be created (2 blocks of 16x16). When nm=50, 768 threads are created (3 blocks).

How much memory do the two private arrays require?

512 threads * 401 * 401 * 8 * 2 = 1.2 GB
768 threads * 401 * 401 * 8 * 2 = 1.8 GB

I'm guessing you only have 2GB of memory?
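
The arithmetic above can be reproduced with a short program. This is only a sketch: the thread counts (512 and 768), the 401x401 extents, the 8-byte element size and the two arrays are taken from this post, while the program and variable names are hypothetical.

```fortran
program private_array_memory
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
  integer(int64), parameter :: dim1 = 401, dim2 = 401  ! extent of each private array
  integer(int64), parameter :: bytes_per_elem = 8      ! double precision
  integer(int64), parameter :: n_arrays = 2            ! ffpp and ffp
  integer(int64) :: threads(2), bytes
  integer :: i

  threads = [512_int64, 768_int64]   ! thread counts quoted in the post
  do i = 1, 2
    ! Every thread holds its own copy of every private array.
    bytes = threads(i) * dim1 * dim2 * bytes_per_elem * n_arrays
    print '(i5,a,i12,a,f5.2,a)', threads(i), ' threads: ', bytes, &
          ' bytes = ', real(bytes) / 2.0**30, ' GiB'
  end do
end program private_array_memory
```

The totals are 1,317,281,792 and 1,975,922,688 bytes, i.e. roughly 1.2 and 1.8 when expressed in units of 2**30 bytes, which is how the figures above are rounded.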

You can only fix this by reducing the number of threads or decreasing the size of the private arrays. However, your code uses very few threads as it is, so reducing the number of threads will only hurt your performance even more.

It looks to me that the size of ffpp and ffp is just some fixed max value and only an n2m by n2m section is actually used. Can you update your code to dynamically allocate your arrays so they use just the necessary amount of memory?
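
As an illustration of the dynamic allocation suggested above, the following sketch sizes the two arrays from n2m at run time instead of using a fixed maximum. The 0:n2m bounds, the value n2m=20 and the name nmax are assumptions for this example, and whether allocatable arrays can be privatized this way depends on the compiler version.

```fortran
program allocate_sketch
  implicit none
  integer, parameter :: nmax = 400          ! fixed maximum, giving 401x401 arrays
  integer :: n2m
  double precision, allocatable :: ffpp(:,:), ffp(:,:)

  n2m = 20                                  ! known only at run time in real code
  ! Allocate just the section that is used; bounds 0:n2m are an assumption.
  allocate(ffpp(0:n2m,0:n2m), ffp(0:n2m,0:n2m))
  ffpp = 0.d0
  ffp  = 0.d0

  print *, 'bytes per array, fixed size:', (nmax + 1)**2 * 8
  print *, 'bytes per array, allocated :', size(ffpp) * 8

  deallocate(ffpp, ffp)
end program allocate_sketch
```

Since each thread gets its own copy of a private array, the saving per array is multiplied by the number of threads.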

Here's another section of code with issues.
Code:
c$acc region
!c$acc do private(ffkk,ffk)
      do 20 ikx=0,n2m
        do 20 iky=0,ikx
c---------------------------------
      if(wk(ikx,iky).gt.0.d0)  then

... cut

c--------------------------
      do 75 i=1,13
        ffkk(i)=ffkk(i)+ffk(i)*aa
75    continue
c--------------------------
20    continue
c$acc end region

      do 80 i=1,13
        ffkk(i)=2.d0*ffkk(i)*4.d0/nm2
80    continue


Now this code won't accelerate either, due to the loop dependency on ffkk and ffk. It looks like you thought about privatizing them but commented it out. Here, you don't want to privatize ffkk since its values are used outside of the compute region, and the contents of private arrays are destroyed at the end of the compute region.

What you'd really like here is for the compiler to recognize the array reductions. Unfortunately the OpenACC standard and the PGI Accelerator Model only support scalar reductions. So one fix might be to change ffkk to be 13 scalars instead of an array.
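
A minimal sketch of the scalar workaround, using three accumulators instead of thirteen to keep it short. The names s1..s3 and ffk1..ffk3 and the loop body are hypothetical stand-ins, and the directive is written in OpenACC syntax as an assumption; a compiler without OpenACC support treats it as a comment and runs the loop serially.

```fortran
program scalar_reduction_sketch
  implicit none
  integer, parameter :: n = 1000
  integer :: k
  double precision :: ffk1, ffk2, ffk3   ! per-iteration terms (stand-ins for ffk(1:3))
  double precision :: s1, s2, s3         ! scalar accumulators (stand-ins for ffkk(1:3))
  double precision :: ffkk(3)

  s1 = 0.d0
  s2 = 0.d0
  s3 = 0.d0
!$acc parallel loop reduction(+:s1,s2,s3) private(ffk1,ffk2,ffk3)
  do k = 1, n
    ffk1 = 1.d0
    ffk2 = dble(k)
    ffk3 = 2.d0*dble(k)
    s1 = s1 + ffk1
    s2 = s2 + ffk2
    s3 = s3 + ffk3
  end do

  ! Copy the scalars back into the array after the compute region.
  ffkk = [s1, s2, s3]
  if (abs(ffkk(2) - 500500.d0) > 1.d-6) error stop 'unexpected sum'
  print *, ffkk
end program scalar_reduction_sketch
```

The array ffkk is only touched outside the loop, so the loop itself carries nothing but scalar sum reductions.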


I'm also thinking about your performance. The directives are designed to work on tightly nested loops. So when you have the loop structure of:

Code:

      do 19 ikx=0,n2m
        do 19 iky=0,n2m

          do 30 iqx=0,n2m
             do 30 iqy=0,iqx

                 do 40 ipx=0,n2m
                   do 40 ipy=0,ipx

      40 continue
      30 continue
               
           do 31 iqx=0,n2m
             do 31 iqy=0,iqx

       31      continue

     ... scalar code

    19 continue



Only the outer two loops can be accelerated. Since n2m is small, you won't get much parallelism and will therefore see poor performance. However, you may be able to rearrange your code to something like:
Code:

      do 19 ikx=0,n2m
        do 19 iky=0,n2m
          do 30 iqx=0,n2m
             do 30 iqy=0,iqx
                 do 40 ipx=0,n2m
                   do 40 ipy=0,ipx
      40 continue
      30 continue
      19 continue               

      do 29 ikx=0,n2m
        do 29 iky=0,n2m
           do 31 iqx=0,n2m
             do 31 iqy=0,iqx
       31 continue
       29 continue

      do 39 ikx=0,n2m
        do 39 iky=0,n2m

     ... scalar code

      39 continue

Of course, this means adding extra dimensions to your arrays so that dependent information can be passed between the sets of loops. But if it works, then you'll have three arrays, a 6-D, a 4-D, and a 2-D, that expose a lot more parallelism. Granted, I haven't modified your code to see if it will actually work, but it might be worth trying.
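
The idea of passing information between separate loop nests through an array with extra dimensions can be sketched on a much smaller case. Everything here is hypothetical (the names t, tq, fused, split and the arithmetic); the real code would need the 6-D and 4-D arrays mentioned above rather than this single 2-D temporary.

```fortran
program loop_fission_sketch
  implicit none
  integer, parameter :: n = 8
  integer :: i, j, q
  double precision :: t               ! scalar temporary in the fused version
  double precision :: tq(0:n,0:n)     ! the same temporary promoted to an array
  double precision :: fused(0:n,0:n), split(0:n,0:n)

  ! Fused form: an inner loop followed by scalar code inside the outer loops.
  do i = 0, n
    do j = 0, n
      t = 0.d0
      do q = 0, n
        t = t + dble(i*q + j)
      end do
      fused(i,j) = 2.d0*t + 1.d0      ! "scalar code"
    end do
  end do

  ! Split form, nest 1: compute the temporary and store it per (i,j).
  do i = 0, n
    do j = 0, n
      tq(i,j) = 0.d0
      do q = 0, n
        tq(i,j) = tq(i,j) + dble(i*q + j)
      end do
    end do
  end do

  ! Split form, nest 2: the scalar code reads the stored temporary.
  do i = 0, n
    do j = 0, n
      split(i,j) = 2.d0*tq(i,j) + 1.d0
    end do
  end do

  if (maxval(abs(fused - split)) > 0.d0) error stop 'results differ'
  print *, 'fused and split forms agree'
end program loop_fission_sketch
```

The trade-off is memory: every scalar that crosses a loop-nest boundary becomes an array with one dimension per enclosing loop.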

I'll take a look at your CUDA Fortran code next.

- Mat
mkcolg



Joined: 30 Jun 2004
Posts: 6129
Location: The Portland Group Inc.

Posted: Thu Mar 22, 2012 2:42 pm

Hi herohzy,

For your first issue, you said that the program errors when you access several of your argument arrays from a kernel, for example: "wkq=wk(iqx,iqy)". One thing that would help you is to compile the code with "-Mcuda=emu -g" and run it through the PGI debugger, pgdbg. What is the value of iqx? For the first thread, it's zero, so your thread is accessing memory beyond the bounds of the arrays, causing the kernel to abort.

For the second error you said you had a problem if you uncommented an if statement. I'm assuming you mean the following:

Code:
        if(vl1k.GT.0.0)then
        gaprk1=-vl2k/sqrt(vl1k) !2*gaprk
        else
        gaprk1=0.0
        endif
        gapr=gapr+gaprk1*cc


Now gapr is a scalar dummy argument. I'm assuming that you mean for this to perform a sum reduction across all threads? Unfortunately, this won't work. Instead you need to add in a parallel sum reduction. I show an example of this in my article on Multi-GPU programming http://www.pgroup.com/lit/articles/insider/v3n3a2.htm.
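
The shape of a parallel sum reduction can be emulated on the host in plain Fortran. This is not the code from the article linked above and not a CUDA Fortran kernel: the array partial, the count nthreads (assumed to be a power of two) and the stand-in contribution are all hypothetical. Each "thread" writes to its own slot instead of updating gapr, and the slots are then combined pairwise.

```fortran
program tree_reduction_sketch
  implicit none
  integer, parameter :: nthreads = 256      ! assumed power of two
  double precision :: partial(nthreads)
  double precision :: gapr
  integer :: tid, stride

  ! Step 1: each "thread" stores its own contribution instead of updating gapr.
  do tid = 1, nthreads
    partial(tid) = dble(tid)                ! stand-in for gaprk1*cc
  end do

  ! Step 2: pairwise (tree) sum; on a GPU each pass would end with a barrier.
  stride = nthreads/2
  do while (stride >= 1)
    do tid = 1, stride
      partial(tid) = partial(tid) + partial(tid + stride)
    end do
    stride = stride/2
  end do
  gapr = partial(1)

  if (abs(gapr - dble(nthreads*(nthreads + 1)/2)) > 1.d-9) error stop 'bad sum'
  print *, 'gapr =', gapr
end program tree_reduction_sketch
```

In a real kernel the partial array would typically live in shared memory, with one partial result per block combined in a second step.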

- Mat
herohzy



Joined: 14 May 2010
Posts: 7

Posted: Mon Mar 26, 2012 7:52 pm

Hi, Mat,
Yes, for my CUDA program, when I compiled the code with "-Mcuda=emu -g", it ran normally, including accessing the arguments wk(:,:), bw(:,:), snb(:,:), etc., and even gave a good-looking result (not correct yet). But how can I get it to run normally on the GPU? Do I need to upgrade my GPU? (I'm currently using a GTS 450.)
Thanks a million.
All times are GMT - 7 Hours