PGI User Forum


accelerator parallelization issues
 
PGI User Forum Forum Index -> Accelerator Programming
mkcolg



Joined: 30 Jun 2004
Posts: 6125
Location: The Portland Group Inc.

Posted: Mon Apr 05, 2010 4:11 pm

Hi Jerry,

The "ibet" loop isn't parallelizable due to 'R', so the compiler is scheduling it sequentially on the accelerator. It's essentially the same as when you used the "kernel" directive before.

One thing to try is "-ta=nvidia,fastmath", which will use less precise but much faster math intrinsics. Also, in 10.4 (due out later this week), "fastmath" will use a less precise divide. Given that your code uses many divides, this could have a large performance impact.
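For reference, the option just goes on the compile line with the other "-ta" sub-options. A sketch (the source and output file names here are illustrative; "-Minfo=accel" just prints the accelerator messages you've been seeing):

```shell
pgfortran -fast -ta=nvidia,fastmath -Minfo=accel mycode.for -o mycode
```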

- Mat
Jerry Orosz



Joined: 02 Jan 2008
Posts: 20
Location: San Diego

Posted: Mon Apr 05, 2010 5:55 pm

Hi Mat,

mkcolg wrote:

The "ibet" loop isn't parallelizable due to 'R', so the compiler is scheduling it sequentially on the accelerator. It's essentially the same as when you used the "kernel" directive before.

I am having trouble understanding the restriction. The grid elements on the star are all independent. I took the same code and commented out the "do irad" loop's opening and closing statements, so everything from "x=r*cox" down to "r=rnew" is done only once. The compiler tells me this:

Code:

       19528, Generating compute capability 1.3 kernel
       19530, Loop is parallelizable
              Accelerator kernel generated
           19530, !$acc do parallel, vector(256)
           19563, Sum reduction generated for vol
       19541, Loop is parallelizable


Does this mean that both of the ialf and ibet loops are now parallel? If so, then in principle I should be able to cut and paste the code inside the irad loop and hardwire the loop manually. I just tried this with the code repeated twice and with the code repeated three times and I got the same compiler message.

Rather than cut and paste 160 times, is there a way to tell the compiler to "unroll" that loop specifically, and to not worry about the scalar dependence on r? Most cases need only 4 or 5 trips through the loop, but there are cases that need a lot more.

Quote:

One thing to try is "-ta=nvidia,fastmath", which will use less precise but much faster math intrinsics. Also, in 10.4 (due out later this week), "fastmath" will use a less precise divide. Given that your code uses many divides, this could have a large performance impact.

- Mat


The fastmath option made a marginal difference, without any change in the output files. This is an option to keep in mind when tuning things. I look forward to the updated release.

This forum has been a big help. Thanks for running it.

Thanks,

Jerry
mkcolg



Joined: 30 Jun 2004
Posts: 6125
Location: The Portland Group Inc.

Posted: Tue Apr 06, 2010 3:34 pm

Hi Jerry,

"R" is initialized in the outermost loop (ialf) and can change in the innermost loop (irad). Hence, the starting value of "R" for each iteration of the middle loop (ibet) will depend upon the previous iteration. This dependency prevents the ibet loop from being parallelized.

You can fix this by initializing R inside the middle loop, but your answers could be different.
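To make the dependence concrete, here is a hypothetical sketch of the loop nest. The loop and variable names (ialf, ibet, irad, r, rnew) are the ones from this thread; the bounds, the rstart array, and everything else are made up for illustration, and the enclosing "!$acc region" is omitted:

```fortran
! Original form: r is set once per ialf and updated inside irad,
! so each ibet iteration starts from the r left behind by the
! previous one.  That loop-carried dependence on r is what blocks
! parallelization of the ibet loop.
      do ialf = 1, nalf
         r = rstart(ialf)
         do ibet = 1, nbet           ! NOT parallelizable: r carried over
            do irad = 1, maxtrips    ! Newton-Raphson iteration
!              ... compute rnew from r ...
               r = rnew
            end do
         end do
      end do

! With r re-initialized inside ibet, every ibet iteration is
! independent and the loop can be parallelized -- at the cost of
! more Newton-Raphson trips per pixel, and possibly slightly
! different answers.
      do ialf = 1, nalf
!$acc do parallel
         do ibet = 1, nbet           ! now parallelizable
            r = rstart(ialf)
            do irad = 1, maxtrips
!              ... compute rnew from r ...
               r = rnew
            end do
         end do
      end do
```

The second form does redundant work per pixel, which is the trade-off you would be measuring.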

Quote:
Does this mean that both of the ialf and ibet loops are now parallel?
It appears so.

Quote:
Rather than cut and paste 160 times, is there a way to tell the compiler to "unroll" that loop specifically,
Not yet, but soon. We're in the process of implementing an "unroll" clause for the "!$acc do" directive.

However, unrolling a loop 160 times may cause other problems, such as using too many registers. Instead, try moving the initialization of R inside the ibet loop and check that you're still getting correct answers.
Quote:

This forum has been a big help. Thanks for running it.
You're welcome.

- Mat
Jerry Orosz



Joined: 02 Jan 2008
Posts: 20
Location: San Diego

Posted: Tue Apr 06, 2010 10:55 pm

Hi Mat,

I think I have everything sorted out. I went ahead and cut and pasted the code inside the irad loop about 100 times (it's not as if I have to put a quarter in the slot for each paste). The compiler seemed happy and made both loops parallel.

The problem is that this subroutine turns out not to be a good candidate for parallelization after all. In serial mode, it is reasonably efficient: at each ialf, the initial operation count to get the radius might be a few dozen trips through the Newton-Raphson loop, but once this is done, each iteration in the ibet loop needs only a few trips, since the radius does not change much from one pixel to the next. When both loops are parallel, the operation count in the Newton-Raphson loop can be quite large for all pixels, since a global initial guess is used.

Next, I tried to make only the ibet loop parallel. At each ialf, I find the initial r and use it as the initial guess inside the ibet loop. This actually works. However, seeing a savings on the wall clock requires a very large value of Nb. For example, if Nb=960, the parallel kernel finishes in about 4.5 seconds, while in serial mode it takes 25 seconds for all of the calls to the subroutine to complete. When Nb=3200, the times are 7.8 and 84 seconds, respectively. Finally, running it into the ground, Nb=9600 gives times of 16 and 259 seconds, respectively.
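For what it's worth, the ratios of those quoted times show the speedup growing with the grid size. A small throwaway program to compute them (the numbers are simply the ones quoted above):

```fortran
! Speedup implied by the wall-clock times quoted above (seconds).
      program speedup
      implicit none
      real :: gpu(3), cpu(3)
      integer :: i
      data gpu / 4.5, 7.8, 16.0 /      ! parallel kernel time
      data cpu / 25.0, 84.0, 259.0 /   ! serial time, all calls
      do i = 1, 3
         print *, cpu(i) / gpu(i)      ! roughly 5.6, 10.8, 16.2
      end do
      end program speedup
```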

In normal use, a value of Nb=48 is sufficient. The time for the parallel kernel is 3.64 seconds (2.8 seconds of accelerator kernel time), while in serial mode the subroutine takes 1.1 seconds. By playing around with the

!$acc do parallel, vector()

directive, I can get the kernel time down to about 1.5 seconds. However, the total region time always seems to be stuck at about 2.5 seconds. The initialization of the card takes 0.07 seconds, and about 0.7 seconds is used to move data between the host and the card.

For large Nb, the host CPU time is 10 or more times longer than the GPU kernel time. Why is this not the case for small Nb? Why won't the region time go down to about 0.8 seconds (the initialization time and the data moving time)?

Thanks,

Jerry
Jerry Orosz



Joined: 02 Jan 2008
Posts: 20
Location: San Diego

Posted: Wed Apr 07, 2010 9:52 am

As a postscript, here is the output the program gives after running (I compiled using -ta=nvidia,time):

For Nb=48:

Code:

Accelerator Kernel Timing data
/home/orosz/lightcurve/./lcsubs.for
  setupgeo
    5382: region entered 2 times
        time(us): init=0
/home/orosz/lightcurve/./lcsubs.for
  findradius
    19545: region entered 15616 times
        time(us): total=3564618 init=50706 region=3513912
                  kernels=2779746 data=734166
        w/o init: total=3513912 max=9789 min=126 avg=225
        19547: kernel launched 15616 times
            grid: [1]  block: [192]
            time(us): total=2540559 max=9712 min=63 avg=162
        21715: kernel launched 15616 times
            grid: [1]  block: [256]
            time(us): total=239187 max=70 min=14 avg=15


For Nb=240:

Code:

Accelerator Kernel Timing data
/home/orosz/lightcurve/./lcsubs.for
  setupgeo
    5382: region entered 2 times
        time(us): init=0
/home/orosz/lightcurve/./lcsubs.for
  findradius
    19545: region entered 15616 times
        time(us): total=4028584 init=47359 region=3981225
                  kernels=3265664 data=715561
        w/o init: total=3981225 max=1429 min=136 avg=254
        19547: kernel launched 15616 times
            grid: [4]  block: [256]
            time(us): total=3010606 max=1367 min=75 avg=192
        21715: kernel launched 15616 times
            grid: [1]  block: [256]
            time(us): total=255058 max=107 min=15 avg=16


Nb=480:

Code:

Accelerator Kernel Timing data
/home/orosz/lightcurve/./lcsubs.for
  setupgeo
    5382: region entered 2 times
        time(us): init=0
/home/orosz/lightcurve/./lcsubs.for
  findradius
    19545: region entered 15616 times
        time(us): total=4311237 init=53790 region=4257447
                  kernels=3519184 data=738263
        w/o init: total=4257447 max=1768 min=138 avg=272
        19547: kernel launched 15616 times
            grid: [8]  block: [256]
            time(us): total=3264275 max=1364 min=76 avg=209
        21715: kernel launched 15616 times
            grid: [1]  block: [256]
            time(us): total=254909 max=312 min=15 avg=16


In round numbers, it appears always to take about 0.7 seconds to move data, and about 0.3 seconds to do the parallel sum. For small Nb, the actual kernel compute time seems longer than it should be, based on the performance for large Nb. There seems to be some "start-up" cost that is not accounted for in the above numbers.
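One way to read these numbers: because the region is entered 15616 times, any fixed per-entry cost multiplies up, regardless of Nb. A small check, with the values copied from the Nb=48 profile above (the interpretation in the comments is mine):

```fortran
! Rough breakdown of the Nb=48 profile above (times in microseconds).
      program overhead
      implicit none
      integer :: entries, data_us, sum_us, main_min
      entries  = 15616        ! times the findradius region was entered
      data_us  = 734166       ! total host<->device transfer time
      sum_us   = 239187       ! total time in the reduction kernel
      main_min = 63           ! fastest single launch of the main kernel
! ~47 us of data movement per region entry -> ~0.73 s total.
      print *, 'data us/entry:   ', data_us / entries
! Even if every launch hit its minimum time, 15616 launches of the
! main kernel would cost ~0.98 s -- a fixed per-entry cost that does
! not shrink with Nb, which looks like the unaccounted "start-up" cost.
      print *, 'launch floor (s):', real(main_min * entries) / 1.0e6
      end program overhead
```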

Jerry
Page 2 of 4

 


Powered by phpBB © phpBB Group