PGI User Forum

accelerator parallelization issues

Jerry Orosz



Joined: 02 Jan 2008
Posts: 20
Location: San Diego

Posted: Mon Mar 22, 2010 8:07 am    Post subject: accelerator parallelization issues

I am new to parallel programming. I have a legacy FORTRAN-77
code (about 15,000 lines) that models eclipsing binary stars.
The stars are divided up into rectangular grids in "longitude"
and "latitude", and much of the computing time is spent
looping over the grid(s) and computing various things. Since the
GPUs don't support subroutine and function calls, I have started
manually inlining the subroutine calls.

This section of the code computes various physical quantities for
each pixel on the star, and saves the results into several arrays.
Each pixel is independent of every other pixel, so I would think
loops like these should be able to run in parallel. At
each pixel, there is a Newton-Raphson iteration to find the radius
vector.

!$acc region
      do 1104 ialf=1,Nalph
        r=0.0000001d0
        theta=-0.5d0*dtheta+dtheta*dble(ialf)
        dphi=twopie/dble(4*Nbet)    !dble(ibetlim(ialf))
        snth=dsin(theta)
        snth3=dsin(theta)/3.0d0
        cnth=dcos(theta)
        DO 1105 ibet=1, 4*Nbet
          iidx=mmdx(ialf,ibet)
          phi=-0.5d0*dphi+dphi*dble(ibet)
          phi=phi+phistart(ialf)
          cox=dcos(phi)*snth
          coy=dsin(phi)*snth
          coz=cnth
c begin in-line subroutine rad
          do irad=1,190             !Newton-Raphson iteration
            x=r*cox
            y=r*coy
            z=r*coz
            if(itide.lt.2)then      !in-line subroutine spherepot
              t1=(bdist*bdist-2.0d0*cox*r*bdist+r*r)
              t2=0.5d0*omega*omega*(1.0d0+overQ)*(1.0d0-coz*coz)
              psi=1.0d0/r+overQ*(1.0d0/dsqrt(t1)-cox*r/(bdist*bdist))+r*r*t2
              dpsidr=-1.0d0/(r*r)+overQ*((dsqrt(t1)**3)*(cox*bdist-r)
     %          -cox/(bdist*bdist))+t2*2.0d0*r
            endif                   !end spherepot
            rnew=r-(psi-psi0)/dpsidr
            dr=dabs(rnew-r)
            if(dabs(dr).lt.1.0d-19)go to 4115
            r=rnew
          enddo
 4115     continue
          if(itide.lt.2)then        !in-line subroutine poten
            RST = DSQRT(X**2 + Y**2 + Z**2)
            RX = DSQRT((X-bdist)**2 + Y**2 + Z**2)
            A = ((1.0d0+overQ)/2.0d0) * OMEGA**2
            RST3 = RST*RST*RST
            RX3 = RX*RX*RX
            PSI = 1.0d0/RST + overQ/RX - overQ*X/bdist/bdist
     &        + A*(X**2 + Y**2)
            PSIY = -Y/RST3 - overQ*Y/RX3 + 2.0d0*A*Y
            PSIZ = -Z/RST3 - overQ*Z/RX3
            PSIX = -X/RST3 - overQ*(X-bdist)/RX3 -overQ/bdist/bdist
     &        + 2.0d0*A*X
            RST5 = RST3*RST*RST
            RX5 = RX3*RX*RX
            PSIXX = -1.0d0/RST3 + 3.0d0*X**2/RST5
     $        -overQ/RX3 + (3.0d0*Q*(X-bdist)**2)/RX5 +2.0d0*A
          endif                     !end poten   !end rad
          radarray(iidx) = R
          garray(iidx) = DSQRT(PSIX**2+PSIY**2+PSIZ**2)
          oneoverg=1.0d0/garray(iidx)
          GRADX(iidx) = -PSIX*oneoverg
          GRADY(iidx) = -PSIY*oneoverg
          GRADZ(iidx) = -PSIZ*oneoverg
          surf(iidx) = COX*GRADX(iidx)+COY*GRADY(iidx)
     $      + COZ*GRADZ(iidx)
          if(surf(iidx).lt.0.7d0)surf(iidx)=0.7d0
          surf(iidx) = R**2 / surf(iidx)
          surf(iidx)=surf(iidx)*dphi*dtheta*snth
          xarray(iidx)=x
          yarray(iidx)=y
          zarray(iidx)=z
          sarea=sarea+surf(iidx)
          VOL = VOL + 1.0d0*R*R*R*dphi*dtheta*snth3
 1105   CONTINUE                    ! continue ibet loop
 1104 CONTINUE                      ! continue over ialf
!$acc end region
!$acc end data region

I compile with this command:
pgfortran -Mextend -O2 -o ELC ELC.for -ta=nvidia -Minfo

I get a lengthy list of messages:
5035, No parallel kernels found, accelerator region ignored
5046, Accelerator restriction: induction variable live-out from loop: ialf
5051, Loop carried scalar dependence for 'rad1_psi' at line 5123
Loop carried scalar dependence for 'rad1_dpsidr' at line 5123
Loop carried scalar dependence for 'rad1_r' at line 5066
Loop carried scalar dependence for 'rad1_r' at line 5068
Loop carried scalar dependence for 'rad1_r' at line 5070
Loop carried scalar dependence for 'rad1_r' at line 5123
Loop carried scalar dependence for 'rad1_r' at line 5124
5061, Loop carried scalar dependence for 'rad1_psi' at line 5123
Loop carried scalar dependence for 'rad1_dpsidr' at line 5123
Loop carried scalar dependence for 'rad1_r' at line 5066
Loop carried scalar dependence for 'rad1_r' at line 5068
Loop carried scalar dependence for 'rad1_r' at line 5070
Loop carried scalar dependence for 'rad1_r' at line 5123
Loop carried scalar dependence for 'rad1_r' at line 5124
5131, Accelerator restriction: induction variable live-out from loop: ialf
5187, No parallel kernels found, accelerator region ignored
5188, Loop carried scalar dependence for 'psi' at line 5261
Loop carried scalar dependence for 'psiy' at line 5342
Loop carried scalar dependence for 'psiy' at line 5345
Loop carried scalar dependence for 'psiz' at line 5342
Loop carried scalar dependence for 'psiz' at line 5346
Loop carried scalar dependence for 'psix' at line 5342
Loop carried scalar dependence for 'psix' at line 5344
Complex loop carried dependence of 'radarray' prevents parallelization
Complex loop carried dependence of 'garray' prevents parallelization
Complex loop carried dependence of 'gradx' prevents parallelization
Complex loop carried dependence of 'grady' prevents parallelization
Complex loop carried dependence of 'gradz' prevents parallelization
Complex loop carried dependence of 'surf' prevents parallelization
Complex loop carried dependence of 'xarray' prevents parallelization
Complex loop carried dependence of 'yarray' prevents parallelization
Complex loop carried dependence of 'zarray' prevents parallelization
Scalar last value needed after loop for 'z' at line 5475
Loop carried scalar dependence for 'dpsidr' at line 5261
Accelerator restriction: scalar variable live-out from loop: r
Accelerator restriction: scalar variable live-out from loop: psi
Accelerator restriction: scalar variable live-out from loop: z
Accelerator restriction: scalar variable live-out from loop: y
Accelerator restriction: scalar variable live-out from loop: x
Accelerator restriction: scalar variable live-out from loop: psixx
Accelerator restriction: scalar variable live-out from loop: psix
Accelerator restriction: scalar variable live-out from loop: psiz
Accelerator restriction: scalar variable live-out from loop: psiy
Accelerator restriction: scalar variable live-out from loop: coz
Accelerator restriction: scalar variable live-out from loop: coy
Accelerator restriction: scalar variable live-out from loop: cox
5190, Accelerator restriction: induction variable live-out from loop: ialf
5195, Loop carried scalar dependence for 'psi' at line 5261
Loop carried scalar dependence for 'psiy' at line 5342
Loop carried scalar dependence for 'psiy' at line 5345
Loop carried scalar dependence for 'psiz' at line 5342
Loop carried scalar dependence for 'psiz' at line 5346
Loop carried scalar dependence for 'psix' at line 5342
Loop carried scalar dependence for 'psix' at line 5344
Complex loop carried dependence of 'radarray' prevents parallelization
Complex loop carried dependence of 'garray' prevents parallelization
Loop carried dependence due to exposed use of 'garray(:)' prevents parallelization
Complex loop carried dependence of 'gradx' prevents parallelization
Loop carried dependence due to exposed use of 'gradx(:)' prevents parallelization
Complex loop carried dependence of 'grady' prevents parallelization
Loop carried dependence due to exposed use of 'grady(:)' prevents parallelization
Complex loop carried dependence of 'gradz' prevents parallelization
Loop carried dependence due to exposed use of 'gradz(:)' prevents parallelization
Complex loop carried dependence of 'surf' prevents parallelization
Loop carried dependence due to exposed use of 'surf(:)' prevents parallelization
Complex loop carried dependence of 'xarray' prevents parallelization
Complex loop carried dependence of 'yarray' prevents parallelization
Complex loop carried dependence of 'zarray' prevents parallelization
Scalar last value needed after loop for 'z' at line 5475
Loop carried scalar dependence for 'dpsidr' at line 5261
Loop carried scalar dependence for 'r' at line 5210
Loop carried scalar dependence for 'r' at line 5211
Loop carried scalar dependence for 'r' at line 5212
Loop carried scalar dependence for 'r' at line 5214
Loop carried scalar dependence for 'r' at line 5216
Loop carried scalar dependence for 'r' at line 5217
Loop carried scalar dependence for 'r' at line 5261
Loop carried scalar dependence for 'r' at line 5262
Loop carried scalar dependence for 'r' at line 5341
Loop carried scalar dependence for 'r' at line 5357
Loop carried scalar dependence for 'r' at line 5375
Accelerator restriction: scalar variable live-out from loop: r
Accelerator restriction: scalar variable live-out from loop: psi
Accelerator restriction: scalar variable live-out from loop: z
Accelerator restriction: scalar variable live-out from loop: y
Accelerator restriction: scalar variable live-out from loop: x
Accelerator restriction: scalar variable live-out from loop: psixx
Accelerator restriction: scalar variable live-out from loop: psix
Accelerator restriction: scalar variable live-out from loop: psiz
Accelerator restriction: scalar variable live-out from loop: psiy
Accelerator restriction: scalar variable live-out from loop: coz
Accelerator restriction: scalar variable live-out from loop: coy
Accelerator restriction: scalar variable live-out from loop: cox
Parallelization would require privatization of array 'zarray(:)'
Parallelization would require privatization of array 'yarray(:)'
Parallelization would require privatization of array 'xarray(:)'
Parallelization would require privatization of array 'radarray(:)'
Invariant if transformation
5196, Accelerator restriction: induction variable live-out from loop: ialf
5209, Scalar last value needed after loop for 'x' at line 5268
Scalar last value needed after loop for 'x' at line 5269
Scalar last value needed after loop for 'x' at line 5273
Scalar last value needed after loop for 'x' at line 5277
Scalar last value needed after loop for 'x' at line 5281
Scalar last value needed after loop for 'x' at line 5349
Scalar last value needed after loop for 'y' at line 5268
Scalar last value needed after loop for 'y' at line 5269
Scalar last value needed after loop for 'y' at line 5273
Scalar last value needed after loop for 'y' at line 5275
Scalar last value needed after loop for 'y' at line 5350
Scalar last value needed after loop for 'z' at line 5268
Scalar last value needed after loop for 'z' at line 5269
Scalar last value needed after loop for 'z' at line 5276
Scalar last value needed after loop for 'z' at line 5351
Scalar last value needed after loop for 'z' at line 5475
Loop carried scalar dependence for 'psi' at line 5261
Loop carried scalar dependence for 'dpsidr' at line 5261
Loop carried scalar dependence for 'r' at line 5210
Loop carried scalar dependence for 'r' at line 5211
Loop carried scalar dependence for 'r' at line 5212
Loop carried scalar dependence for 'r' at line 5214
Loop carried scalar dependence for 'r' at line 5216
Loop carried scalar dependence for 'r' at line 5217
Loop carried scalar dependence for 'r' at line 5261
Loop carried scalar dependence for 'r' at line 5262
Scalar last value needed after loop for 'r' at line 5341
Scalar last value needed after loop for 'r' at line 5357
Scalar last value needed after loop for 'r' at line 5375
Accelerator restriction: scalar variable live-out from loop: r
Accelerator restriction: scalar variable live-out from loop: psi
Accelerator restriction: scalar variable live-out from loop: z
Accelerator restriction: scalar variable live-out from loop: y
Accelerator restriction: scalar variable live-out from loop: x


I suspect that arrays like gradx and grady need to be saved. However, I
don't understand the complex dependencies, like why this error occurs:

Accelerator restriction: induction variable live-out from loop: ialf

The value of the latitude of the pixel (theta) is set from the index.
The longitude (phi) is also set from an index, but the compiler does
not care about that.

In this code, I need to compute values for various pixels and save them in arrays
in a few other places, so I would appreciate any advice on how this can be
done in parallel.

I am using pgfortran 10.3. We have a Linux box with a quadcore i7 and two
Nvidia C1060 cards.

Thanks,

Jerry

P.S. upon previewing, the spacing in the code snippet is messed up.
Hopefully it is still understandable.
mkcolg



Joined: 30 Jun 2004
Posts: 6211
Location: The Portland Group Inc.

Posted: Mon Mar 29, 2010 3:14 pm

Hi Jerry,

Let's walk through a few of these messages for the code shown.

Quote:
Loop carried scalar dependence for 'psiy' at line 5342

The "PSI" variables are all initialized within an IF statement. In serial code this is fine since the values from the previous iteration will be used. In parallel, scalars need to be "privatized" so that each thread has their own copy and can not be dependent upon previous iterations. To fix, you need to initialize these variables outside the IF statement or in an else clause.

Quote:
Complex loop carried dependence of 'radarray' prevents parallelization
For these, the problem is your use of the calculated index "iidx". At compile time, there is no way to know that the values of "iidx=mmdx(ialf,ibet)" are independent. The compiler must assume the worst case, where all values of mmdx would be the same and all threads are updating the same "iidx". In this case, the actual value returned would be non-deterministic.

To fix, you'll need to create temporary results arrays to store the values for each thread. Then perform the gather operation serially on the host.
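
Roughly like this (hypothetical names, just to show the shape): the accelerator region fills a temporary array indexed by (ialf,ibet), and the scatter through the computed index mmdx is done afterwards on the host.

Code:

      subroutine gather(nalph, nbet4, mmdx, tmprad, radarray)
      implicit none
      integer nalph, nbet4, ialf, ibet
      integer mmdx(nalph,nbet4)
      double precision tmprad(nalph,nbet4), radarray(nalph*nbet4)
c tmprad(ialf,ibet) was filled inside the accelerator region; the
c scatter through the computed index runs serially on the host
      do ialf = 1, nalph
         do ibet = 1, nbet4
            radarray(mmdx(ialf,ibet)) = tmprad(ialf,ibet)
         enddo
      enddo
      end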

Quote:
Accelerator restriction: scalar variable live-out from loop: r

The compiler will attempt to parallelize all DO loops. However, you're using variables calculated in the innermost loop in the middle loop. This makes them "live" (the value is used on the right-hand side) and again you could have non-deterministic results depending upon which thread's value is used.

In this case, you actually want the innermost loop to be sequential. To fix, add the "do kernel" clause above the "ibet" loop:
Code:

!$acc do kernel
DO 1105 ibet=1, 4*Nbet

This tells the compiler to still parallelize the ibet loop, but everything inside it becomes the body of the device kernel and is executed sequentially.

Hope this helps,
Mat
Jerry Orosz



Joined: 02 Jan 2008
Posts: 20
Location: San Diego

Posted: Fri Apr 02, 2010 10:05 pm    Post subject: loop parallelized, but code is slower

Hi Mat,

Thanks for the help. I easily fixed two of the problems by removing
the if-statement and adding the !$acc do kernel directive.
The problem with the iidx array counter can be solved by
changing the arrays back to two dimensions. Several years ago
I made most of the 2-dimensional arrays one dimensional to
speed up the performance, and I suppose it is necessary to change
them back if these loops are to be parallelized.
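
In other words, something along these lines (a sketch with made-up names and a stand-in loop body), where the store index is just the loop indices, so the iterations are clearly independent:

Code:

      subroutine fill2d(nalph, nbet4, radarr2)
      implicit none
      integer nalph, nbet4, ialf, ibet
      double precision radarr2(nalph,nbet4), r
!$acc region
      do ialf = 1, nalph
!$acc do kernel
         do ibet = 1, nbet4
            r = dble(ialf*ibet)        ! stand-in for the real work
            radarr2(ialf,ibet) = r     ! store index = loop indices
         enddo
      enddo
!$acc end region
      end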

I managed to get a similar loop structure to compile for parallel
execution. However, the code takes much longer to run. The run
time (on the wall clock) is about 17 seconds when the code is
compiled for the host CPU only:

pgfortran -Mextend -O2 -o serialELC ELC.for

The run time goes up to about 35 seconds when compiled for the GPU:

pgfortran -Mextend -O2 -o ELC ELC.for -ta=nvidia,time

Here is the loop in the subroutine:

Code:

           Na=120
           Nb=40
           dtheta=pie/dble(Na)
           vol=0.0d0
!$acc region
           do 104 ialf=1,Na
             r=1.0d-12
             psi=1.0d0
             dpsidr=1.0d0
             theta=-0.5d0*dtheta+dtheta*dble(ialf)
             dphi=twopie/dble(4*Nb)
             snth=dsin(theta)
             snth3=snth*(1.0d0/3.0d0)
             cnth=dcos(theta)
!$acc do kernel
             DO 105 ibet=1,Nb*4         
               phi=-0.5d0*dphi+dphi*dble(ibet)
               cox=dcos(phi)*snth             
               coy=dsin(phi)*snth             
               coz=cnth                       
               do irad=1,190    !subroutine rad
                 x=r*cox
                 y=r*coy
                 z=r*coz
                 t1=(bdist*bdist-2.0d0*cox*r*bdist+r*r)
                 t2=0.5d0*omega*omega*(1.0d0+overQ)*(1.0d0-coz*coz)
                 psi=1.0d0/r+overQ*(1.0d0/dsqrt(t1)-cox*r/(bdist*bdist))+r*r*t2
                 dpsidr=-1.0d0/(r*r)+overQ*((dsqrt(t1)**3)*(cox*bdist-r)
     %             -cox/(bdist*bdist))+t2*2.0d0*r
                 rnew=r-(psi-psi0)/dpsidr
                 dr=dabs(rnew-r)
                 if(dabs(dr).lt.1.0d-19)go to 1115
                 r=rnew
               enddo
1115           continue   
c             
               VOL = VOL + 1.0d0*R*R*R*dphi*dtheta*snth3
105          CONTINUE    ! continue ibet loop
104        CONTINUE                   ! continue over ialf
!$acc end region


Here is the compile command:

pgfortran -Mextend -O2 -o ELC ELC.for -ta=nvidia,time -Minfo=accel

The compiler spit out these messages:

Code:

      19524, Generating compute capability 1.3 kernel
       19525, Loop is parallelizable
              Accelerator kernel generated
           19525, !$acc do parallel, vector(120)
           19562, Sum reduction generated for vol
       19536, Loop carried scalar dependence for 'r' at line 19549
              Loop carried scalar dependence for 'r' at line 19551
              Loop carried scalar dependence for 'r' at line 19552
              Loop carried scalar dependence for 'r' at line 19555
              Loop carried scalar dependence for 'r' at line 19556
              Loop carried scalar dependence for 'r' at line 19562
       19545, Loop carried scalar dependence for 'r' at line 19549
              Loop carried scalar dependence for 'r' at line 19551
              Loop carried scalar dependence for 'r' at line 19552
              Loop carried scalar dependence for 'r' at line 19555
              Loop carried scalar dependence for 'r' at line 19556
              Scalar last value needed after loop for 'r' at line 19562
              Accelerator restriction: scalar variable live-out from loop: r


I guess the compiler is relatively happy, since it made a kernel for the GPU. When I run the code, I get the following:

Code:

  setupgeo
    5382: region entered 2 times
        time(us): init=0
/home/orosz/lightcurve/./lcsubs.for
  findradius
    19524: region entered 122 times
        time(us): total=24353792 init=52636 region=24301156
                  kernels=24291961 data=9195
        w/o init: total=24301156 max=379268 min=9607 avg=199189
        19525: kernel launched 122 times
            grid: [1]  block: [120]
            time(us): total=24289311 max=379188 min=9533 avg=199092
        19562: kernel launched 122 times
            grid: [1]  block: [256] 
            time(us): total=2650 max=105 min=15 avg=21


If I am reading this correctly, the GPU spent about 24 seconds
computing the loops. As noted above, the host CPU runs the entire
code in about 17 seconds, and this includes lots of other stuff in
addition to the 122 calls to the subroutine. The time spent moving
data in and out of memory appears to be minimal, as does the
initialization time.

As mentioned in the original post, the system is a quadcore i7 with
two Nvidia C1060 cards:

Code:

> pgaccelinfo
CUDA Driver Version            2030

Device Number:                 0
Device Name:                   Tesla C1060
Device Revision Number:        1.3
Global Memory Size:            4294705152
Number of Multiprocessors:     30
Number of Cores:               240
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 16384
Registers per Block:           16384
Warp Size:                     32
Maximum Threads per Block:     512
Maximum Block Dimensions:      512, 512, 64
Maximum Grid Dimensions:       65535 x 65535 x 1
Maximum Memory Pitch:          262144B
Texture Alignment              256B
Clock Rate:                    1296 MHz
Initialization time:           9430 microseconds
Current free memory            4246142976
Upload time (4MB)               891 microseconds ( 714 ms pinned)
Download time                   984 microseconds ( 731 ms pinned)
Upload bandwidth               4707 MB/sec (5874 MB/sec pinned)
Download bandwidth             4262 MB/sec (5737 MB/sec pinned)

Device Number:                 1
Device Name:                   Quadro FX 380
Device Revision Number:        1.1
Global Memory Size:            268107776
Number of Multiprocessors:     2
Number of Cores:               16
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 16384
Registers per Block:           8192
Warp Size:                     32
Maximum Threads per Block:     512
Maximum Block Dimensions:      512, 512, 64
Maximum Grid Dimensions:       65535 x 65535 x 1
Maximum Memory Pitch:          262144B
Texture Alignment              256B
Clock Rate:                    1100 MHz
Initialization time:           9430 microseconds
Current free memory            220147712
Upload time (4MB)              1535 microseconds (1357 ms pinned)
Download time                  1481 microseconds (1273 ms pinned)
Upload bandwidth               2732 MB/sec (3090 MB/sec pinned)
Download bandwidth             2832 MB/sec (3294 MB/sec pinned)

Device Number:                 2
Device Name:                   Tesla C1060
Device Revision Number:        1.3
Global Memory Size:            4294705152
Number of Multiprocessors:     30
Number of Cores:               240
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 16384
Registers per Block:           16384
Warp Size:                     32
Maximum Threads per Block:     512
Maximum Block Dimensions:      512, 512, 64
Maximum Grid Dimensions:       65535 x 65535 x 1
Maximum Memory Pitch:          262144B
Texture Alignment              256B
Clock Rate:                    1296 MHz
Initialization time:           9430 microseconds
Current free memory            4246142976
Upload time (4MB)              1515 microseconds (1345 ms pinned)
Download time                  1469 microseconds (1272 ms pinned)
Upload bandwidth               2768 MB/sec (3118 MB/sec pinned)
Download bandwidth             2855 MB/sec (3297 MB/sec pinned)


The two Tesla cards should have the capability to run double precision.
When I set the ACC_NOTIFY environment variable, I get messages like

Quote:

launch kernel file=/home/orosz/lightcurve/./lcsubs.for function=findradius line=19525 device=0 grid=1 block=120
launch kernel file=/home/orosz/lightcurve/./lcsubs.for function=findradius line=19562 device=0 grid=1 block=256


Apparently, the system is using the Tesla card at device 0 to compute
the loops, and not the video card at device=1 that runs the
screen saver.

My bottom line question: why are these parallel loops running so slowly?

Thanks in advance,

Jerry
mkcolg



Joined: 30 Jun 2004
Posts: 6211
Location: The Portland Group Inc.

Posted: Mon Apr 05, 2010 12:52 pm

Hi Jerry,

I'd next try adjusting the schedule using the "!$acc do" directive's "parallel" and "vector" clauses (such as "!$acc do parallel(4), vector(30)"), though I'm not sure how much this will help.
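
For example, on a stripped-down version of your loop nest (stand-in body, hypothetical subroutine name), with 4*30 matching Na=120:

Code:

      subroutine sched(na, nb4, vol)
      implicit none
      integer na, nb4, ialf, ibet
      double precision vol
      vol = 0.0d0
!$acc region
!$acc do parallel(4), vector(30)
      do ialf = 1, na
         do ibet = 1, nb4
            vol = vol + dble(ialf*ibet)   ! stand-in for the real work
         enddo
      enddo
!$acc end region
      end

Which loop gets which schedule is worth experimenting with; -Minfo=accel will show you the schedule the compiler actually used.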

Are you able to increase the value of "Na"? I'm concerned that 120 is too small to see any significant speed-up. While there are a lot of cores on a Tesla, individually they aren't very fast. If not, you might consider using OpenMP instead and running your code on a multi-core CPU.
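
For reference, an OpenMP version of the same stripped-down nest might look roughly like this (stand-in body, hypothetical subroutine name); compile with -mp and set OMP_NUM_THREADS:

Code:

      subroutine omploop(na, nb4, vol)
      implicit none
      integer na, nb4, ialf, ibet
      double precision vol, r
      vol = 0.0d0
!$omp parallel do private(ibet, r) reduction(+:vol)
      do ialf = 1, na
         do ibet = 1, nb4
            r = dble(ialf*ibet)         ! stand-in for the Newton step
            vol = vol + r*r*r
         enddo
      enddo
!$omp end parallel do
      end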

- Mat
Jerry Orosz



Joined: 02 Jan 2008
Posts: 20
Location: San Diego

Posted: Mon Apr 05, 2010 3:20 pm


Hi Mat,

I cranked up the values of Na and Nb, but the gap between the serial and parallel versions gets even worse.

I am wondering what exactly is being made parallel in this case. Here is the current section of code. I tried the "do parallel" directives (briefly), but they did not help, so I have them commented out here (I really should read up on exactly how they are used).


Code:

!$acc data region
!$acc region
c!$acc do parallel
           do 104 ialf=1,Na
c
             r=1.0d-12
             psi=1.0d0
             dpsidr=1.0d0
             theta=-0.5d0*dtheta+dtheta*dble(ialf)
             dphi=twopie/dble(4*Nb)
             snth=dsin(theta)
             snth3=snth*(1.0d0/3.0d0)
             cnth=dcos(theta)
c!$acc do parallel
             DO 105 ibet=1,Nb*4       
               phi=-0.5d0*dphi+dphi*dble(ibet)
               cox=dcos(phi)*snth             
               coy=dsin(phi)*snth             
               coz=cnth                       
!$acc do kernel
               do irad=1,60   !160    !subroutine rad
                 x=r*cox
                 y=r*coy
                 z=r*coz
                 t1=(bdist*bdist-2.0d0*cox*r*bdist+r*r)
                 t2=0.5d0*omega*omega*(1.0d0+overQ)*(1.0d0-coz*coz)
                 psi=1.0d0/r+overQ*(1.0d0/dsqrt(t1)-cox*r/(bdist*bdist))+r*r*t2
                 dpsidr=-1.0d0/(r*r)+overQ*((dsqrt(t1)**3)*(cox*bdist-r)
     %             -cox/(bdist*bdist))+t2*2.0d0*r
                 rnew=r-(psi-psi0)/dpsidr
                 dr=dabs(rnew-r)
c                 if(dabs(dr).lt.1.0d-15)go to 1115
                 r=rnew
               enddo
1115           continue   
               VOL = VOL + 1.0d0*R*R*R*dphi*dtheta*snth3
105          CONTINUE   

104        CONTINUE                 
!$acc end region
!$acc end data region


Here is the compiler message:

Code:

       19528, Generating compute capability 1.3 kernel
       19530, Loop is parallelizable
              Accelerator kernel generated
           19530, !$acc do parallel, vector(256)
           19562, Sum reduction generated for vol
       19541, Loop carried scalar dependence for 'r' at line 19551
              Loop carried scalar dependence for 'r' at line 19553
              Loop carried scalar dependence for 'r' at line 19554
              Loop carried scalar dependence for 'r' at line 19556
       19547, Loop carried scalar dependence for 'r' at line 19551
              Loop carried scalar dependence for 'r' at line 19553
              Loop carried scalar dependence for 'r' at line 19554
              Loop carried scalar dependence for 'r' at line 19556
              Scalar last value needed after loop for 'r' at line 19562
              Accelerator restriction: scalar variable live-out from loop: r
              Inner sequential loop scheduled on accelerator


There is still an accelerator restriction message. I am not sure if this matters with regard to the actual performance, but I suppose at some level it must, since the performance gets so bad. I have compiled and run some of the simple test codes from the PGI pages (from the 4-part series by Michael Wolfe), and the code run on the GPU outperforms the same code run on the CPU.

Finally, I also commented out the statement in the innermost loop that bails out when convergence is reached. This ensures that the same number of operations is done in serial and in parallel mode. (In serial mode, only a few iterations are needed once the first pixel in a given latitude row is done.) So the comparison of run times should be fair, since the same number of operations is done in each case.

Jerry