Jerry Orosz
Joined: 02 Jan 2008 Posts: 12 Location: San Diego
Posted: Mon Mar 22, 2010 8:07 am Post subject: accelerator parallelization issues
I am new to parallel programming. I have a legacy FORTRAN-77
code (about 15,000 lines) that models eclipsing binary stars.
The stars are divided up into rectangular grids in "longitude"
and "latitude", and much of the computing time is spent
looping over the grid(s) and computing various things. Since the
GPUs don't support subroutine and function calls, I have started
manually inlining subroutine calls.
This section of the code computes various physical quantities for
each pixel on the star, and saves the results into several arrays.
Each pixel is independent of every other pixel, so I would think
loops like these should be able to run in parallel. At
each pixel, there is a Newton-Raphson iteration to find the radius
vector.
| Code: |
!$acc region
      do 1104 ialf=1,Nalph
         r=0.0000001d0
         theta=-0.5d0*dtheta+dtheta*dble(ialf)
         dphi=twopie/dble(4*Nbet)       !dble(ibetlim(ialf))
         snth=dsin(theta)
         snth3=dsin(theta)/3.0d0
         cnth=dcos(theta)
         DO 1105 ibet=1,4*Nbet
            iidx=mmdx(ialf,ibet)
            phi=-0.5d0*dphi+dphi*dble(ibet)
            phi=phi+phistart(ialf)
            cox=dcos(phi)*snth
            coy=dsin(phi)*snth
            coz=cnth
c           begin in-line subroutine rad
            do irad=1,190               !Newton-Raphson iteration
               x=r*cox
               y=r*coy
               z=r*coz
               if(itide.lt.2)then       !in-line subroutine spherepot
                  t1=(bdist*bdist-2.0d0*cox*r*bdist+r*r)
                  t2=0.5d0*omega*omega*(1.0d0+overQ)*(1.0d0-coz*coz)
                  psi=1.0d0/r+overQ*(1.0d0/dsqrt(t1)-cox*r/(bdist*bdist))+r*r*t2
                  dpsidr=-1.0d0/(r*r)+overQ*((dsqrt(t1)**3)*(cox*bdist-r)
     %                 -cox/(bdist*bdist))+t2*2.0d0*r
               endif                    !end spherepot
               rnew=r-(psi-psi0)/dpsidr
               dr=dabs(rnew-r)
               if(dabs(dr).lt.1.0d-19)go to 4115
               r=rnew
            enddo
 4115       continue
            if(itide.lt.2)then          !in-line subroutine poten
               RST = DSQRT(X**2 + Y**2 + Z**2)
               RX = DSQRT((X-bdist)**2 + Y**2 + Z**2)
               A = ((1.0d0+overQ)/2.0d0) * OMEGA**2
               RST3 = RST*RST*RST
               RX3 = RX*RX*RX
               PSI = 1.0d0/RST + overQ/RX - overQ*X/bdist/bdist
     &               + A*(X**2 + Y**2)
               PSIY = -Y/RST3 - overQ*Y/RX3 + 2.0d0*A*Y
               PSIZ = -Z/RST3 - overQ*Z/RX3
               PSIX = -X/RST3 - overQ*(X-bdist)/RX3 - overQ/bdist/bdist
     &               + 2.0d0*A*X
               RST5 = RST3*RST*RST
               RX5 = RX3*RX*RX
               PSIXX = -1.0d0/RST3 + 3.0d0*X**2/RST5
     $               - overQ/RX3 + (3.0d0*Q*(X-bdist)**2)/RX5 + 2.0d0*A
            endif                       !end poten  !end rad
            radarray(iidx) = R
            garray(iidx) = DSQRT(PSIX**2+PSIY**2+PSIZ**2)
            oneoverg=1.0d0/garray(iidx)
            GRADX(iidx) = -PSIX*oneoverg
            GRADY(iidx) = -PSIY*oneoverg
            GRADZ(iidx) = -PSIZ*oneoverg
            surf(iidx) = COX*GRADX(iidx)+COY*GRADY(iidx)
     $               + COZ*GRADZ(iidx)
            if(surf(iidx).lt.0.7d0)surf(iidx)=0.7d0
            surf(iidx) = R**2 / surf(iidx)
            surf(iidx)=surf(iidx)*dphi*dtheta*snth
            xarray(iidx)=x
            yarray(iidx)=y
            zarray(iidx)=z
            sarea=sarea+surf(iidx)
            VOL = VOL + 1.0d0*R*R*R*dphi*dtheta*snth3
 1105    CONTINUE                       ! end of ibet loop
 1104 CONTINUE                          ! end of ialf loop
!$acc end region
!$acc end data region
|
I compile with this command:
pgfortran -Mextend -O2 -o ELC ELC.for -ta=nvidia -Minfo
I get a lengthy list of messages:
| Code: |
5035, No parallel kernels found, accelerator region ignored
5046, Accelerator restriction: induction variable live-out from loop: ialf
5051, Loop carried scalar dependence for 'rad1_psi' at line 5123
Loop carried scalar dependence for 'rad1_dpsidr' at line 5123
Loop carried scalar dependence for 'rad1_r' at line 5066
Loop carried scalar dependence for 'rad1_r' at line 5068
Loop carried scalar dependence for 'rad1_r' at line 5070
Loop carried scalar dependence for 'rad1_r' at line 5123
Loop carried scalar dependence for 'rad1_r' at line 5124
5061, Loop carried scalar dependence for 'rad1_psi' at line 5123
Loop carried scalar dependence for 'rad1_dpsidr' at line 5123
Loop carried scalar dependence for 'rad1_r' at line 5066
Loop carried scalar dependence for 'rad1_r' at line 5068
Loop carried scalar dependence for 'rad1_r' at line 5070
Loop carried scalar dependence for 'rad1_r' at line 5123
Loop carried scalar dependence for 'rad1_r' at line 5124
5131, Accelerator restriction: induction variable live-out from loop: ialf
5187, No parallel kernels found, accelerator region ignored
5188, Loop carried scalar dependence for 'psi' at line 5261
Loop carried scalar dependence for 'psiy' at line 5342
Loop carried scalar dependence for 'psiy' at line 5345
Loop carried scalar dependence for 'psiz' at line 5342
Loop carried scalar dependence for 'psiz' at line 5346
Loop carried scalar dependence for 'psix' at line 5342
Loop carried scalar dependence for 'psix' at line 5344
Complex loop carried dependence of 'radarray' prevents parallelization
Complex loop carried dependence of 'garray' prevents parallelization
Complex loop carried dependence of 'gradx' prevents parallelization
Complex loop carried dependence of 'grady' prevents parallelization
Complex loop carried dependence of 'gradz' prevents parallelization
Complex loop carried dependence of 'surf' prevents parallelization
Complex loop carried dependence of 'xarray' prevents parallelization
Complex loop carried dependence of 'yarray' prevents parallelization
Complex loop carried dependence of 'zarray' prevents parallelization
Scalar last value needed after loop for 'z' at line 5475
Loop carried scalar dependence for 'dpsidr' at line 5261
Accelerator restriction: scalar variable live-out from loop: r
Accelerator restriction: scalar variable live-out from loop: psi
Accelerator restriction: scalar variable live-out from loop: z
Accelerator restriction: scalar variable live-out from loop: y
Accelerator restriction: scalar variable live-out from loop: x
Accelerator restriction: scalar variable live-out from loop: psixx
Accelerator restriction: scalar variable live-out from loop: psix
Accelerator restriction: scalar variable live-out from loop: psiz
Accelerator restriction: scalar variable live-out from loop: psiy
Accelerator restriction: scalar variable live-out from loop: coz
Accelerator restriction: scalar variable live-out from loop: coy
Accelerator restriction: scalar variable live-out from loop: cox
5190, Accelerator restriction: induction variable live-out from loop: ialf
5195, Loop carried scalar dependence for 'psi' at line 5261
Loop carried scalar dependence for 'psiy' at line 5342
Loop carried scalar dependence for 'psiy' at line 5345
Loop carried scalar dependence for 'psiz' at line 5342
Loop carried scalar dependence for 'psiz' at line 5346
Loop carried scalar dependence for 'psix' at line 5342
Loop carried scalar dependence for 'psix' at line 5344
Complex loop carried dependence of 'radarray' prevents parallelization
Complex loop carried dependence of 'garray' prevents parallelization
Loop carried dependence due to exposed use of 'garray(:)' prevents parallelization
Complex loop carried dependence of 'gradx' prevents parallelization
Loop carried dependence due to exposed use of 'gradx(:)' prevents parallelization
Complex loop carried dependence of 'grady' prevents parallelization
Loop carried dependence due to exposed use of 'grady(:)' prevents parallelization
Complex loop carried dependence of 'gradz' prevents parallelization
Loop carried dependence due to exposed use of 'gradz(:)' prevents parallelization
Complex loop carried dependence of 'surf' prevents parallelization
Loop carried dependence due to exposed use of 'surf(:)' prevents parallelization
Complex loop carried dependence of 'xarray' prevents parallelization
Complex loop carried dependence of 'yarray' prevents parallelization
Complex loop carried dependence of 'zarray' prevents parallelization
Scalar last value needed after loop for 'z' at line 5475
Loop carried scalar dependence for 'dpsidr' at line 5261
Loop carried scalar dependence for 'r' at line 5210
Loop carried scalar dependence for 'r' at line 5211
Loop carried scalar dependence for 'r' at line 5212
Loop carried scalar dependence for 'r' at line 5214
Loop carried scalar dependence for 'r' at line 5216
Loop carried scalar dependence for 'r' at line 5217
Loop carried scalar dependence for 'r' at line 5261
Loop carried scalar dependence for 'r' at line 5262
Loop carried scalar dependence for 'r' at line 5341
Loop carried scalar dependence for 'r' at line 5357
Loop carried scalar dependence for 'r' at line 5375
Accelerator restriction: scalar variable live-out from loop: r
Accelerator restriction: scalar variable live-out from loop: psi
Accelerator restriction: scalar variable live-out from loop: z
Accelerator restriction: scalar variable live-out from loop: y
Accelerator restriction: scalar variable live-out from loop: x
Accelerator restriction: scalar variable live-out from loop: psixx
Accelerator restriction: scalar variable live-out from loop: psix
Accelerator restriction: scalar variable live-out from loop: psiz
Accelerator restriction: scalar variable live-out from loop: psiy
Accelerator restriction: scalar variable live-out from loop: coz
Accelerator restriction: scalar variable live-out from loop: coy
Accelerator restriction: scalar variable live-out from loop: cox
Parallelization would require privatization of array 'zarray(:)'
Parallelization would require privatization of array 'yarray(:)'
Parallelization would require privatization of array 'xarray(:)'
Parallelization would require privatization of array 'radarray(:)'
Invariant if transformation
5196, Accelerator restriction: induction variable live-out from loop: ialf
5209, Scalar last value needed after loop for 'x' at line 5268
Scalar last value needed after loop for 'x' at line 5269
Scalar last value needed after loop for 'x' at line 5273
Scalar last value needed after loop for 'x' at line 5277
Scalar last value needed after loop for 'x' at line 5281
Scalar last value needed after loop for 'x' at line 5349
Scalar last value needed after loop for 'y' at line 5268
Scalar last value needed after loop for 'y' at line 5269
Scalar last value needed after loop for 'y' at line 5273
Scalar last value needed after loop for 'y' at line 5275
Scalar last value needed after loop for 'y' at line 5350
Scalar last value needed after loop for 'z' at line 5268
Scalar last value needed after loop for 'z' at line 5269
Scalar last value needed after loop for 'z' at line 5276
Scalar last value needed after loop for 'z' at line 5351
Scalar last value needed after loop for 'z' at line 5475
Loop carried scalar dependence for 'psi' at line 5261
Loop carried scalar dependence for 'dpsidr' at line 5261
Loop carried scalar dependence for 'r' at line 5210
Loop carried scalar dependence for 'r' at line 5211
Loop carried scalar dependence for 'r' at line 5212
Loop carried scalar dependence for 'r' at line 5214
Loop carried scalar dependence for 'r' at line 5216
Loop carried scalar dependence for 'r' at line 5217
Loop carried scalar dependence for 'r' at line 5261
Loop carried scalar dependence for 'r' at line 5262
Scalar last value needed after loop for 'r' at line 5341
Scalar last value needed after loop for 'r' at line 5357
Scalar last value needed after loop for 'r' at line 5375
Accelerator restriction: scalar variable live-out from loop: r
Accelerator restriction: scalar variable live-out from loop: psi
Accelerator restriction: scalar variable live-out from loop: z
Accelerator restriction: scalar variable live-out from loop: y
Accelerator restriction: scalar variable live-out from loop: x
|
I suspect that the arrays like gradx, grady, etc. need to be saved. However, I
don't understand the complex dependencies, like why this error occurs:
Accelerator restriction: induction variable live-out from loop: ialf
The value of the latitude of the pixel (theta) is set from the ialf index.
The longitude (phi) is also set from an index, but the compiler does
not complain about that one.
In this code, I need to compute values for various pixels and save them in arrays
in a few other places, so I would appreciate any advice on how this can be
done in parallel.
I am using pgfortran 10.3. We have a Linux box with a quadcore i7 and two
Nvidia C1060 cards.
Thanks,
Jerry
P.S. Upon previewing, the spacing in the code snippet is messed up.
Hopefully it is still understandable.
mkcolg
Joined: 30 Jun 2004 Posts: 4996 Location: The Portland Group Inc.
Posted: Mon Mar 29, 2010 3:14 pm Post subject: |
Hi Jerry,
Let's walk through a few of these messages for the code shown.
| Quote: | Loop carried scalar dependence for 'psiy' at line 5342 |
The "PSI" variables are all initialized within an IF statement. In serial code this is fine, since the values from the previous iteration will be used. In parallel, these scalars need to be "privatized" so that each thread has its own copy and cannot depend on previous iterations. To fix, you need to initialize these variables outside the IF statement or in an ELSE clause.
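For example, here's a minimal sketch of the pattern using your spherepot branch. The values assigned before the IF are only placeholders so that every iteration defines the scalars; in your full code you'd use whatever the itide.ge.2 case is supposed to compute, and the same idea applies to the PSIX/PSIY/PSIZ/PSIXX block:
| Code: |
c     define the scalars unconditionally so every iteration has its own
c     definition and the compiler can privatize them (placeholder values)
      psi=psi0
      dpsidr=1.0d0
      if(itide.lt.2)then               !in-line subroutine spherepot
         t1=(bdist*bdist-2.0d0*cox*r*bdist+r*r)
         t2=0.5d0*omega*omega*(1.0d0+overQ)*(1.0d0-coz*coz)
         psi=1.0d0/r+overQ*(1.0d0/dsqrt(t1)-cox*r/(bdist*bdist))+r*r*t2
         dpsidr=-1.0d0/(r*r)+overQ*((dsqrt(t1)**3)*(cox*bdist-r)
     %        -cox/(bdist*bdist))+t2*2.0d0*r
      endif                            !end spherepot
|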
| Quote: | Complex loop carried dependence of 'radarray' prevents parallelization |
For these, the problem is your use of the calculated index "iidx". At compile time, there is no way to know that the values of "iidx=mmdx(ialf,ibet)" are independent. The compiler must assume the worst case, where all values of mmdx are the same and all threads update the same "iidx". In that case, the value actually stored would be non-deterministic.
To fix, you'll need to create temporary results arrays to store the values for each thread. Then perform the gather operation serially on the host.
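A rough sketch of that idea, using hypothetical temporaries radtmp and gtmp (my names, not from your code) dimensioned (Nalph,4*Nbet):
| Code: |
!$acc region
      do 1104 ialf=1,Nalph
         DO 1105 ibet=1,4*Nbet
c           (same per-pixel work as before, producing r, psix, psiy, psiz)
c           store the results by the loop indices, which are independent
            radtmp(ialf,ibet)=r
            gtmp(ialf,ibet)=dsqrt(psix**2+psiy**2+psiz**2)
 1105    CONTINUE
 1104 CONTINUE
!$acc end region
c     gather into the existing 1-D arrays serially on the host
      do ialf=1,Nalph
         do ibet=1,4*Nbet
            iidx=mmdx(ialf,ibet)
            radarray(iidx)=radtmp(ialf,ibet)
            garray(iidx)=gtmp(ialf,ibet)
         enddo
      enddo
|
The same pattern would apply to the other output arrays (gradx, grady, gradz, surf, and so on).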
| Quote: | Accelerator restriction: scalar variable live-out from loop: r |
The compiler will attempt to parallelize all DO loops. However, you're using variables calculated in the innermost loop in the middle loop. This makes them "live" (the value is used on the right-hand side), and again you could have non-deterministic results depending upon which thread's value is used.
In this case, you actually want the innermost loop to be sequential. To fix, add the "do kernel" clause above the "ibet" loop:
| Code: |
!$acc do kernel
DO 1105 ibet=1, 4*Nbet
|
This tells the compiler to still parallelize the ibet loop but everything in the loop becomes the body of the device kernel code and is executed sequentially.
Hope this helps,
Mat
Jerry Orosz
Joined: 02 Jan 2008 Posts: 12 Location: San Diego
Posted: Fri Apr 02, 2010 10:05 pm Post subject: loop parallelized, but code is slower |
Hi Mat,
Thanks for the help. I easily fixed two of the problems by removing
the if-statement and adding the !$acc do kernel directive.
The problem with the iidx array index can be solved by
changing the arrays back to two dimensions. Several years ago
I made most of the two-dimensional arrays one-dimensional to
speed up performance, and I suppose it is necessary to change
them back if these loops are to be parallelized.
I managed to get a similar loop structure to compile for parallel
execution. However, the code takes much longer to run. The
wall-clock run time is about 17 seconds when the code is
compiled for the host CPU only:
pgfortran -Mextend -O2 -o serialELC ELC.for
The run time goes up to about 35 seconds when compiled for the GPU:
pgfortran -Mextend -O2 -o ELC ELC.for -ta=nvidia,time
Here is the loop in the subroutine:
| Code: |
      Na=120
      Nb=40
      dtheta=pie/dble(Na)
      vol=0.0d0
!$acc region
      do 104 ialf=1,Na
         r=1.0d-12
         psi=1.0d0
         dpsidr=1.0d0
         theta=-0.5d0*dtheta+dtheta*dble(ialf)
         dphi=twopie/dble(4*Nb)
         snth=dsin(theta)
         snth3=snth*(1.0d0/3.0d0)
         cnth=dcos(theta)
!$acc do kernel
         DO 105 ibet=1,Nb*4
            phi=-0.5d0*dphi+dphi*dble(ibet)
            cox=dcos(phi)*snth
            coy=dsin(phi)*snth
            coz=cnth
            do irad=1,190               !subroutine rad
               x=r*cox
               y=r*coy
               z=r*coz
               t1=(bdist*bdist-2.0d0*cox*r*bdist+r*r)
               t2=0.5d0*omega*omega*(1.0d0+overQ)*(1.0d0-coz*coz)
               psi=1.0d0/r+overQ*(1.0d0/dsqrt(t1)-cox*r/(bdist*bdist))+r*r*t2
               dpsidr=-1.0d0/(r*r)+overQ*((dsqrt(t1)**3)*(cox*bdist-r)
     %              -cox/(bdist*bdist))+t2*2.0d0*r
               rnew=r-(psi-psi0)/dpsidr
               dr=dabs(rnew-r)
               if(dabs(dr).lt.1.0d-19)go to 1115
               r=rnew
            enddo
 1115       continue
c
            VOL = VOL + 1.0d0*R*R*R*dphi*dtheta*snth3
 105     CONTINUE                       ! continue ibet loop
 104  CONTINUE                          ! continue over ialf
!$acc end region
|
Here is the compile command:
pgfortran -Mextend -O2 -o ELC ELC.for -ta=nvidia,time -Minfo=accel
The compiler spit out these messages:
| Code: |
19524, Generating compute capability 1.3 kernel
19525, Loop is parallelizable
Accelerator kernel generated
19525, !$acc do parallel, vector(120)
19562, Sum reduction generated for vol
19536, Loop carried scalar dependence for 'r' at line 19549
Loop carried scalar dependence for 'r' at line 19551
Loop carried scalar dependence for 'r' at line 19552
Loop carried scalar dependence for 'r' at line 19555
Loop carried scalar dependence for 'r' at line 19556
Loop carried scalar dependence for 'r' at line 19562
19545, Loop carried scalar dependence for 'r' at line 19549
Loop carried scalar dependence for 'r' at line 19551
Loop carried scalar dependence for 'r' at line 19552
Loop carried scalar dependence for 'r' at line 19555
Loop carried scalar dependence for 'r' at line 19556
Scalar last value needed after loop for 'r' at line 19562
Accelerator restriction: scalar variable live-out from loop: r
|
I guess the compiler is relatively happy, since it made a kernel for the GPU. When I run the code, I get the following:
| Code: |
setupgeo
5382: region entered 2 times
time(us): init=0
/home/orosz/lightcurve/./lcsubs.for
findradius
19524: region entered 122 times
time(us): total=24353792 init=52636 region=24301156
kernels=24291961 data=9195
w/o init: total=24301156 max=379268 min=9607 avg=199189
19525: kernel launched 122 times
grid: [1] block: [120]
time(us): total=24289311 max=379188 min=9533 avg=199092
19562: kernel launched 122 times
grid: [1] block: [256]
time(us): total=2650 max=105 min=15 avg=21
|
If I am reading this correctly, the GPU spent about 24 seconds
computing the loops. As noted above, the host CPU runs the entire
code in about 17 seconds, and this includes lots of other stuff in
addition to the 122 calls to the subroutine. The time spent moving
data in and out of memory appears to be minimal, as does the
initialization time.
As mentioned in the original post, the system is a quadcore i7 with
two Nvidia C1060 cards:
| Code: |
> pgaccelinfo
CUDA Driver Version 2030
Device Number: 0
Device Name: Tesla C1060
Device Revision Number: 1.3
Global Memory Size: 4294705152
Number of Multiprocessors: 30
Number of Cores: 240
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 16384
Registers per Block: 16384
Warp Size: 32
Maximum Threads per Block: 512
Maximum Block Dimensions: 512, 512, 64
Maximum Grid Dimensions: 65535 x 65535 x 1
Maximum Memory Pitch: 262144B
Texture Alignment 256B
Clock Rate: 1296 MHz
Initialization time: 9430 microseconds
Current free memory 4246142976
Upload time (4MB) 891 microseconds ( 714 ms pinned)
Download time 984 microseconds ( 731 ms pinned)
Upload bandwidth 4707 MB/sec (5874 MB/sec pinned)
Download bandwidth 4262 MB/sec (5737 MB/sec pinned)
Device Number: 1
Device Name: Quadro FX 380
Device Revision Number: 1.1
Global Memory Size: 268107776
Number of Multiprocessors: 2
Number of Cores: 16
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 16384
Registers per Block: 8192
Warp Size: 32
Maximum Threads per Block: 512
Maximum Block Dimensions: 512, 512, 64
Maximum Grid Dimensions: 65535 x 65535 x 1
Maximum Memory Pitch: 262144B
Texture Alignment 256B
Clock Rate: 1100 MHz
Initialization time: 9430 microseconds
Current free memory 220147712
Upload time (4MB) 1535 microseconds (1357 ms pinned)
Download time 1481 microseconds (1273 ms pinned)
Upload bandwidth 2732 MB/sec (3090 MB/sec pinned)
Download bandwidth 2832 MB/sec (3294 MB/sec pinned)
Device Number: 2
Device Name: Tesla C1060
Device Revision Number: 1.3
Global Memory Size: 4294705152
Number of Multiprocessors: 30
Number of Cores: 240
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 16384
Registers per Block: 16384
Warp Size: 32
Maximum Threads per Block: 512
Maximum Block Dimensions: 512, 512, 64
Maximum Grid Dimensions: 65535 x 65535 x 1
Maximum Memory Pitch: 262144B
Texture Alignment 256B
Clock Rate: 1296 MHz
Initialization time: 9430 microseconds
Current free memory 4246142976
Upload time (4MB) 1515 microseconds (1345 ms pinned)
Download time 1469 microseconds (1272 ms pinned)
Upload bandwidth 2768 MB/sec (3118 MB/sec pinned)
Download bandwidth 2855 MB/sec (3297 MB/sec pinned)
|
The two Tesla cards should have the capability to run double precision.
When I set the ACC_NOTIFY environment variable, I get messages like
| Quote: |
launch kernel file=/home/orosz/lightcurve/./lcsubs.for function=findradius line=19525 device=0 grid=1 block=120
launch kernel file=/home/orosz/lightcurve/./lcsubs.for function=findradius line=19562 device=0 grid=1 block=256
|
Apparently, the system is using the Tesla card at device 0 to compute
the loops, and not the video card at device 1 that runs the
screen saver.
My bottom-line question: why are these parallel loops running so slowly?
Thanks in advance,
Jerry
mkcolg
Joined: 30 Jun 2004 Posts: 4996 Location: The Portland Group Inc.
Posted: Mon Apr 05, 2010 12:52 pm Post subject: |
Hi Jerry,
Next, I'd try adjusting the schedule using the "!$acc do" "parallel" and "vector" clauses (such as "!$acc do parallel(4), vector(30)"), though I'm not sure how much this will help.
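For example (just a sketch; the numbers are only a starting point and you'd want to experiment):
| Code: |
!$acc region
!$acc do parallel(4), vector(30)
      do 104 ialf=1,Na
|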
Are you able to increase the value of "Na"? I'm concerned that 120 is too small to see any significant speed-up. While there are a lot of cores on a Tesla, individually they aren't very fast. If not, you may consider using OpenMP instead and run your code on a multi-core CPU.
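If you do try OpenMP, a minimal sketch of the same loop nest would look something like the following (compile with -mp; the private list is only illustrative and would need to be checked against your full loop body):
| Code: |
!$omp parallel do reduction(+:vol)
!$omp& private(r,psi,dpsidr,theta,dphi,snth,snth3,cnth,ibet,phi,
!$omp& cox,coy,coz,irad,x,y,z,t1,t2,rnew,dr)
      do 104 ialf=1,Na
c        (same loop body as the accelerator version)
 104  CONTINUE
!$omp end parallel do
|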
- Mat
Jerry Orosz
Joined: 02 Jan 2008 Posts: 12 Location: San Diego
Posted: Mon Apr 05, 2010 3:20 pm Post subject: |
| mkcolg wrote: | Hi Jerry,
Next, I'd try adjusting the schedule using the "!$acc do" "parallel" and "vector" clauses (such as "!$acc do parallel(4), vector(30)"), though I'm not sure how much this will help.
Are you able to increase the value of "Na"? I'm concerned that 120 is too small to see any significant speed-up. While there are a lot of cores on a Tesla, individually they aren't very fast. If not, you may consider using OpenMP instead and run your code on a multi-core CPU.
- Mat |
Hi Mat,
I cranked up the values of Na and Nb, but the difference between the serial and parallel versions became even worse.
I am wondering what exactly is being made parallel in this case. Here is the current version of the code. I tried the "do parallel" directives (briefly) but they did not help, so I have them commented out here (I really should read up on exactly how they are used).
| Code: |
!$acc data region
!$acc region
c!$acc do parallel
      do 104 ialf=1,Na
c
         r=1.0d-12
         psi=1.0d0
         dpsidr=1.0d0
         theta=-0.5d0*dtheta+dtheta*dble(ialf)
         dphi=twopie/dble(4*Nb)
         snth=dsin(theta)
         snth3=snth*(1.0d0/3.0d0)
         cnth=dcos(theta)
c!$acc do parallel
         DO 105 ibet=1,Nb*4
            phi=-0.5d0*dphi+dphi*dble(ibet)
            cox=dcos(phi)*snth
            coy=dsin(phi)*snth
            coz=cnth
!$acc do kernel
            do irad=1,60                !160  !subroutine rad
               x=r*cox
               y=r*coy
               z=r*coz
               t1=(bdist*bdist-2.0d0*cox*r*bdist+r*r)
               t2=0.5d0*omega*omega*(1.0d0+overQ)*(1.0d0-coz*coz)
               psi=1.0d0/r+overQ*(1.0d0/dsqrt(t1)-cox*r/(bdist*bdist))+r*r*t2
               dpsidr=-1.0d0/(r*r)+overQ*((dsqrt(t1)**3)*(cox*bdist-r)
     %              -cox/(bdist*bdist))+t2*2.0d0*r
               rnew=r-(psi-psi0)/dpsidr
               dr=dabs(rnew-r)
c              if(dabs(dr).lt.1.0d-15)go to 1115
               r=rnew
            enddo
 1115       continue
            VOL = VOL + 1.0d0*R*R*R*dphi*dtheta*snth3
 105     CONTINUE
 104  CONTINUE
!$acc end region
!$acc end data region
|
Here is the compiler message:
| Code: |
19528, Generating compute capability 1.3 kernel
19530, Loop is parallelizable
Accelerator kernel generated
19530, !$acc do parallel, vector(256)
19562, Sum reduction generated for vol
19541, Loop carried scalar dependence for 'r' at line 19551
Loop carried scalar dependence for 'r' at line 19553
Loop carried scalar dependence for 'r' at line 19554
Loop carried scalar dependence for 'r' at line 19556
19547, Loop carried scalar dependence for 'r' at line 19551
Loop carried scalar dependence for 'r' at line 19553
Loop carried scalar dependence for 'r' at line 19554
Loop carried scalar dependence for 'r' at line 19556
Scalar last value needed after loop for 'r' at line 19562
Accelerator restriction: scalar variable live-out from loop: r
Inner sequential loop scheduled on accelerator
|
There is still an accelerator restriction message. I am not sure whether this matters with regard to the actual performance, but I suppose at some level it must, since the performance gets so bad. I have compiled and run some of the simple test codes from the PGI pages (from the four-part series by Michael Wolfe), and the code run on the GPU outperforms the same code run on the CPU.
Finally, I also commented out the statement in the innermost loop that bails out when convergence is reached. This ensures that the same number of operations is done in serial and in parallel mode. In serial mode, only a few iterations are needed once the first pixel in a given latitude row is done. So the comparison of run times should be fair, since the same number of operations is done in each case.
Jerry