PGI User Forum


Cactus BenchADM crashes
skimmed
Joined: 19 Oct 2009
Posts: 4

Posted: Wed Oct 21, 2009 5:09 am    Post subject: Cactus BenchADM crashes

Hi
I downloaded the Cactus BenchADM benchmark and followed its tutorial.txt (as well as the article "Building Cactus BenchADM with PGI accelerator compilers" by Mathew Colgrove) to build and run the code. The CPU version compiles and runs correctly. The CUDA version (StaggeredLeapfrog2_acc1.F, which came with the package) compiled correctly but crashed during the run. I then tried the other steps, acc2 and acc3, and they all behaved the same way.

I noticed that the compiler output shows
367, !$acc do parallel, vector(2)
371, !$acc do parallel, vector(3)
while the tutorial shows "vector(8)" for the same lines. I don't know why they differ.

pgaccelinfo runs fine and the code compiles, so I assume both CUDA and the compiler are installed correctly.
I would appreciate any suggestions on what I need to do to get the code to run.

My system is Red Hat 5.1, kernel 2.6.18-128.el5 x86_64 SMP
PGI 9.0-4
Tesla C1060
CUDA 2.3


The error messages are:
[tester@bra-tesladev1 PGI_Acc_benchADM]$ make SIZE=120 OPT="-fast -ta=nvidia,time -Minfo=accel" build_acc1 run_acc1
pgfortran -fast -ta=nvidia,time -Minfo=accel -c -o objdir/StaggeredLeapfrog2_acc1.o ./src/StaggeredLeapfrog2_acc1.F
NOTE: your trial license will expire in 12 days, 11.2 hours.
NOTE: your trial license will expire in 12 days, 11.2 hours.
bench_staggeredleapfrog2:
366, Generating copyout(adm_kzz_stag(2:nx-2+1,2:ny-2+1,2:nz-2+1))
Generating copyout(adm_kyz_stag(2:nx-2+1,2:ny-2+1,2:nz-2+1))
Generating copyin(lalp(1:nx-2+2,1:ny-2+2,1:nz-2+2))
Generating copyout(adm_kyy_stag(2:nx-2+1,2:ny-2+1,2:nz-2+1))
Generating copyout(adm_kxz_stag(2:nx-2+1,2:ny-2+1,2:nz-2+1))
Generating copyout(adm_kxy_stag(2:nx-2+1,2:ny-2+1,2:nz-2+1))
Generating copyout(adm_kxx_stag(2:nx-2+1,2:ny-2+1,2:nz-2+1))
Generating copyin(lgzz(1:nx-2+2,1:ny-2+2,1:nz-2+2))
Generating copyin(lgyz(1:nx-2+2,1:ny-2+2,1:nz-2+2))
Generating copyin(lgyy(1:nx-2+2,1:ny-2+2,1:nz-2+2))
Generating copyin(lgxz(1:nx-2+2,1:ny-2+2,1:nz-2+2))
Generating copyin(lgxy(1:nx-2+2,1:ny-2+2,1:nz-2+2))
Generating copyin(lgxx(1:nx-2+2,1:ny-2+2,1:nz-2+2))
Generating copyin(adm_kzz_stag_p_p(2:nx-2+1,2:ny-2+1,2:nz-2+1))
Generating copyin(adm_kzz_stag_p(2:nx-2+1,2:ny-2+1,2:nz-2+1))
Generating copyin(adm_kyz_stag_p_p(2:nx-2+1,2:ny-2+1,2:nz-2+1))
Generating copyin(adm_kyz_stag_p(2:nx-2+1,2:ny-2+1,2:nz-2+1))
Generating copyin(adm_kyy_stag_p_p(2:nx-2+1,2:ny-2+1,2:nz-2+1))
Generating copyin(adm_kyy_stag_p(2:nx-2+1,2:ny-2+1,2:nz-2+1))
Generating copyin(adm_kxz_stag_p_p(2:nx-2+1,2:ny-2+1,2:nz-2+1))
Generating copyin(adm_kxz_stag_p(2:nx-2+1,2:ny-2+1,2:nz-2+1))
Generating copyin(adm_kxy_stag_p_p(2:nx-2+1,2:ny-2+1,2:nz-2+1))
Generating copyin(adm_kxy_stag_p(2:nx-2+1,2:ny-2+1,2:nz-2+1))
Generating copyin(adm_kxx_stag_p_p(2:nx-2+1,2:ny-2+1,2:nz-2+1))
Generating copyin(adm_kxx_stag_p(2:nx-2+1,2:ny-2+1,2:nz-2+1))
367, Loop is parallelizable
371, Loop is parallelizable
375, Loop is parallelizable
Accelerator kernel generated
367, !$acc do parallel, vector(2)
371, !$acc do parallel, vector(3)
375, !$acc do vector(16)
Using register for 'adm_kxx_stag_p'
Using register for 'adm_kxy_stag_p'
Using register for 'adm_kxz_stag_p'
Using register for 'adm_kyy_stag_p'
Using register for 'adm_kyz_stag_p'
Using register for 'adm_kzz_stag_p'
Non-stride-1 accesses for array 'lgxx'
Non-stride-1 accesses for array 'lgxy'
Cached references to size [18x5x4] block of 'lgxz'
Cached references to size [18x5x4] block of 'lgyy'
Cached references to size [18x5x4] block of 'lgyz'
Cached references to size [18x5x4] block of 'lgzz'
Cached references to size [18x5x4] block of 'lalp'
pgfortran objdir/PreLoop.o objdir/StaggeredLeapfrog1a.o objdir/StaggeredLeapfrog1a_TS.o objdir/planewaves.o objdir/teukwaves.o /cctk_ThornBindings.o objdir/StaggeredLeapfrog2_acc1.o objdir/Cactus.......
............
/InitialiseCactus_acc.o -fast -ta=nvidia,time -Minfo=accel -Mnomain -o bin/benchADM_acc1
time bin/benchADM_acc1 BenchADM_40l_120.par
--------------------------------------------------------------------------------

10
1 0101 ************************
01 1010 10 The Cactus Code V4.0
1010 1101 011 www.cactuscode.org
1001 100101 ************************
00010101
100011 (c) Copyright The Authors
0100 GNU Licensed. No Warranty
0101

--------------------------------------------------------------------------------

Cactus version: 4.0.b11
Parameter file: BenchADM_40l_120.par
--------------------------------------------------------------------------------

Activating thorn Cactus...Success -> active implementation Cactus
Activation requested for
--->einstein time benchadm pugh pughreduce cartgrid3d ioutil iobasic<---
Activating thorn benchadm...Success -> active implementation benchadm
Activating thorn cartgrid3d...Success -> active implementation grid
Activating thorn einstein...Success -> active implementation einstein
Activating thorn iobasic...Success -> active implementation IOBasic
Activating thorn ioutil...Success -> active implementation IO
Activating thorn pugh...Success -> active implementation driver
Activating thorn pughreduce...Success -> active implementation reduce
Activating thorn time...Success -> active implementation time
--------------------------------------------------------------------------------
if (recover)
Recover parameters
endif

Startup routines
BenchADM: Register slicings
CartGrid3D: Register GH Extension for GridSymmetry
CartGrid3D: Register coordinates for the Cartesian grid
PUGH: Startup routine
IOUtil: Startup routine
IOBasic: Startup routine
PUGHReduce: Startup routine.

Parameter checking routines
BenchADM: Check parameters
CartGrid3D: Check coordinates for CartGrid3D

Initialisation
CartGrid3D: Set up spatial 3D Cartesian coordinates on the GH
Einstein: Set up GF symmetries
Einstein: Initialize slicing, setup priorities for mixed slicings
PUGH: Report on PUGH set up
Time: Initialise Time variables
Time: Set timestep based on Courant condition
Einstein: Initialisation for Einstein methods
Einstein: Flat initial data
BenchADM: Setup for ADM
Einstein: Set initial lapse to one
BenchADM: Time symmetric initial data for staggered leapfrog
if (recover)
endif
if (checkpoint initial data)
endif
if (analysis)
Einstein: Compute the trace of the extrinsic curvature
Einstein: Calculate the spherical metric in r,theta(q), phi(p)
Einstein: Calculate the spherical ex. curvature in r, theta(q), phi(p)
endif

do loop over timesteps
Rotate timelevels
iteration = iteration + 1
t = t+dt
Einstein: Identify the slicing for the next iteration
BenchADM: Evolve using Staggered Leapfrog
if (checkpoint)
endif
if (analysis)
Einstein: Compute the trace of the extrinsic curvature
Einstein: Calculate the spherical metric in r,theta(q), phi(p)
Einstein: Calculate the spherical ex. curvature in r, theta(q), phi(p)
endif
enddo
Termination routines
PUGH: Termination routine
Shutdown routines
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Driver provided by PUGH
--------------------------------------------------------------------------------

INFO (IOBasic): I/O Method 'Scalar' registered
INFO (IOBasic): Scalar: Output of scalar quantities (grid scalars, reductions) to ASCII files
INFO (IOBasic): I/O Method 'Info' registered
INFO (IOBasic): Info: Output of scalar quantities (grid scalars, reductions) to screen
INFO (BenchADM): Evolve using the ADM system
INFO (BenchADM): with staggered leapfrog
INFO (CartGrid3D): Grid Spacings:
INFO (CartGrid3D): dx=>8.4033613e-03 dy=>8.4033613e-03 dz=>8.4033613e-03
INFO (CartGrid3D): Computational Coordinates:
INFO (CartGrid3D): x=>[-0.500, 0.500] y=>[-0.500, 0.500] z=>[-0.500, 0.500]
INFO (CartGrid3D): Indices of Physical Coordinates:
INFO (CartGrid3D): x=>[0,119] y=>[0,119] z=>[0,119]
INFO (PUGH): Single processor evolution
INFO (PUGH): 3-dimensional grid functions
INFO (PUGH): Size: 120 120 120
INFO (Einstein): Setting flat Minkowski space in Einstein
INFO (IOBasic): Info: Output every 10 iterations
INFO (IOBasic): Info: Output requested for EINSTEIN::gxx EINSTEIN::alp
------------------------------------------------------------------------------
it | | EINSTEIN::gxx | EINSTEIN::alp |
| t | minimum | maximum | minimum | maximum |
------------------------------------------------------------------------------
0 | 0.000 | 1.00000000 | 1.00000000 | 1.00000000 | 1.00000000 |
call to ctxSynchronize returned error 700: Launch failed

Accelerator Kernel Timing data
./src/StaggeredLeapfrog2_acc1.F
bench_staggeredleapfrog2
366: region entered 1 time
time(us): init=1
375: kernel launched 1 times
grid: [59x40] block: [16x3x2]
time(us): total=0 max=0 min=0 avg=0
acc_init.c
acc_init
1: region entered 1 time
time(us): init=51061
Command exited with non-zero status 1
1.12user 0.66system 0:01.79elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+183167minor)pagefaults 0swaps
make: *** [run_acc1] Error 1
mkcolg
Joined: 30 Jun 2004
Posts: 6021
Location: The Portland Group Inc.

Posted: Wed Oct 21, 2009 8:47 am

Hi Skimmed,

A "ctxSynchronize returned error 700" error typically means that when copying over the data to the device, there was an access violation. Exactly why this is occurring, I'm not sure. Your -Minfo output looks correct (the vector message is just a difference between 9.0-4 and 9.0-3 which is what I used to write the tutorial).

The first thing I'd try is to reboot your system. I've seen a few cases where the device driver gets into a bad state and starts giving odd errors like this.

Next, set "NVDEBUG=1" in your environment. This produces a lot of debug output, but it will show exactly which variable is causing the crash.
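
Since the debug output is large, one option is to launch the run from a small wrapper that sets the variable and writes everything to a log file. The following is only an illustrative Python sketch: the helper names `run_with_nvdebug` and `tail` and the log file name `nvdebug.log` are made up for this example, and the binary and parameter file are the ones from the original post. It makes no assumption about what the debug output contains.

```python
import os
import subprocess

def run_with_nvdebug(cmd, log_path="nvdebug.log"):
    """Run cmd with NVDEBUG=1 in its environment; write stdout and stderr to log_path."""
    env = dict(os.environ, NVDEBUG="1")  # copy of the current environment plus NVDEBUG=1
    with open(log_path, "w") as log:
        result = subprocess.run(cmd, env=env, stdout=log, stderr=subprocess.STDOUT)
    return result.returncode

def tail(path, n=20):
    """Return the last n lines of a text file."""
    with open(path, errors="replace") as f:
        return f.readlines()[-n:]

# Binary and parameter file as used in the original post (run from the benchmark directory).
BINARY = "bin/benchADM_acc1"
if os.path.exists(BINARY):
    status = run_with_nvdebug([BINARY, "BenchADM_40l_120.par"])
    print("exit status:", status)
    print("".join(tail("nvdebug.log")), end="")
```

The last lines of the log are usually the interesting part, because the run stops at the failing operation.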

Also, try one of the smaller examples found in "$PGI/linux86-64/9.0-4/etc/samples". If these fail as well, then I'd lean towards a system issue rather than a compiler issue.

- Mat
skimmed
Joined: 19 Oct 2009
Posts: 4

Posted: Wed Oct 21, 2009 9:58 am

Thanks, Mat.

A reboot eventually sorted things out and now the code runs. However, I noticed that my data transfer time (data=27132909 us vs. 7112575 us in your results) is almost four times as long. Is there a way to improve this by tuning compiler options, or is it limited by the hardware?


Accelerator Kernel Timing data
./src/StaggeredLeapfrog2_acc3.F
bench_staggeredleapfrog2
369: region entered 100 times
time(us): total=35310202 init=99 region=35310103
kernels=8177194 data=27132909
w/o init: total=35310103 max=382721 min=351194 avg=353101
410: kernel launched 100 times
grid: [118x15] block: [8x32]
time(us): total=8177194 max=82600 min=81376 avg=81771
acc_init.c
acc_init
1: region entered 1 time
time(us): init=51528
54.93user 8.19system 1:03.30elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (17major+2205915minor)pagefaults 0swaps
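
To put these numbers in proportion, the figures from the timing output can be compared directly. This is a minimal Python sketch; the values are copied from the output above and from the tutorial figure quoted in this post, and the variable names are only for illustration.

```python
# Times in microseconds, copied from the "Accelerator Kernel Timing data" above.
region_us = 35_310_103    # region time without init
kernels_us = 8_177_194    # time spent in kernels
data_us = 27_132_909      # time spent moving data
region_entries = 100      # "region entered 100 times"

# Data time from the tutorial results, as quoted in this post.
reference_data_us = 7_112_575

data_share = data_us / region_us            # fraction of region time spent on transfers
slowdown = data_us / reference_data_us      # how much longer transfers take here
per_entry_ms = data_us / region_entries / 1000

print(f"data share of region time:  {data_share:.1%}")
print(f"data time vs. tutorial:     {slowdown:.2f}x")
print(f"data time per region entry: {per_entry_ms:.1f} ms")
```

Data movement accounts for roughly 77% of the region time and takes about 3.8 times as long as the tutorial figure.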
mkcolg
Joined: 30 Jun 2004
Posts: 6021
Location: The Portland Group Inc.

Posted: Wed Oct 21, 2009 11:00 am

Check which PCIe slot your card is plugged into. I had a similar issue when I had a card in a slot with an x4 link instead of an x16 link. You might need to check your motherboard documentation to determine which PCIe slot is which. Most likely the PCIe slots closest to the CPU are the x16 links.

- Mat
skimmed
Joined: 19 Oct 2009
Posts: 4

Posted: Thu Oct 22, 2009 4:47 am

Mat,
Thanks very much for your help.
The machine (Dell Precision 690) has two PCIe x16 slots, which are occupied by a Tesla C1060 and a Quadro FX 1400. No matter which slot the Tesla was in, I got exactly the same results.
The CUDA bandwidth test showed 1300 MB/s uploading and 988 MB/s downloading, which is very slow.
It appears to be a configuration issue, but at the moment I have no idea how to solve it.
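
As a rough yardstick, the measured rates can be set against nominal PCIe link peaks. The Python sketch below assumes the usual nominal figures of 250 MB/s per lane per direction for PCIe 1.x and 500 MB/s per lane for PCIe 2.0. Which generation these slots actually run at is not known from this thread, and real transfers never reach nominal peak, so treat the result as orientation only.

```python
# Nominal per-lane, per-direction peak rates in MB/s (assumed standard figures).
PER_LANE_MB_S = {"PCIe 1.x": 250, "PCIe 2.0": 500}

# Rates reported by the CUDA bandwidth test in this post.
measured_mb_s = {"uploading": 1300, "downloading": 988}

def peak_mb_s(generation, lanes):
    """Nominal one-direction peak of a link with the given generation and lane count."""
    return PER_LANE_MB_S[generation] * lanes

for generation in PER_LANE_MB_S:
    for lanes in (4, 8, 16):
        peak = peak_mb_s(generation, lanes)
        shares = ", ".join(
            f"{name} {rate / peak:.0%}" for name, rate in measured_mb_s.items()
        )
        print(f"{generation} x{lanes:<2} peak {peak:>4} MB/s: {shares}")
```

Even against the slowest x16 case in this table (4000 MB/s), 1300 MB/s is only about a third of the nominal peak. It is, however, above the nominal peak of a PCIe 1.x x4 link (1000 MB/s).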
All times are GMT - 7 Hours
Page 1 of 2