mkcolg
Joined: 30 Jun 2004 Posts: 4996 Location: The Portland Group Inc.
|
Posted: Fri Nov 26, 2004 9:42 am Post subject: |
|
|
Hi Craig,
The office is closed for a few days due to the Thanksgiving holiday so I don't have access to WRF. We'll be back on Monday so if you don't mind waiting I'll see what I can determine then.
Thanks,
Mat |
|
mkcolg
Joined: 30 Jun 2004 Posts: 4996 Location: The Portland Group Inc.
|
Posted: Wed Dec 01, 2004 11:29 am Post subject: |
|
|
Hi Craig,
Unfortunately, I have not been able to recreate the error, so I don't have a good idea of how to fix it. Could you characterize how it is seg faulting?
I'd like you to re-build with "-g -O0 -mp" and re-run. If it still seg faults, run it again in pgdbg or gdb and determine which file and which line it seg faults at (use the 'where' and 'disasm' commands). If it does not seg fault at -O0, then continue adding higher optimization until it does, i.e. "-O2 -g -mp", "-fast -g -mp", "-fastsse -g -mp".
Thanks,
Mat |
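The rebuild sequence described in the post above can be sketched as a small shell script. This is only an illustration: the function name `flag_ladder` is made up here, and the actual rebuild and run commands depend on the local WRF setup, so they are left as a comment rather than guessed.

```shell
#!/bin/sh
# Flag sets quoted in the post above, lowest optimization first.
# flag_ladder is a hypothetical helper name, not part of WRF or PGI.
flag_ladder() {
    printf '%s\n' \
        "-g -O0 -mp" \
        "-O2 -g -mp" \
        "-fast -g -mp" \
        "-fastsse -g -mp"
}

# Print each step with a number so every rebuild can be logged.
n=0
flag_ladder | while IFS= read -r flags; do
    n=$((n + 1))
    echo "step $n: $flags"
    # Rebuild with these flags and re-run here (site-specific);
    # stop at the first step where the executable seg faults.
done
```

The first step that seg faults gives the lowest optimization level at which the problem shows up, which is the build to take into the debugger.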
|
Craig Arthur
Joined: 01 Sep 2004 Posts: 5
|
Posted: Tue Dec 07, 2004 4:29 pm Post subject: |
|
|
Hi Mat,
I started out with the basic '-g -O0 -mp' flag set, and compilation failed with a long list of errors. So I stepped back to the default set (those in the configure.wrf I posted previously) and progressively worked back to a point at which compilation was successful. The most basic set I could get down to was '-g -O0 -mp -byteswapio -Mfree' (I can't find any mention of '-Mfree' in the PGF User Guide, so I'm unsure of its effect).
I ran the compiled executable in pgdbg, with pgienv omp on, and I can reach the first OMP command, which is in the subroutine SOLVE_EM.
| Code: |
pgdbg> step
Stopped at 0x490705, function solve_em, file solve_em.f, line 1523
#1523: !$OMP PARALLEL DO &
pgdbg> step
pgserv 27022: pr_ptrace (req PTRACE_PEEKTEXT, pid 27023)
pgserv 27022: read: unable to read address 0x490728
pgserv 27022: pr_ptrace (req PTRACE_PEEKTEXT, pid 27023)
pgserv 27022: read: unable to read address 0x85f2d0
pgserv 27022: pr_ptrace (req PTRACE_PEEKTEXT, pid 27023)
pgserv 27022: read: unable to read address 0x8600e0
pgserv 27022: pr_ptrace (req PTRACE_PEEKTEXT, pid 27023)
pgserv 27022: read: unable to read address 0x85fe58
pgserv 27022: pr_ptrace (req PTRACE_PEEKTEXT, pid 27023)
pgserv 27022: read: unable to read address 0x4ca6f0
pgserv 27022: pr_ptrace (req PTRACE_PEEKTEXT, pid 27023)
pgserv 27022: read: unable to read address 0xb7a688
Stopped at 0x49072a, function solve_em, file solve_em.f, line 1526
#1526: DO ij = 1 , grid%num_tiles
pgdbg> step
|
The relevant code lines are
| Code: |
!$OMP PARALLEL DO &
!$OMP PRIVATE ( ij )
DO ij = 1 , grid%num_tiles
   CALL rk_step_prep ( config_flags, rk_step, &
                       u_2, v_2, w_2, t_2, ph_2, mu_2, &
                       moist_2, &
                       ru, rv, rw, ww, php, alt, muu, muv, &
                       mub, mut, phb, pb, p, al, alb, &
                       cqu, cqv, cqw, &
                       msfu, msfv, msft, &
                       fnm, fnp, dnw, rdx, rdy, &
                       num_3d_m, &
                       ids, ide, jds, jde, kds, kde, &
                       ims, ime, jms, jme, kms, kme, &
                       grid%i_start(ij), grid%i_end(ij), &
                       grid%j_start(ij), grid%j_end(ij), &
                       k_start, k_end )
END DO
!$OMP END PARALLEL DO
|
And on stepping into the DO loop, the debugger dies reporting
| Code: |
pgserv 27022: read: stranger PID 27023
db_set_code_brk : DiBreakpointSet fails
pgserv 27022: cont : no threads to continue
|
I decided it was worth running an idealised case compiled with the same configure.wrf (em_quarter_ss), as it does run on 2 CPUs. The debugger dies at the same location as in the real case, reporting the same errors. As such, I don't think I'm actually reaching the point where em_real is seg faulting. |
|
mkcolg
Joined: 30 Jun 2004 Posts: 4996 Location: The Portland Group Inc.
|
Posted: Wed Dec 08, 2004 11:48 am Post subject: |
|
|
Hi Craig,
Sorry, I should have been clearer and said to change just the "FCOPTIM" flag and leave "FCBASEOPTS" as is. Also, "-Mfree" and "-Mfixed" override the extension (.F, .F90, .f, .f90) to indicate whether the file is free or fixed form.
Since 5.1-6 pre-dates Fedora Core 2 and a lot changed with the thread library, the 5.1 version of pgdbg cannot step through parallel regions. Again, I should have been clearer. Please run the application without stepping and let it run until it seg faults. Then use the "where" command to see where you are in the program and "disasm" to see what assembly instructions were being executed. Also, please run the exe outside of the debugger to ensure that it does indeed still seg fault at the lower optimization.
Since 5.1-6 does not officially support Fedora Core 2, I'd also like you to try upgrading to 5.2-4 (http://www.pgroup.com/support/download_release.php). It is possible that we have an incompatibility between 5.1-6 and Fedora Core 2. Also, the debugger went through a major upgrade. Note that we upgraded your license to 5.2, but you'll need to regenerate your license key in order for the 5.2 compilers to work beyond the 15 day evaluation.
Thanks,
Mat |
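Changing only "FCOPTIM" while leaving "FCBASEOPTS" untouched could be done by hand in an editor, or with a short script along these lines. This is a sketch under assumptions: the helper name `set_fcoptim` is made up, and the two-line configure.wrf written to a temporary directory is a stand-in for illustration, not the real file from this thread.

```shell
#!/bin/sh
# set_fcoptim FILE FLAGS: replace the value on the FCOPTIM line only.
# Every other line, including FCBASEOPTS, is copied through unchanged.
# Hypothetical helper; assumes FLAGS contains no '|', '&' or backslash.
set_fcoptim() {
    file=$1
    flags=$2
    tmp="$file.tmp"
    sed "s|^\(FCOPTIM[[:space:]]*=\).*|\1 $flags|" "$file" > "$tmp" &&
        mv "$tmp" "$file"
}

# Demo on a made-up stand-in file, not the configure.wrf from the thread.
demo=$(mktemp -d)
cat > "$demo/configure.wrf" <<'EOF'
FCOPTIM         = -O2
FCBASEOPTS      = -byteswapio -Mfree
EOF

set_fcoptim "$demo/configure.wrf" "-g -O0 -mp"
cat "$demo/configure.wrf"
```

Anchoring the pattern at the start of the line keeps the substitution from touching "FCBASEOPTS" or any other variable that merely mentions the optimization flags.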
|
Craig Arthur
Joined: 01 Sep 2004 Posts: 5
|
Posted: Thu Dec 16, 2004 3:51 pm Post subject: |
|
|
Hi Mat,
I've gone through the steps you set out above, and the executable continues to seg fault. Below is one example of the output from the debugger when running wrf.exe compiled with "-g -O0 -mp".
| Code: |
([1] New Thread)
WRF NUMBER OF TILES FROM OMP_GET_MAX_THREADS = 2
WRF NUMBER OF TILES = 2
[0] Signalled SIGSEGV at 0x5D88EB, function surface_driver, file module_surface_driver.f, line 374
5D88EB: F3 F 11 4 8A movss %xmm0,(%rdx,%rcx,4)
pgdbg [all] 0> where
surface_driver line: "module_surface_driver.f"@374 address: 0x5D88EB
pgdbg [all] 0> disasm
5D88EB: F3 F 11 4 8A movss %xmm0,(%rdx,%rcx,4)
5D88F0: FF 85 50 FF FF FF incl -176(%rbp)
5D88F6: FF 8C 24 60 1 0 0 decl 352(%rsp)
pgdbg [all] 0> threads
0 ID PID STATE SIGNAL LOCATION
=> 0 30926 Signalled SIGSEGV surface_driver line: "module_surface_driver.f"@374 address: 0x5D88EB
1 30927 Stopped SIGSTOP __GI_sched_yield file: interp.c address: 0x3EE7DA4129
|
The catch, though, is that the seg fault is not consistent in where it occurs. I have found about 5 different points where execution stops, most often in the surface_driver function.
For this reason, I'm starting to suspect the compiler is not the direct source of the issue. I'll play around some more with the debugger to see if I can glean any more information about what's occurring. |
|