PGI User Forum

mvapich - mpirun timeouts

 
Sylvain K



Joined: 13 Oct 2009
Posts: 21

Posted: Tue Jun 05, 2012 9:13 am    Post subject: mvapich - mpirun timeouts

I've compiled the same code two ways, using PGI's CDK v12.4. One version is compiled, linked and run w/ mpich, the other w/ mvapich.

Running the mpich version (w/ .../mpich/bin/mpirun) is fine, but running the mvapich version (w/ ../mvapich/bin/mpirun) fails in a large share of cases with
Code:
Timeout during client startup.
ERROR: Reached mpirun timeout.  Attempting to cleanup job.
If this job is not an MPI application, you may want to run it
directly (without mpirun) or via "srun --mpi=none", if available.
Killing remote processes...Signal 15 received.
Signal 15 received.
Signal 15 received.
Signal 15 received.
DONE


The same case, re-queued, will eventually run, but the timeouts account for 28 of 72 cases (more than 1 in 3).

Any idea what may cause this? The PGI CDK was installed as provided by PGI, not built from source.

thx, S.
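
To put a number on the failure rate described above, here is a minimal sketch that counts how many job logs contain the timeout message. It assumes each case writes its output to its own `*.log` file in one directory. That layout and the function name `count_timeouts` are hypothetical, not something the post states.

```shell
#!/bin/sh
# Hypothetical helper: count job logs that contain the mpirun timeout message.
# Assumes one "*.log" file per case in the directory passed as $1.
count_timeouts() {
    total=0
    failed=0
    for log in "$1"/*.log; do
        # Skip the unexpanded glob when the directory has no logs.
        [ -f "$log" ] || continue
        total=$((total + 1))
        # The search string is taken from the error output quoted in the post.
        if grep -q 'Reached mpirun timeout' "$log"; then
            failed=$((failed + 1))
        fi
    done
    echo "$failed of $total runs hit the mpirun timeout"
}
```

Calling `count_timeouts ./logs` (with `./logs` standing in for wherever the job output lands) prints a one-line summary. Recording which nodes each failed job ran on would be a natural next step for spotting a bad node.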
mkcolg



Joined: 30 Jun 2004
Posts: 6208
Location: The Portland Group Inc.

Posted: Tue Jun 05, 2012 3:28 pm    Post subject: Re: mvapich - mpirun timeouts

Hi S,

Not being an MVAPICH user myself, I asked some other PGI Application Engineers for ideas. Here's one response:
Quote:

I've run into many, many Infiniband timeouts that were eventually resolved by tweaking the environment variables in some manner. Without very verbose output from the run, it's hard to say exactly what to change.

As far as I know - when the user links with MPICH, they end up running over the Ethernet fabric.

When linking with MVAPICH, they will be using the Infiniband fabric. These messages indicate to me that there is possibly an issue with the Infiniband hardware or software. I'd start by trying to diagnose the fabric with some point-to-point pings and small-group broadcasts. There are ways to turn on much more verbose messages to get a better idea of what is going on with the job launch and execution.

MVAPICH environment variables can also be set to better tune the fabric for the code the user is running. Some of these variables control the buffer space and the protocols used for different message sizes.

There have also been reports that using tcsh rather than bash to launch MVAPICH jobs works better, though it's unknown why that might be the case.

My suggestion would be to post on the MVAPICH website, as they are much better equipped to help track down MVAPICH issues than we are.


- Mat
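
The point-to-point check suggested in the quote could be scripted roughly as follows: launch a 2-rank job on every pair of hosts and note which pairs fail to start. This is only a sketch. The `-np` and `-hostfile` options are the ones used elsewhere in this thread, while `./pingpong` (overridable via `PINGPONG`) is a hypothetical 2-rank MPI test program built against MVAPICH, and the function names are made up for illustration.

```shell
#!/bin/sh
# Print every unordered pair of the given hosts, one pair per line.
host_pairs() {
    for a in "$@"; do
        seen=0
        for b in "$@"; do
            # Only emit hosts that come after $a in the list.
            if [ "$seen" -eq 1 ]; then
                echo "$a $b"
            fi
            if [ "$a" = "$b" ]; then
                seen=1
            fi
        done
    done
}

# Launch the 2-rank test using the hostfile in $1; succeed if the launch does.
# PINGPONG is a hypothetical test binary, not something shipped with MVAPICH.
run_pair() {
    mpirun_rsh -np 2 -hostfile "$1" "${PINGPONG:-./pingpong}" \
        </dev/null >/dev/null 2>&1
}

# Report "OK" or "FAIL" for each pair of hosts given as arguments.
test_pairs() {
    host_pairs "$@" | while read -r a b; do
        hf=$(mktemp)
        printf '%s\n%s\n' "$a" "$b" > "$hf"
        if run_pair "$hf"; then
            echo "OK   $a $b"
        else
            echo "FAIL $a $b"
        fi
        rm -f "$hf"
    done
}
```

For example, `test_pairs node01 node02 node03` (hypothetical node names) would try the three possible pairs. A host that shows up in every failing pair is a good candidate for a closer look at its Infiniband hardware or software.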
mkcolg



Joined: 30 Jun 2004
Posts: 6208
Location: The Portland Group Inc.

Posted: Wed Jun 06, 2012 8:41 am    Post subject: Re: mvapich - mpirun timeouts

Another engineer asked:

Quote:
Does he use ssh instead of rsh? That might be the problem. I get timeouts too. I changed to use -rsh.

Example,
mpirun_rsh -rsh -np 2 -hostfile mymachine myname.out

mpirun_rsh -show .... will show if it uses rsh or ssh.

Also, if this is being run on a larger cluster, there is some chance that a node or two doesn't have the user's SSH key installed. When a job lands on those nodes, it will not run because the user can't log in, but when it lands on nodes that all allow passwordless login, it will work.
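
One way to test the missing-key theory is to attempt a non-interactive login to every node before submitting. The sketch below assumes a hostfile with one hostname per line, optionally followed by a `:N` count, and the function names are hypothetical. `BatchMode=yes` makes ssh fail rather than prompt for a password, and `ConnectTimeout` bounds the wait per host.

```shell
#!/bin/sh
# Try a passwordless login to host $1 and run a no-op command there.
# -n keeps ssh from swallowing the host list on stdin.
probe_host() {
    ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$1" true >/dev/null 2>&1
}

# Print "OK <host>" or "FAIL <host>" for each distinct host in hostfile $1.
# Assumed format: one host per line, optional ":N" suffix, "#" comments.
check_hosts() {
    awk 'NF && $1 !~ /^#/ { sub(/:.*/, "", $1); print $1 }' "$1" | sort -u |
    while read -r host; do
        if probe_host "$host"; then
            echo "OK   $host"
        else
            echo "FAIL $host"
        fi
    done
}
```

Running `check_hosts mymachine` against the hostfile from the example above lists any node that refuses a key-based login. Jobs that land on such a node would be expected to hang at startup until mpirun gives up.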