PGI User Forum

Problem running mpirun from head node in cluster
 
adityak
Joined: 19 Mar 2007
Posts: 10

Posted: Thu Mar 22, 2007 8:59 am    Post subject: Problem running mpirun from head node in cluster

Hi,
I went ahead and chose Red Hat Linux version 4 as the operating system, although the cluster is actually running CentOS (this is another topic in the forum). In this configuration I chose the head node to be a computation node. Since the head node has 4 processors, mpirun -np x mpihello works fine for x up to 4. Beyond that (when it needs to communicate with other nodes) it just hangs. In my last installation I had not used the head node for computation, and I could not run mpirun from the head node at all.

Now if I log on to another node, mpirun -np x mpihello works fine for x up to the maximum number of processors on the cluster.

So why is it that when mpirun -np x mpihello is invoked from the head node it cannot contact the other nodes, but the reverse works? (I am using rsh as the communication protocol between nodes.)

Also note that something like mpirun -np 10 hostname works fine. So the problem appears to arise only when mpihello makes MPI calls.
hongyon
Joined: 19 Jul 2004
Posts: 551

Posted: Thu Mar 22, 2007 11:16 am

Dear adityak,

CentOS is based on Red Hat. If you have CentOS 4, then Red Hat 4 should be your choice.

For MPI, I have a few questions about your MPI installation.

1) Did you install as root or non-root?

2) Where did you install it: in a local area, or on a shared file system that the slave nodes can see?

3) Assuming you run MPICH1: what is in mpi/mpich/share/machines.LINUX? It should contain the names of all the nodes you want to run on; otherwise you will need to edit it and add the hostnames of the slave nodes, or provide a machinefile. If you installed as root, the install script should handle this, and it modifies the /etc/hosts file as well.

4) What is nolocal set to in mpi/mpich/bin/mpirun.arg? I would recommend that you allow the head node to be part of the computation. The user can always choose not to run on the head node. If you allow it, nolocal should be set to 0. If nolocal=1, you cannot run on the local machine.

5) What is RSHCOMMAND set to in mpi/mpich/bin/mpirun?
It is normally set to either /usr/bin/ssh or /usr/bin/rsh.

6) Using the full path of ssh or rsh from 5) (assuming rsh/ssh is in /usr/bin), try this on the head node:

/usr/bin/ssh headnode date
OR
/usr/bin/rsh headnode date

headnode here is the name of your head node. This name should be the same as the one that appears in machines.LINUX.

Also try this to and from the head node and the slave nodes, and among the slave nodes themselves. MPI requires that all of these work.

Do they work? If not, then there is a fundamental problem with the cluster that needs to be fixed first.
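
The check in 6) can be repeated for every node with a small loop. The following is only a sketch for sh, not something shipped with MPICH or the PGI CDK: check_hosts is a hypothetical helper name, and it assumes the machines file lists one host per line, optionally followed by ":N" for a processor count. Pass the same remote shell that RSHCOMMAND points to.

```shell
#!/bin/sh
# Sketch: run `date` on every host listed in a machines file through the
# given remote shell, and report which hosts answer.
# check_hosts is a hypothetical helper, not part of MPICH or the PGI CDK.
# Usage: check_hosts MACHINEFILE [REMOTE_SHELL]
check_hosts() {
    machinefile=$1
    remote_shell=${2:-/usr/bin/rsh}
    failed=0
    # Read the host list on descriptor 3 so the remote shell cannot consume it.
    while read -r entry _rest <&3 || [ -n "$entry" ]; do
        case $entry in
            ''|'#'*) continue ;;      # skip blank lines and comments
        esac
        host=${entry%%:*}             # drop an optional ":N" processor count
        if "$remote_shell" "$host" date </dev/null >/dev/null 2>&1; then
            echo "OK   $host"
        else
            echo "FAIL $host"
            failed=1
        fi
    done 3<"$machinefile"
    return "$failed"
}
```

For example, check_hosts mpi/mpich/share/machines.LINUX /usr/bin/rsh run on the head node, and then again on each slave node, would cover the combinations described above.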

Hongyon
adityak
Joined: 19 Mar 2007
Posts: 10

Posted: Thu Mar 22, 2007 12:21 pm

Hi Hongyon,

OK, here's the problem. I realised that there are two mpirun scripts, and the one I was using is in /usr/bin. If I use the script from /opt/pgi/linux86-64/6.2/mpi/mpich/bin, I get the following error:

0 - MPI_INIT : MPIRUN chose the wrong device ch_p4; program needs device ch_ipath.

The two scripts also differ from each other. Any ideas on this?

Also, the date command works in all the combinations you mentioned, using both rsh and ssh. I think we can discuss this further once I know which mpirun script to use.

Thanks
Aditya
hongyon
Joined: 19 Jul 2004
Posts: 551

Posted: Thu Mar 22, 2007 1:19 pm

Hi Aditya,

Please use mpirun from /opt/pgi/linux86-64, and also make sure to set the environment variable PGI to /opt/pgi. The PGI MPI scripts rely on it. That could be the reason you get an error.

Here is an example of setting up a 64-bit environment for running MPICH1 under csh, assuming you installed the PGI CDK in /opt/pgi.

% setenv PGI /opt/pgi
% setenv PATH /opt/pgi/linux86-64/6.2/bin:$PATH
% setenv PATH /opt/pgi/linux86-64/6.2/mpi/mpich/bin:$PATH
% which mpirun # check which mpirun, should come from /opt/pgi.
% which pgf90 # check PGI compiler.
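
For users of sh or bash rather than csh, the same settings might look roughly like the sketch below. It assumes the same /opt/pgi install location and 6.2 release directory used in this thread, and it uses command -v in place of which.

```shell
# Sketch: Bourne-shell (sh/bash) counterpart of the csh settings above.
# Assumes the PGI CDK is installed in /opt/pgi, release 6.2, as in this thread.
PGI=/opt/pgi
export PGI
PATH=$PGI/linux86-64/6.2/bin:$PATH
PATH=$PGI/linux86-64/6.2/mpi/mpich/bin:$PATH
export PATH
# Check which mpirun is found first; it should come from /opt/pgi.
command -v mpirun || echo "mpirun not found on PATH"
# Check that the PGI compiler is found.
command -v pgf90 || echo "pgf90 not found on PATH"
```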

Make sure that you recompile your program and run it again. Also check machines.LINUX if you don't provide a machinefile when you run your program.

I am not sure how /usr/bin/mpirun got there. It could be that somebody installed it, possibly configured to use shared memory. That would explain why you got an error regarding shared memory.

As Mat mentioned, the PGI CDK is not configured to use shared memory. We configured it to use the ch_p4 device. This could be one of several differences between the two mpirun scripts.
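
When more than one mpirun is installed, it can help to list every copy on the PATH in search order before comparing them. This is a minimal sketch for sh; list_on_path is a hypothetical helper name, not a PGI or MPICH tool.

```shell
# Sketch: print every executable called NAME found on PATH, in search order.
# list_on_path is a hypothetical helper, not part of PGI or MPICH.
# Usage: list_on_path NAME
list_on_path() {
    name=$1
    old_ifs=$IFS
    IFS=:
    for dir in $PATH; do
        [ -n "$dir" ] || dir=.        # an empty PATH entry means the current directory
        if [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
            echo "$dir/$name"
        fi
    done
    IFS=$old_ifs
}
```

Running list_on_path mpirun shows which script the shell picks first. The two scripts from this thread could then be compared with diff /usr/bin/mpirun /opt/pgi/linux86-64/6.2/mpi/mpich/bin/mpirun.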


Hongyon
adityak
Joined: 19 Mar 2007
Posts: 10

Posted: Thu Mar 22, 2007 3:02 pm

Hi Hongyon,
I did try setting the path variables as you directed and running the compilers and mpirun from /opt/pgi, but as I mentioned, I get this error:

0 - MPI_INIT : MPIRUN chose the wrong device ch_p4; program needs device ch_ipath.

It asks for the ch_ipath device. Since the MPI library provided with the CDK is built using the -ch_p4 option, will it not work on an InfiniPath cluster? Is getting a library built with -ch_ipath my only option? If so, how can I get that?

Thanks
Aditya
All times are GMT - 7 Hours