From: itvasile_at_nipne.ro
Date: Sun Oct 04 2009 - 03:34:35 CDT
Hello Jorgen,
You didn't say anything about the nodefile for the parallel job. In your
case:
Charmrun> using /home/user/.nodelist as nodesfile
Does this file contain the IP addresses or hostnames of the remote
machines? I suspect it only lists localhost several times.
Anyway, please check your nodes file in PBS. Mine is located at
/var/spool/torque/server_priv/nodes. It should list all the machines that
are running PBS (pbs_mom) and that have passwordless ssh configured for
your username:
machine1.domain np=4
machine2.domain np=4
...
If your PBS is configured correctly, check that the nodefile for charmrun
lists the machines on which you want to run namd. It should look
like:
group main
host machine2.domain
host machine2.domain
host machine2.domain
host machine2.domain
host machine1.domain
host machine1.domain
host machine1.domain
host machine1.domain
And, of course, you should be able to ssh to these machines without a
password, and namd2 must be installed at the same path on each server.
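One way to keep the charmrun nodelist in sync with what PBS actually allocated is to generate it from $PBS_NODEFILE inside the job script, before charmrun runs. This is a sketch of my own (not from the original thread); it writes the "group main"/"host" format shown above, and the `/dev/null` fallback is only there so the snippet runs outside a PBS job:

```shell
# Sketch: build $HOME/.nodelist for charmrun from the PBS node file.
# PBS sets $PBS_NODEFILE inside a job; fall back so this runs standalone.
: "${PBS_NODEFILE:=/dev/null}"
nodelist="$HOME/.nodelist"
echo "group main" > "$nodelist"
# PBS writes one line per allocated CPU slot, so with nodes=2:ppn=4 each
# host already appears four times -- exactly the repetition charmrun expects.
while read -r host; do
    echo "host $host"
done < "$PBS_NODEFILE" >> "$nodelist"
```

With nodes=2:ppn=4 this produces "group main" followed by eight "host" lines, four per machine, matching the hand-written example above.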
HTH,
Ionut
_____________________________________________________________________
Dipl. Eng. Drd. Ionut VASILE
Junior Researcher
Horia Hulubei National Institute of Physics and Nuclear Engineering
ROMANIA
http://www.nipne.ro
Jorgen Simonsen said:
> Hi all,
> I am trying to run namd in parallel but it is not working. I have
> downloaded NAMD_2.7b1_Linux-x86_64-TCP and am using the binary namd2 and
> charmrun from there. If I run a single job there are no problems
>
> namd2 conf.conf > log.log
>
> it runs and produces the expected results. If, on the other hand, I
> submit the job asking for 8 CPUs
>
> #!/bin/sh
> ### Note: No commands may be executed until after the #PBS lines
> ### Job name
> #PBS -N test
> ### Output files
> #PBS -e test.err
> #PBS -o test.log
> ### Queue name (small, medium, long, verylong)
> #PBS -q small
> ### Number of nodes
> #PBS -l nodes=2:ppn=4
> # Define number of processors
> NPROCS=`wc -l < $PBS_NODEFILE`
> echo This job has allocated $NPROCS nodes
>
> # Go to the directory from which the job was submitted (initial
> directory is $HOME)
> echo Working directory is $PBS_O_WORKDIR
> cd $PBS_O_WORKDIR
> ./../../Programs/NAMD/charmrun ++local ../../../Programs/NAMD/namd2 +p$NPROCS min.conf > data.log
>
> it starts up 16 threads on one processor, which is of course a waste. If
> I remove the ++local and add ++verbose
> Charmrun> charmrun started...
> Charmrun> using /home/user/.nodelist as nodesfile
> Charmrun> remote shell (localhost:0) started
> Charmrun> remote shell (localhost:1) started
> Charmrun> remote shell (localhost:2) started
> Charmrun> remote shell (localhost:3) started
> Charmrun> node programs all started
> connect to address 127.0.0.1: Connection refused
> connect to address 127.0.0.1: Connection refused
> trying normal rsh (/usr/bin/rsh)
> connect to address 127.0.0.1: Connection refused
> connect to address 127.0.0.1: Connection refused
> trying normal rsh (/usr/bin/rsh)
> connect to address 127.0.0.1: Connection refused
> connect to address 127.0.0.1: Connection refused
> trying normal rsh (/usr/bin/rsh)
> connect to address 127.0.0.1: Connection refused
> connect to address 127.0.0.1: Connection refused
> trying normal rsh (/usr/bin/rsh)
> localhost.localdomain: Connection refused
> localhost.localdomain: Connection refused
> localhost.localdomain: Connection refused
> localhost.localdomain: Connection refused
> Charmrun> Error 1 returned from rsh (localhost:0)
>
> What is wrong, and how can I fix it? Thanks in advance.
>
> Best
>
> Jorgen
>
This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 15:53:20 CST