Re: running namd in parallel

From: itvasile_at_nipne.ro
Date: Sun Oct 04 2009 - 03:34:35 CDT

Hello Jorgen,

You didn't say anything about the nodefile for the parallel job. In your
case:
Charmrun> using /home/user/.nodelist as nodesfile
Does this file contain the IP addresses or hostnames of the remote
machines? I suspect it contains only localhost, repeated a few times.
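
A quick way to answer that question is to count the distinct hosts in the
file charmrun reported. This is only a sketch: the function name list_hosts
is made up for illustration, and it assumes the usual "host NAME" line
format of a charmrun nodelist.

```shell
# list_hosts FILE: print each distinct host named in a charmrun nodelist,
# preceded by the number of times it appears. (Hypothetical helper name.)
list_hosts() {
    awk '$1 == "host" { count[$2]++ }
         END { for (h in count) print count[h], h }' "$1" | sort -k2
}

# Example: list_hosts "$HOME/.nodelist"
```

If the only host printed is localhost, charmrun has nowhere else to start
processes.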

Anyway, please check your nodes file in PBS. Mine is located in
/var/spool/torque/server_priv/nodes. It should list all the machines on
which PBS (pbs_mom) is running and to which your username can ssh
without a password.

machine1.domain np=4
machine2.domain np=4
...

If your PBS is configured correctly, check that the nodefile for charmrun
lists the machines on which you want to run namd. It should look
like:

group main
 host machine2.domain
 host machine2.domain
 host machine2.domain
 host machine2.domain
 host machine1.domain
 host machine1.domain
 host machine1.domain
 host machine1.domain
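
Rather than writing this file by hand, it can be generated inside the job
script from $PBS_NODEFILE, which PBS fills with one hostname per allocated
core. The following is a minimal sketch, not a tested recipe for your
cluster: the function name make_nodelist, the output file name and the
/path/to/... placeholders are all assumptions.

```shell
# make_nodelist PBS_FILE OUT_FILE: convert a PBS node file (one hostname per
# allocated core) into a charmrun nodelist. (Hypothetical helper name.)
make_nodelist() {
    {
        echo "group main"
        # Skip blank lines; emit one "host" entry per allocated core.
        awk 'NF { print " host " $1 }' "$1"
    } > "$2"
}

# Inside the PBS job script, usage might look like this (placeholder paths):
#   make_nodelist "$PBS_NODEFILE" "$PBS_O_WORKDIR/nodelist"
#   NPROCS=$(wc -l < "$PBS_NODEFILE")
#   export CONV_RSH=ssh    # assumption: makes charmrun use ssh instead of rsh
#   /path/to/charmrun ++nodelist "$PBS_O_WORKDIR/nodelist" \
#       +p"$NPROCS" /path/to/namd2 min.conf > data.log
```

Passing the file explicitly with ++nodelist avoids falling back to
~/.nodelist, which is what your ++verbose output shows charmrun using.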

And, of course, you should be able to ssh to these machines without a
password, and namd2 must be located at the same path on each server.
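
Both conditions can be checked from the submit host before queueing the
job. The sketch below is an illustration only: check_hosts is a made-up
name, the SSH variable exists just so the ssh client can be swapped out,
and the namd2 path you pass in is whatever applies on your machines.

```shell
# check_hosts NODELIST NAMD2_PATH: for every distinct host in a charmrun
# nodelist, verify that ssh works without a password prompt and that namd2
# is executable at the given path. (Hypothetical helper name.)
# SSH may be set to another command; it defaults to the real ssh client.
check_hosts() {
    status=0
    for h in $(awk '$1 == "host" { print $2 }' "$1" | sort -u); do
        # BatchMode=yes makes ssh fail instead of asking for a password.
        if ${SSH:-ssh} -o BatchMode=yes "$h" test -x "$2"; then
            echo "$h: ok"
        else
            echo "$h: FAILED"
            status=1
        fi
    done
    return $status
}

# Example: check_hosts "$HOME/.nodelist" /path/to/namd2
```

A FAILED line means either the passwordless login or the namd2 path is
wrong on that host; running the ssh command by hand will show which.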

HTH,
Ionut

_____________________________________________________________________
   Dipl. Eng. Drd. Ionut VASILE
   Junior Researcher
   Horia Hulubei National Institute of Physics and Nuclear Engineering
   ROMANIA
   http://www.nipne.ro

Jorgen Simonsen said:
> Hi all,
> I am trying to run namd in parallel but it is not working. I have
> downloaded NAMD_2.7b1_Linux-x86_64-TCP and am using the namd2 and
> charmrun binaries from it. If I run a single job there are no problems:
>
> namd2 conf.conf > log.log
>
> it runs and produces the expected results. If, on the other hand, I
> submit the job asking for 8 CPUs
>
> #!/bin/sh
> ### Note: No commands may be executed until after the #PBS lines
> ### Job name
> #PBS -N test
> ### Output files
> #PBS -e test.err
> #PBS -o test.log
> ### Queue name (small, medium, long, verylong)
> #PBS -q small
> ### Number of nodes
> #PBS -l nodes=2:ppn=4
> # Define number of processors
> NPROCS=`wc -l < $PBS_NODEFILE`
> echo This job has allocated $NPROCS nodes
>
> # Go to the directory from which the job was submitted (the initial
> # directory is $HOME)
> echo Working directory is $PBS_O_WORKDIR
> cd $PBS_O_WORKDIR
> ./../../Programs/NAMD/charmrun ++local ../../../Programs/NAMD/namd2 \
>     +p$NPROCS min.conf > data.log
>
> it starts up 16 threads on one processor, which is of course a waste. If I
> remove ++local and add ++verbose, I get:
> Charmrun> charmrun started...
> Charmrun> using /home/user/.nodelist as nodesfile
> Charmrun> remote shell (localhost:0) started
> Charmrun> remote shell (localhost:1) started
> Charmrun> remote shell (localhost:2) started
> Charmrun> remote shell (localhost:3) started
> Charmrun> node programs all started
> connect to address 127.0.0.1: Connection refused
> connect to address 127.0.0.1: Connection refused
> trying normal rsh (/usr/bin/rsh)
> connect to address 127.0.0.1: Connection refused
> connect to address 127.0.0.1: Connection refused
> trying normal rsh (/usr/bin/rsh)
> connect to address 127.0.0.1: Connection refused
> connect to address 127.0.0.1: Connection refused
> trying normal rsh (/usr/bin/rsh)
> connect to address 127.0.0.1: Connection refused
> connect to address 127.0.0.1: Connection refused
> trying normal rsh (/usr/bin/rsh)
> localhost.localdomain: Connection refused
> localhost.localdomain: Connection refused
> localhost.localdomain: Connection refused
> localhost.localdomain: Connection refused
> Charmrun> Error 1 returned from rsh (localhost:0)
>
> What is wrong and how can I fix this? Thanks in advance.
>
> Best
>
> Jorgen
>
