From: Axel Kohlmeyer (akohlmey_at_gmail.com)
Date: Sun Feb 14 2010 - 15:16:22 CST

On Sun, 2010-02-14 at 14:13 -0500, Robert Wohlhueter wrote:

bob,

your problem description is pretty difficult to read; the many
commented-out lines in particular are quite confusing. it looks like
you will have to talk to your system admins.

you are trying to use openmpi and its mpirun to launch a parallel
job, but you have a non-openmpi NAMD binary, i.e. one that needs
to be launched via charmrun. combining those two will not work.
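
to make the mismatch concrete: mpiexec -np N starts N copies of whatever
command it is given, and in your script that command is
charmrun +p$NSLOTS, which then tries to start its own set of namd2
processes via rsh. a rough sketch of the arithmetic, in plain sh rather
than csh (nested_launch_count is a made-up helper, and i am assuming that
each charmrun copy really attempts all of its +p launches):

```shell
#!/bin/sh
# Sketch only: counts the processes implied by nesting charmrun inside mpiexec.
# nested_launch_count is a hypothetical helper, not part of NAMD or Open MPI.
nested_launch_count() {
    mpi_ranks=$1    # value passed to "mpiexec -np"
    charm_pes=$2    # value passed to "charmrun +p"
    # mpiexec starts mpi_ranks copies of charmrun; each copy then asks
    # for charm_pes namd2 processes of its own.
    echo $((mpi_ranks * charm_pes))
}

NSLOTS=8
echo "charmrun copies started by mpiexec: $NSLOTS"
echo "namd2 processes requested in total: $(nested_launch_count "$NSLOTS" "$NSLOTS")"
```

with NSLOTS set to 8 that is 8 charmrun copies asking for 64 namd2
processes in total, where 8 were intended.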

you have two options. either you find - with the help of your
sysadmins - a way to launch a charmrun parallel executable, and your
SGE batch environment will have to be adjusted for that (i.e. you'd
need a -pe charmrun option). or you compile namd (and charm++) from
source using the local OpenMPI installation instead of plain tcp/ip;
then you won't need to go through charmrun at all and can use the
openmpi provided mpirun/mpiexec.
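
for the charmrun route, the nodelist would have to be generated from the
hosts SGE assigns to the job rather than from a static file. a minimal
sketch in plain sh, assuming the host file SGE provides (normally
$PE_HOSTFILE) has one "hostname slots ..." entry per line. make_nodelist
and launch_cmd are hypothetical helpers, the paths are the ones from your
script, and the namd2 binary for the mpi case is assumed to be one you
built yourself against OpenMPI:

```shell
#!/bin/sh
# Sketch only. make_nodelist and launch_cmd are hypothetical helpers;
# the installation paths are copied from the original runNAMD script.

NAMDpath=/common/Applications/NAMD_2.7b2_MacOSX-PPC
MPIEXEC=/common/ompi11/bin/mpiexec

# Convert an SGE-style host file ("hostname slots ..." per line) into a
# charmrun nodelist: a "group main" header, then one "host" line per slot.
make_nodelist() {
    echo "group main"
    while read -r host slots rest; do
        [ -n "$host" ] || continue      # skip empty lines
        i=0
        while [ "$i" -lt "$slots" ]; do
            echo "host $host"
            i=$((i + 1))
        done
    done < "$1"
}

# Print (not run) the launch line for one of the two options.
# usage: launch_cmd MODE NPROCS NAMD2_BINARY CONFIG [NODELIST]
launch_cmd() {
    case "$1" in
        charmrun) echo "$NAMDpath/charmrun +p$2 ++nodelist $5 $3 $4" ;;
        mpi)      echo "$MPIEXEC -np $2 $3 $4" ;;
        *)        echo "unknown launch mode: $1" >&2; return 1 ;;
    esac
}
```

inside the job script this could be used as
make_nodelist "$PE_HOSTFILE" > ./mach_nodes, followed by running the line
that launch_cmd prints. note that only one launcher appears per line: the
two are never nested. whether charmrun is actually allowed to reach the
other nodes via rsh/ssh still depends on how the parallel environment is
set up.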

also, your batch system seems to be configured in a rather paranoid
way: the "fork: Resource temporarily unavailable" errors in your log
mean that the job cannot launch (or rather fork) any new processes.
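
a quick way to see what the batch system allows is to print the per-user
process limit from inside a small test job and compare it with what you
get in an interactive login. a sketch in plain sh (in csh the equivalent
builtin is "limit maxproc"); proc_limit is a hypothetical helper, and
which ulimit flag works depends on the shell:

```shell
#!/bin/sh
# Sketch only: report the per-user process limit seen by this shell.
# proc_limit is a hypothetical helper.
proc_limit() {
    # bash and zsh spell this "ulimit -u"; some minimal sh implementations
    # (e.g. dash) use "ulimit -p" instead. Fall back to "unknown".
    ( ulimit -u ) 2>/dev/null || ( ulimit -p ) 2>/dev/null || echo unknown
}

echo "node: $(uname -n)"
echo "max user processes: $(proc_limit)"
```

if the number reported inside the job is much smaller than the one you
see interactively, that is something to raise with your sysadmins.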

HTH,
   axel.

> I'm trying to move from a dual-core Linux machine to a 32-node x 2 cpu
> Mac PPC cluster (running namd2-2.7b2). The config files I'm using work
> on the Linux computer. It would be too verbose to relate all the
> permutations I've tried (none of which have been successful), so I cite
> just the most recent failure:
>
> Ultimately I submit the job to the cluster with the command `qsub -cwd
> -pe openmpi 8 runNAMD`, where 8 is the number of processors requested
> for this particular run. The "runNAMD" script is:
>
> *********************************************************************************************
>
> #!/bin/csh
>
> setenv NSLOTS 8
>
> setenv MachineFile "/Volumes/RAID/common/hostfile"
>
> setenv NAMDpath "/common/Applications/NAMD_2.7b2_MacOSX-PPC"
>
> setenv PATH ${PATH}:$NAMDpath
>
> setenv NAMDcmd "$NAMDpath/charmrun +p$NSLOTS ++nodelist ./mach_nodes $NAMDpath/namd2 2htq_box_test.config"
>
> # setenv NAMDcmd "$NAMDpath/namd2 +p$NSLOTS 2htq_box_test.config"  # gives error, must use charmrun
>
> # setenv NAMDcmd "$NAMDpath/namd2 2htq_box_test.config"
>
> echo " "
> echo "Running NAMD2 ( via $NAMDpath/charmrun ) ... "
> echo " "
> date
>
> # $NAMDpath/charmrun $NAMDpath/namd2 -machinefile $MachineFile -np $NSLOTS \
> # $NAMDpath/namd2 -machinefile $MachineFile -np $NSLOTS \
>
> /common/ompi11/bin/mpiexec -machinefile $MachineFile -np $NSLOTS $NAMDcmd
>
>
> echo " "
> echo "Done with NAMD"
> date
> echo " "
>
> **************************************************************************
>
> An excerpt (several repetitions of such messages) from the
> "runNAMD.e1095" redirected standard error file is:
>
> Charmrun> Error 128 returned from rsh (node001.cluster.private:0)
> /common/sge/util/arch: fork: Resource temporarily unavailable
> /bin/sh: fork: Resource temporarily unavailable
> /common/sge/util/arch: fork: Resource temporarily unavailable
> bash: fork: Resource temporarily unavailable
> Charmrun> Error 128 returned from rsh (node001.cluster.private:0)
> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for 'defaults'
> /bin/sh: fork: Resource temporarily unavailable
> bash: fork: Resource temporarily unavailable
> bash: line 1: =/common/sge/lib/darwin:$: No such file or directory
> Received disconnect from 192.168.2.1: 2: fork failed: Resource temporarily unavailable
> bash: fork: Resource temporarily unavailable
> bash: fork: Resource temporarily unavailable
> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for 'defaults'
> bash: fork: Resource temporarily unavailable
> /common/sge/util/arch: fork: Resource temporarily unavailable
> bash: line 1: =/common/sge/lib/darwin:$: No such file or directory
> bash: fork: Resource temporarily unavailable
> bash: fork: Resource temporarily unavailable
> bash: fork: Resource temporarily unavailable
> /common/sge/util/arch: fork: Resource temporarily unavailable
> bash: fork: Resource temporarily unavailable
> /common/sge/util/arch: fork: Resource temporarily unavailable
> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for 'defaults'
> bash: fork: Resource temporarily unavailable
> /bin/sh: fork: Resource temporarily unavailable
>
> **********************************************************
>
> Looking at section 15.2 of the "NAMD Users Guide" and the
> "CHARM++/Converse Installation and Usage", and trying several
> suggestions in them, hasn't helped. I am concerned by a note on the
> website (../namd/2.6/ug/node43.html), which states that "..a parallel
> program depends on a platform-specific library such as MPI to launch.."
> and "..you will likely need to recompile NAMD and its underlying Charm++
> libraries to use these machines in parallel.."
>
> Succinctly put: Is there a hint or "recipe" for how to run namd2 on such
> a Mac cluster?
>
> Thanks,
>
> Bob Wohlhueter,
> Georgia State University, Dept. of Chemistry
>

-- 
Dr. Axel Kohlmeyer  akohlmey_at_gmail.com
http://sites.google.com/site/akohlmey/
Institute for Computational Molecular Science
Temple University, Philadelphia PA, USA.