RE: Running NAMD at TACC (Ranger)

From: Richard Swenson (swenson_at_hec.utah.edu)
Date: Thu Apr 03 2008 - 18:09:22 CDT

Jim, Namd list,

I have not yet had namd hang as Jim has experienced (I have only run about
10 test runs, and most run great), but I have had other problems with namd
on ranger. I am using the binary found in
~tg455591/Linux-amd64-MVAPICH-icc/namd2, but wrote my own runscript (see
below). I have a 2.1 million atom system (restarted from a 64ns+ long
trajectory) that runs fine up to 1024 procs, but gives me a memory error
when I try to up the proc count to 2048. I get a "FATAL ERROR: Memory
allocation failed on processor 0" during phase 3. From what I can tell from
previous posts, this problem could be caused by a lack of memory, but the
fact that this system has no problem on fewer nodes makes me doubt this
cause.
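
One way to check the memory question is to pull the memory-related lines out of the NAMD log for the 1024-proc and 2048-proc runs and compare them. The helper below is only a sketch: the function name `show_memory_lines` is made up for illustration, and it assumes the log reports memory use in lines containing the word "memory" (the exact wording differs between NAMD versions).

```shell
#!/bin/bash

# Hypothetical helper: list every line of a NAMD log that mentions memory,
# prefixed with its line number, so startup-phase reports and the
# "Memory allocation failed" error can be compared across runs.
# ASSUMPTION: the log mentions the word "memory"; wording varies by version.
show_memory_lines() {
    local log="$1"
    grep -i -n "memory" "$log"
}
```

For example, `show_memory_lines log.log` run against each log would show how much memory was reported before phase 3 in the run that works and in the run that fails.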

*************************
runscript.sh:
#!/bin/bash

NAMD=/share/home/00288/tg455591/Linux-amd64-MVAPICH-icc/namd2
IBRUN=`which ibrun`

module unload mvapich2
module unload mvapich
module swap pgi intel
module load mvapich

$IBRUN VIADEV_RENDEZVOUS_THRESHOLD=5000 $NAMD $1 >> $2
*************************

*************************
submit script:
#!/bin/bash

qsub << ENDINPUT
#\$ -S /bin/bash
#\$ -V
#\$ -pe 16way 2048
#\$ -l h_rt=24:00:00
#\$ -q normal
#\$ -M myemail
#\$ -m abe
#\$ -N Ranger_2048
#\$ -cwd
#\$ -j y

./runscript.sh config.namd log.log

ENDINPUT
*************************
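
For reference, the arithmetic behind the `-pe 16way 2048` request can be sketched as follows. This is a rough illustration, not something taken from the TACC documentation: the function name `layout` is hypothetical, and it assumes 16 cores and 32 GB of memory per node and that the number after `Nway` is the total core count requested. Check the Ranger user guide before relying on these figures.

```shell
#!/bin/bash

# Hypothetical helper: estimate node count, task count and per-task memory
# share for a request of the form "-pe <tasks_per_node>way <slots>".
# ASSUMPTIONS: 16 cores and 32 GB of RAM per node; <slots> is the total
# number of cores requested.
layout() {
    local tasks_per_node="$1"
    local slots="$2"
    local cores_per_node=16
    local mem_per_node_gb=32

    local nodes=$(( slots / cores_per_node ))
    local tasks=$(( nodes * tasks_per_node ))
    local mem_per_task_mb=$(( mem_per_node_gb * 1024 / tasks_per_node ))

    echo "nodes=$nodes tasks=$tasks mem_per_task_mb=$mem_per_task_mb"
}
```

Under those assumptions, `layout 16 2048` gives 128 nodes with a 2048 MB share per task, and that per-task share is the same for the 1024-proc run, since only the node count changes.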

> -----Original Message-----
> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On Behalf
> Of Jim Pfaendtner
> Sent: Sunday, March 30, 2008 6:17 AM
> To: namd-l_at_ks.uiuc.edu
> Subject: namd-l: Running NAMD at TACC (Ranger)
>
> Hi,
>
> I am trying to run namd on the new Ranger cluster at TACC. I am using
> the script from the namd wiki that is listed here (~tg455591/
> NAMD_scripts/runbatch) to submit my jobs.
>
> I frequently am getting the following message in my log file:
>
> TACC: Starting up job 57208
> TACC: Setting up parallel environment for MVAPICH-1 mpirun.
> TACC: Setup complete. Running job script.
> TACC: starting parallel tasks...
>
> and then the system just hangs and the job doesn't run. There doesn't
> appear to be any rhyme or reason for why this happens as far as I can
> tell. I have tried to run with up to 30 nodes but as few as 10 or 15.
>
> Have other people had good luck with namd on Ranger? Any help would
> be appreciated.
>
> thanks,
> Jim
