Re: NAMD 2.7b1 on Ranger/MPI Problems?

From: Jeff Wereszczynski (jmweresz_at_mccammon.ucsd.edu)
Date: Mon Apr 20 2009 - 15:56:18 CDT

Hi Lei,
You need to change "module load mvapich" to "module load mvapich-old" in
your job script. However, I believe you can't just reuse the same namd2
executable; you will need to recompile it with the "mvapich-old" module
loaded. When I did this it wasn't too hard: charm++, fftw2, and tcl are
already installed (the first two are modules you can load, and tcl is located
under /usr/include and /usr/lib64).
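
As a rough sketch of the job-script half of that change (it does not do the
recompile), something like the following rewrites only the exact
"module load mvapich" line. Note that the module name uses a hyphen
(mvapich-old), not an underscore. The helper name swap_mvapich_module and the
temporary demo file are made up for illustration; the module lines are the
ones from the job script quoted further down.

```shell
#!/bin/bash
# Hypothetical helper: change "module load mvapich" to "module load mvapich-old"
# in a job script, leaving "module unload mvapich" and "mvapich2" lines alone.
swap_mvapich_module() {
    # $1: path to the job script; edited in place, original kept as $1.bak
    sed -i.bak 's/^\([[:space:]]*module load mvapich\)[[:space:]]*$/\1-old/' "$1"
}

# Demo on a throwaway copy of the module lines from the quoted job script.
demo_script=$(mktemp)
cat > "$demo_script" <<'EOF'
module unload mvapich2
module unload mvapich
module swap pgi intel
module load mvapich
EOF

swap_mvapich_module "$demo_script"
cat "$demo_script"   # last line should now read: module load mvapich-old
rm -f "$demo_script" "$demo_script.bak"
```

This only touches the submission script; the namd2 binary still has to be
rebuilt with mvapich-old loaded, as described above.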

Jeff

2009/4/20 Lei Shi <les2007_at_med.cornell.edu>

> Hi, Jeff
>
> Do we need to change the "module load mvapich" line in the submission
> script? There seems to be no "mvapich_old" module.
>
> Thanks.
> Lei
>
> 2009/4/17 Jeff Wereszczynski <jmweresz_at_umich.edu>:
> > Hi all,
> > Just to follow up on this, I believe I have solved the problem. On a
> > recommendation from the people at TACC, I recompiled NAMD with the
> > 'mvapich-old' module (instead of 'mvapich') and it now appears to work.
> > Jeff
> >
> > On Fri, Apr 17, 2009 at 12:12 AM, Haosheng Cui <haosheng_at_hec.utah.edu>
> > wrote:
> >>
> >> Hello all:
> >>
> >> I do have the same problem. The job usually dies after 1000 steps. If I
> >> restart the job several (3-5) times, it may go through once and run
> >> successfully for 24 hours. It seems to happen only with big systems
> >> (mine is ~800k atoms). The problem has occurred since the beginning of
> >> 2009. I tried the same job on Kraken, and it works fine most of the time.
> >> I have already asked TACC for help, but that has not helped so far.
> >>
> >> Thanks,
> >> Haosheng
> >>
> >>
> >> Quoting Jeff Wereszczynski <jmweresz_at_umich.edu>:
> >>
> >>> Hi All,
> >>> I have a system of ~490k atoms that I am trying to run on Ranger; however,
> >>> it will run for only 500-2000 steps before dying. Nothing of interest is
> >>> printed in the log file:
> >>>
> >>> WRITING EXTENDED SYSTEM TO RESTART FILE AT STEP 2000
> >>> Signal 15 received.
> >>> Signal 15 received.
> >>> Signal 15 received.
> >>> TACC: MPI job exited with code: 1
> >>> TACC: Shutting down parallel environment.
> >>> TACC: Shutdown complete. Exiting.
> >>>
> >>>
> >>> Whereas in the job output/error file I get this:
> >>>
> >>> TACC: Done.
> >>> 193 - MPI_IPROBE : Communicator argument is not a valid communicator
> >>> Special bit pattern b5000000 in communicator is incorrect. May indicate an
> >>> out-of-order argument or a freed communicator
> >>> [193] [] Aborting Program!
> >>> Exit code -3 signaled from i182-206.ranger.tacc.utexas.edu
> >>> Killing remote processes...Abort signaled by rank 193: Aborting program !
> >>> MPI process terminated unexpectedly
> >>> DONE
> >>>
> >>> Here is my job script:
> >>>
> >>> #!/bin/bash
> >>> #$ -V # Inherit the submission environment
> >>> #$ -N namd # Job Name
> >>> #$ -j y # combine stderr & stdout into stdout
> >>> #$ -o namd # Name of the output file (e.g. myMPI.oJobID)
> >>> #$ -pe 16way 256 # Requests 16 tasks/node, 256 cores total
> >>> #$ -q normal # Queue name
> >>> #$ -l h_rt=2:00:00 # Run time (hh:mm:ss) - 2 hours
> >>>
> >>> module unload mvapich2
> >>> module unload mvapich
> >>> module swap pgi intel
> >>> module load mvapich
> >>>
> >>> export VIADEV_SMP_EAGERSIZE=64
> >>> export VIADEV_SMPI_LENGTH_QUEUE=256
> >>> export VIADEV_ADAPTIVE_RDMA_LIMIT=0
> >>> export VIADEV_RENDEZVOUS_THRESHOLD=50000
> >>>
> >>> ibrun tacc_affinity \
> >>>     /share/home/00288/tg455591/NAMD_2.7b1_Linux-x86_64/namd2 \
> >>>     namd.inp >namd.log
> >>>
> >>> Any ideas what I might be doing wrong? I would guess from the error
> >>> message that it's some sort of MPI problem. I've tried varying the number
> >>> of processors (from 64 to 1104), removing the "export ..." lines that
> >>> control MPI parameters, and taking out the tacc_affinity part, but nothing
> >>> seems to help. I've never had these problems with smaller systems. Has
> >>> anyone else had these sorts of issues? Any suggestions on how to fix them?
> >>>
> >>> Thanks,
> >>> Jeff
> >>>
> >>
> >>
> >
> >
>
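
For anyone trying to tell this failure apart from an ordinary crash, a minimal
sketch of a check for the abort message quoted in the original report might
look like the following. The function name has_communicator_abort and the demo
file are hypothetical; the sample lines are copied from the job output quoted
above, and the real output file name depends on your job script.

```shell
#!/bin/bash
# Hypothetical helper: succeed if a job output/error file contains the
# MPI communicator abort quoted in this thread.
has_communicator_abort() {
    # $1: path to the job output/error file
    grep -q "Communicator argument is not a valid communicator" "$1"
}

# Demo input copied from the error output quoted in the thread.
demo_out=$(mktemp)
cat > "$demo_out" <<'EOF'
TACC: Done.
193 - MPI_IPROBE : Communicator argument is not a valid communicator
[193] [] Aborting Program!
EOF

if has_communicator_abort "$demo_out"; then
    echo "communicator abort found"
else
    echo "no communicator abort"
fi
rm -f "$demo_out"
```

If the message shows up, the thread suggests the likely remedy is the
mvapich-old rebuild rather than changing processor counts or the VIADEV
settings.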
