Re: NAMD Run Instabilities

From: Marc Gordon (marcgrdn55_at_gmail.com)
Date: Mon Sep 10 2012 - 08:19:18 CDT

Hi Axel

Thanks for the response; my replies are inline below.

On Thu, Sep 6, 2012 at 3:27 PM, Axel Kohlmeyer <akohlmey_at_gmail.com> wrote:

> On Thu, Sep 6, 2012 at 3:15 PM, Marc Gordon <marcgrdn55_at_gmail.com> wrote:
> > Hi all
> >
> > I've recently started running my NAMD sims on a cluster on campus, as
> > opposed to on my local machine, so that I can do a whole lot more a
> > whole lot quicker.
> >
> > I uploaded the exact same config file and the various force field and
> > structure files onto the cluster and began some test runs just to be
> > sure that the results being produced were comparable to the stuff I was
> > getting on my local machine.
> >
> > Long story short the results I get from these runs are inconsistent.
> >
> > As an example I can start 2 identical runs and 1 falls over with some or
> > other instability error halfway through (usually "atoms moving too fast"
> > or "low global exclusion count" errors but it varies) while the other
> > runs to completion. Is this normal NAMD behaviour? Any idea what could
> > be causing
>
> two possibilities: failing hardware causing bit flips, or too
> aggressive simulation parameters.
>
> inconsistent behavior can be expected due to load balancing,
> particularly when running on nodes that are not used exclusively
> or that differ in hardware.
>

Sorry, I should have clarified this in my initial mail: I am running this on
a single node at this point. I did initially try running the sims across
multiple nodes using both mpirun and charmrun, but the estimated simulation
times were ridiculous (hundreds upon hundreds of hours). Now I just run the
multicore build of NAMD 2.9 on 12 cores, so I pass it the +p12 option.
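
For what it's worth, the launch line is just the standard multicore
invocation (the config and log file names below are placeholders, not my
actual ones):

  namd2 +p12 disacch.conf > disacch.log

and the earlier multi-node attempts went through charmrun, roughly along
these lines:

  charmrun ++nodelist nodelist.txt +p24 /path/to/namd2 disacch.conf > disacch.log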

Granted, I was only running on 8 cores, but I have still noticed that when I
run NAMD across multiple nodes with small numbers of cores the runtimes
increase badly. Others I have asked have suggested inter-node latency, and
that the load-balancing phase that occurs every few steps carries too large
an overhead to be offset by so few CPUs.

The nodes I was initially running across were pretty much identical in terms
of hardware. Still, why would you say that load balancing can lead to
inconsistent behaviour?

> > this? I have had the poor admin for the cluster install multiple
> > versions of NAMD to test this (2.8, 2.9 nightly build, 2.9 stable
> > precompiled, 2.9 stable compiled on the cluster), and they all seem to
> > follow a similar pattern (although the nightly build has developed
> > instabilities on every run so far).
> >
> > To give you some background, I am simulating a disaccharide, no water or
> > anything, just the disaccharide. The only thing that could be said to be
> > beyond the "totally vanilla carb sim" label is that I am using the
> > colvars module for some metadynamics sampling on the dihedral bond
> > between the residues.
>
> how many atoms does your system have? sounds like there would be very few.
> i assume you are not running in parallel then, right?
>

Running in SMP mode with 12 cores. That takes a huge whack of time out of
the simulations, which is great.

As for atom numbers, I don't have the structure files in front of me at the
moment, but if memory serves it was 47 atoms. (I can check this and get a
more precise number if it is important?) I'm guessing that by the standards
you are used to in molecular dynamics that is a small number.

>
> > Any insights would be appreciated, as right now I am tearing my hair
> > out with this problem.
>
> the usual reminder:
> it is difficult to discuss such problems at such a generic level.
> running an MD is a bit of a balancing act, and debugging it is
> like being a medical doctor: the description from the patient does
> not necessarily point to the real problem. the more details and
> tangible information are made available, the better the diagnosis.
> otherwise there is not much more to say than:
> "doctor, doctor. it hurts when i am doing this."
> "well, don't do it then."
>
> cheers,
> axel.
>
> --
> Dr. Axel Kohlmeyer akohlmey_at_gmail.com http://goo.gl/1wk0
> International Centre for Theoretical Physics, Trieste. Italy.
>

Yes, I have noticed this. This is pretty much my first study involving
molecular dynamics, and on my local machine it took quite a while before I
found config settings that worked.

I will try to provide any details you think might be helpful. Perhaps the
relevant portions of the config file:

binaryoutput yes
outputEnergies 100
dcdfreq 1000

exclude scaled1-4
1-4scaling 1
COMmotion no
dielectric 1.0

# switching function between 9 and 10 A, pairlist built out to 12 A
switching on
switchdist 9
cutoff 10
pairlistdist 12

# heat from 25 K in 25 K increments every 1000 steps, then hold at 300 K
reassignFreq 1000
reassignTemp 25
reassignIncr 25
reassignHold 300

colvars on
colvarsConfig PhiPsiMetaDaDGlc-a12-aLRha_charmm.txt

run 1500000000
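
For anyone unfamiliar with the colvars side of it, the referenced colvars
file is essentially a dihedral colvar with a metadynamics block acting on
it; a minimal sketch is below (the atom numbers and hill parameters are
placeholders for illustration, not the values from my actual file):

colvar {
  name phi
  # glycosidic dihedral defined by four atoms (placeholder atom numbers)
  dihedral {
    group1 { atomNumbers 5 }
    group2 { atomNumbers 7 }
    group3 { atomNumbers 9 }
    group4 { atomNumbers 11 }
  }
}

metadynamics {
  colvars phi
  hillWeight 0.01        # hill height in kcal/mol (placeholder)
  newHillFrequency 100   # add a hill every 100 steps (placeholder)
  hillWidth 3.0          # Gaussian width in units of the colvar width (placeholder)
}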
