NAMD Wiki: NamdTroubleshooting


So you tried running NAMD and ended up with an exploding molecule or a giant core dump? Rest assured that you're not alone. If you don't find the answer below or on NamdKnownBugs, or if the answer you do find isn't completely right, please add your experience to this page. -JimPhillips

Won't launch (network versions)

  • Try adding ++verbose to the charmrun command line to get more status updates (see the example after this list).
  • Try using full paths to the charmrun and namd2 binaries.
  • Run the namd2 binary directly to check for missing shared libraries (or use ldd). If it is not finding a library that exists, try adding that library's directory to your LD_LIBRARY_PATH.
  • Try adding ++local to the charmrun command line to run only on the current host.
  • Can you rsh or ssh to the other nodes without a password? Make it so!
  • If you need to use ssh, make sure to setenv CONV_RSH ssh (csh) or export CONV_RSH=ssh (bash).
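
A minimal launch sketch combining these suggestions (host names and paths are placeholders; adjust for your cluster):

% cat nodelist
group main
host node1
host node2
% setenv CONV_RSH ssh
% /full/path/to/charmrun +p4 ++verbose ++nodelist nodelist /full/path/to/namd2 myconfig.namd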

Won't launch (ibverbs version)

  • Create a nodelist file, since ++local does not work in the ibverbs version and will fail immediately on start with the error "Charmrun: Bad initnode data length. Aborting" (see the example below).
  • Be sure that limit memorylocked (csh) or ulimit -l (bash) returns "unlimited", which is needed for RDMA. If not, you will need to add "ulimit -l unlimited >/dev/null 2>&1" to /etc/init.d/sshd, /etc/init.d/sgeexecd, etc., since these daemons start the shells that start the namd2 binary.
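
A quick sanity check, with placeholder host names (the nodelist format is the same as for the network versions):

% cat nodelist
group main
host ib-node1
host ib-node2
% ssh ib-node1 /bin/sh -c 'ulimit -l'      # should print "unlimited" on every node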

Won't launch (MPI versions)

I don't have much experience with these problems. If you know something please add...

Hangs while determining CPU topology

  • Try adding +skip_cpu_topology to the command line. If that works, then the problem is likely contention at the DNS server. Possible solutions are to enable nscd (the name service cache daemon) or to add each host's IP address to its own /etc/hosts file and make sure the hosts line in /etc/nsswitch.conf has files as its first entry (see the example below).
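
For example (hypothetical host name and address; the exact files vary by distribution):

% grep myhost /etc/hosts
10.1.1.17   myhost.cluster.local   myhost
% grep ^hosts /etc/nsswitch.conf
hosts:      files dns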

Lots of single-processor jobs

Are you trying to use mpirun on a non-MPI build of NAMD? If this line, printed during startup:

Info: Based on Charm++/Converse 050612 for net-linux-icc

shows a "net-..." Charm++ platform, then you need to use charmrun to launch NAMD. If your queuing system only supports MPI, you can either recompile NAMD against your local MPI library or write a shell script to adapt the network version of NAMD to your queuing system, as sketched below. (A page on using NAMD with queuing systems would be nice.)
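
A minimal sketch of such a wrapper script for a PBS-style queue, assuming the scheduler lists the assigned hosts in $PBS_NODEFILE (paths and the config file name are placeholders):

#!/bin/sh
# build a Charm++ nodelist from the hosts assigned by the scheduler
NODELIST=$PWD/nodelist.$$
echo "group main" > $NODELIST
awk '{print "host", $1}' $PBS_NODEFILE >> $NODELIST
# run one NAMD process per assigned host entry
NP=`wc -l < $PBS_NODEFILE`
/full/path/to/charmrun +p$NP ++nodelist $NODELIST /full/path/to/namd2 myconfig.namd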

Runs really slow

Are you running out of physical memory? NAMD 2.5b2 has new pairlists, and memory usage for a 100,000-atom system with a 12 A cutoff can approach 300 MB and will grow with the cube of the cutoff. This extra memory is distributed across processors during a parallel run, but a single workstation may run out of physical memory with a large system. To avoid this, NAMD now provides a pairlistMinProcs config file option.
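
For example, to keep pairlists disabled on small runs (the threshold here is only an illustration):

# in the NAMD config file: only build pairlists when running on 8 or more processors
pairlistMinProcs    8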

Are all of your processes running on the same node rather than across your cluster? The ++local option to charmrun will result in this behavior (which is desirable and foolproof if you only have one node and are using +p1 or +p2, but probably isn't what you wanted with +p16). Check with the "top" command.

Do you have multiple parallel NAMD jobs running on the same nodes? Run one at a time, or run each job on only half of the nodes.

VMD cannot open binary pdb files

If you get the error "PDB file 'your_file_name' contains no atoms." it is possible that you have renamed a binary NAMD file with a ".pdb" suffix. VMD guesses the file format from the extension: the same binary file with a ".coor" suffix will open as a NAMD binary coordinate file (you must also specify the psf on the command line), while with a ".pdb" suffix it is parsed as a text PDB and produces the error above.
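
For example (file names are placeholders):

% vmd mysystem.psf mysim.restart.coor    # opens: the .coor suffix is read as a NAMD binary file
% vmd mysystem.psf mysim.restart.pdb     # the same binary data renamed to .pdb gives the error above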

Atoms moving too fast

This generally indicates that the numerical integration algorithm has become unstable, typically because the calculated forces are too large for the selected timestep. Normally it is the forces that are incorrect or unrealistic rather than the timestep.

Always inspect the affected atoms and their environment in VMD. Remember to subtract 1 from NAMD's atom ID to get VMD's 0-based atom index (or select by "serial" in VMD instead of "index", since serial is 1-based like NAMD's ID).

If this happens right away in the simulation:

  • Is your periodic cell large enough for your system? Remember to add some extra space beyond your system's bounding box, since atoms at the edge have finite radii and will clash with their wrapped periodic images if the cell is too tight.
  • Did you minimize to eliminate bad contacts before starting dynamics?
  • Try looking at your input psf and pdb files in VMD. Check for:
    • Atoms with uninitialized coordinates at (0,0,0), often hydrogen atoms, which are easy to spot by the very long bonds connecting them to more reasonable locations (a quick command-line check appears after this list).
    • Atoms with mismatched coordinates, caused by incorrect matches between atom names in the original PDB file and the topology file. These will typically have abnormally long bonds and will probably move several Angstroms at the start of minimization.

  • Try manually setting the margin parameter in the NAMD configuration file (start low, and increase it until the crash is avoided). This has been known to solve crashes on the first step.
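
To spot atoms left at (0,0,0), a quick check like the following can help (file name is a placeholder; assumes the standard PDB layout with x, y, z in columns 31-54):

$ awk 'substr($0,1,4)=="ATOM" && substr($0,31,8)+0==0 && substr($0,39,8)+0==0 && substr($0,47,8)+0==0' mysystem.pdb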

If this happens when continuing a simulation:

  • If you're running constant pressure, did you remember to use the extendedSystem parameter to load the .xsc file that corresponds to your restart coordinates? (See the snippet below.)
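
A typical continuation block in the NAMD config file looks something like this (file names are placeholders; bincoordinates/binvelocities assume binary restart files were written):

structure         mysystem.psf
coordinates       mysystem.pdb
bincoordinates    mysim.restart.coor
binvelocities     mysim.restart.vel
extendedSystem    mysim.restart.xsc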

It is possible to cause instability in an interactive (IMD) simulation by attempting to steer the simulation too enthusiastically. A gentle touch is required.

Constraint failure errors

"Constraint failure in RATTLE algorithm for atom ... simulation has become unstable" error messages are usually caused by the same factors as "Atoms moving too fast" above.

Bad global exclusion count errors

These errors generally indicate that two atoms that are (usually implicitly) excluded are on non-neighboring patches or are more than the cutoff distance apart. The situation is analogous for bond, angle, dihedral, or improper count errors.

This is often caused by similar input problems as in "Atoms moving too fast" above. In particular, atoms with uninitialized coordinates (0,0,0) may cause this error on the first timestep.

This will also happen if you have a periodic cell and the input coordinates to NAMD are wrapped on a per-atom basis so that there are bonds to hydrogen atoms extending across the cell (i.e., you load the psf and pdb in VMD and see very long bonds to hydrogens). You can work around this with the ancient "splitPatch position" option, which makes every atom its own hydrogen group (this option disables rigid bonds and hurts performance, so don't use it normally). It's better to just fix the input coordinates so that the bonds look normal in VMD.

Another possibility is that some atoms are specified more than once, for example when one residue ends up in two chains. This is very difficult to spot in VMD since the duplicate atoms have identical coordinates. Duplicate atoms can be found with the following script:

$ # for each coordinate field (columns 31-54), print every ATOM line in the pdb that contains it
$ while read i; do grep -F -e "$i" surf01_03.pdb; done < <( grep ATOM surf01_03.pdb | awk '{coord=substr($0,31,24); print coord}' ) > surf01_03.tmp

If surf01_03.tmp has more lines than there are ATOM lines in the original surf01_03.pdb, the structure contains duplicate atoms; the duplicated coordinates show up as repeated lines with different serial numbers, making them easy to spot.
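
An equivalent, simpler check (again assuming the x, y, z fields occupy PDB columns 31-54) prints only the duplicated coordinate fields:

$ grep ATOM surf01_03.pdb | cut -c31-54 | sort | uniq -d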

Bad global bond, angle, dihedral, etc. errors

These are usually caused by the same factors listed above for bad global exclusion count errors. Unlike exclusion count errors, however, they will not occur in serial simulations, and may remain hidden until a large enough parallel run is attempted.

If extraBonds is being used to add structural restraints, then these errors likely indicate that the feature is being misused to connect atoms that are much farther apart than a typical bond. All of the atoms in every bond, angle, dihedral, or improper must be contained within a 2 x 2 x 2 block of patches; otherwise, on a large enough run, this error will occur.

Stray PME grid charges detected

This means that an atom has somehow moved far enough in a cycle that it corresponds to PME grid cells that were not expected based on NAMD's spatial decomposition scheme.

This may happen if you set stepspercycle too high (20 is reasonable), and tends to happen before you start seeing margin violations reported (because margin violation warnings are based on different criteria).
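
In the config file this is just, for example:

stepspercycle    20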

This error cannot happen in serial runs. It may happen if the periodic cell shrinks too far (although NAMD should detect this condition and exit). This may also be a sign of instability (and PME was just the first thing to break).

Floating exceptions, NaN or 99999999.9999 for energies

Some platforms tolerate division by zero and call the result Inf (e.g., Solaris), some call it NaN (Not a Number) or NaNQ (quiet NaN, e.g., AIX), and some raise a floating point exception (e.g., Tru64).

This may indicate that some parameter is violating an assumption. For example, setting switchdist equal to cutoff will cause these errors. (If you don't want switching, just set switching off and all is well.)
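
A consistent set of nonbonded options looks something like this (the values are common examples, not requirements; switchdist must be less than cutoff):

switching       on
switchdist      10.0
cutoff          12.0
pairlistdist    13.5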

These can also be generated if your simulation has become unstable, often caused by similar input problems as in "Atoms moving too fast" above.

Not all atoms have unique coordinates warnings

The "Not all atoms have unique coordinates" warning dates from when atoms closer than the lower limit of the interpolation table were skipped and assumed to be excluded (rather than checking whether they actually were). The most likely reason for extra exclusions to be calculated was therefore that two atoms were right on top of each other, so this was a useful warning. However, these warnings were also generated routinely by non-physical methods such as locally enhanced sampling and alchemical free energy calculations, so the counts are now based on excluded atoms found in the pairlist.

For 1-4 modified exclusions, only excluded pairs within the cutoff distance are counted. Ditto for fully excluded pairs when full electrostatics is used (since a calculation is needed to correct PME for these pairs). However, when cutoff electrostatics is used then all fully excluded pairs in the pairlist are counted. Therefore, you can have exclusions counted twice if the distance between two of the closest periodic images is smaller than either a) cutoff plus the largest 1-4 length (~4A) or b) pairlistdist plus the largest 1-3 length (~3A).

If your cell is smaller than this and you get these warnings then NAMD is possibly ignoring nonbonded interactions between different images of the same molecule, which is not correct. Use a larger cell.

MStream checksum errors

Something is very wrong with your network or your NAMD binary.

Atoms collapsed into planes in dcd trajectory file

If contiguous sets of atoms are collapsed into the x=0, y=0, or z=0 planes for single frames in the output trajectory file, this means that random segments of the file have been overwritten with zeros. This has only been observed on the TeraGrid, and is thought to be due to problems in the filesystem. Please note any other instances of this problem here.

Core dumps during startup

This depends on when during startup the crash happens. NAMD should exit gracefully, but input isn't always checked too carefully. If you manage to work around this, please add the problem and solution here.

A repeatable core dump problem on launch (and workaround) (network versions)

If you are adding options to ssh to get it to launch without password prompts, you can overflow an internal buffer and get mysterious seg faults on launch. For example:

% charmrun ++remote-shell "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o CheckHostIP=no -o BatchMode=yes" `which namd2` alanin.namd

Warning: Permanently added 'localhost' (RSA) to the list of known hosts.

Segmentation fault

This happens whether you use ++shell in the .nodelist file or add the option to the charmrun command line.

As a workaround, write a small script called "myssh":

% cat myssh
#!/bin/sh
# pass all charmrun-supplied arguments straight through to ssh
exec ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o CheckHostIP=no -o BatchMode=yes "$@"

Make it executable (chmod +x myssh) and now use % charmrun ++remote-shell myssh ....

Another workaround is to download the latest Charm++ from http://charm.cs.uiuc.edu; the bug was fixed in the development version of Charm++.

Random core dumps

This is probably NAMD's fault, and there's not much you can do, unless it's really due to:

  • bad compiler
  • bad memory
  • bad PCI bus
  • bad network hardware
  • bad power supply
  • bad building power
  • bad network drivers
  • bad Linux kernel
  • overclocked CPU
  • overheating CPU

If NAMD binaries and simulations that used to work start crashing one day, hardware problems are certainly a possibility. On a cluster the best strategy to diagnose hardware faults is to try removing nodes to see if a particular node is responsible for the crashes. To be efficient use a binary search, for example:

  1. crashes on entire cluster (nodes 0-31)
  2. runs on nodes 0-15, crashes on nodes 16-31
  3. runs on nodes 0-15,16-23, crashes on nodes 0-15,24-31
  4. crashes on nodes 0-23,24-27, runs on nodes 0-23,28-31
  5. runs on nodes 0-23,24-25,28-31, crashes on nodes 0-23,26-27,28-31
  6. runs on nodes 0-25,26,28-31, crashes on nodes 0-25,27,28-31
  7. therefore node 27 likely has hardware issues

If you're running a network version, try adding +netpoll to the namd2 command line. Add a note here with the NAMD version and platform and your OS version if this helped.

Repeatable core dumps

NAMD should exit gracefully, but sometimes we miss something. These should absolutely be reported since a repeatable bug is a fixable bug, and the same bug may be causing random crashes for less fortunate users.

A (somewhat) Repeatable Core Dump

I'm running NAMD on an 8-way ia32 smp box using MPI. About 1/2 of the time, the apoa1 test case runs to completion without issues, but in the other cases, I get a core dump. The most recent failure occurred after time step 241. In the coredumps I've looked at, the program received a SIGSEGV at one of two locations:

charmrun +p8 ++netpoll ./namd2 apoa1/apoa1.namd 
 ..
gdb namd2 core
...
Core was generated by `/home/tim/hde1/home/tim/test/NAMD/NAMD_2.5_Source/Linux-i686-MPI/./namd2 apoa1/'.
Program terminated with signal 11, Segmentation fault.
#0  0x08210cca in chunk_free (ar_ptr=0x838dda0, p=0x989d698)
    at memory-gnu.c:3268
#1  0x08210bda in mm_free (mem=0x4069) at memory-gnu.c:3191
#2  0x0821215d in free (mem=0x989d6a0) at memory.c:203
#3  0x0828508a in MPID_SHMEM_Eagern_unxrecv_start ()
#4  0x08278218 in MPID_IrecvContig ()
#5  0x0827a1cc in MPID_IrecvDatatype ()
#6  0x0827a0ad in MPID_RecvDatatype ()
#7  0x08260e16 in PMPI_Recv ()
#8  0x08250aca in PumpMsgs () at machine.c:418
#9  0x08250cbe in CmiGetNonLocal () at machine.c:616
#10 0x0825216c in CsdNextMessage (s=0xbfd94890) at convcore.c:967
#11 0x0825221f in CsdScheduleForever () at convcore.c:1024
#12 0x082521c7 in CsdScheduler (maxmsgs=16489) at convcore.c:990
#13 0x080c2392 in BackEnd::init (argc=2, argv=0xbfd94a84) at src/BackEnd.C:94
#14 0x080bf451 in main (argc=-1076278652, argv=0xbfd94a84) at src/mainfunc.C:34
...


The other class of stack traceback is more common, and it looks like:

(gdb) where
 #0  chunk_alloc (ar_ptr=0x838dda0, nb=9688) at memory-gnu.c:2966
 #1  0x0821026b in mm_malloc (bytes=1452254461) at memory-gnu.c:2848
 #2  0x08212124 in malloc (size=9684) at memory.c:196
 #3  0x08212319 in malloc_nomigrate (size=9684) at memory.c:265
 #4  0x08252bfc in CmiAlloc (size=9676) at convcore.c:1566
 #5  0x08250aaf in PumpMsgs () at machine.c:417
 #6  0x08250cbe in CmiGetNonLocal () at machine.c:616
 #7  0x0825216c in CsdNextMessage (s=0xbfeb9160) at convcore.c:967
 #8  0x0825221f in CsdScheduleForever () at convcore.c:1024
 #9  0x082521c7 in CsdScheduler (maxmsgs=1452254461) at convcore.c:990
 #10 0x080c2392 in BackEnd::init (argc=2, argv=0xbfeb9354) at src/BackEnd.C:94
 #11 0x080bf451 in main (argc=-1075080364, argv=0xbfeb9354) at src/mainfunc.C:34

I believe the problem is that the free-memory list is corrupt, causing the allocator and freeer(is that a word?) to abort.

Is this a known problem? I'm going to assume it isn't and will try to use the memory-paranoid.c routines to help debug this problem. Any comments are appreciated.

Tim Sirianni
+1 408 342 0339    tim <at> scalex86 <dot> org

Update: 7/9/2005

We tracked this down to a bug in MPICH 1.2.6 (it is also present in 1.2.7). MPICH has a use-after-free bug that was causing the problem. Here's part of the email I sent to the MPICH bugs list:

In mpid/ch_shmem/shmemneager.c, in function MPID_SHMEM_Eagern_save(), change these lines from:

#ifdef LEAVE_IN_SHARED_MEM
    rhandle->start        = address;
#else
    if (pkt->len > 0) {
        rhandle->start    = (void *)MALLOC( pkt->len );
        rhandle->is_complete  = 1;

to:

#ifdef LEAVE_IN_SHARED_MEM
    rhandle->start        = address;
#else
    if (len > 0) {  
        rhandle->start    = (void *)MALLOC( len );   
        rhandle->is_complete  = 1;


Note that "pkt" was potentially freed a few lines above, and so "pkt->len" is no longer a valid reference. The above changes two places where "pkt->len" was incorrectly being used, replacing it with "len" instead. "len" is set to "pkt->len" before "pkt" is freed.

With the above fix, I'm not seeing any SIGSEGV's in NAMD any more!

Multiplicity of Parameters

If an error of the form:

Multiplicity of Paramters for diehedral bond CA CA CA CA of 1 exceeded

is received, it is most likely due to a mistake in the topology file for your residue of interest. In one case, a bond that was declared both as a single bond and as a double bond caused the above error.

"FATAL ERROR: Multiplicity of Paramters for improper bond CC CT1 OC OC of 1 exceeded"

This may also be the result of applying a PATCH more than once during psfgen. For example, due to a cut-and-paste error I applied 'first NTER; last CTER' in the segment definition and subsequently 'PATCH NTER P1:1' alongside the 'PATCH DISU' commands in the psfgen script:

topology bla.top
...
segment P1 {
  pdb tmp_P1.pdb
  first NTER
  last CTER
  auto angles dihedrals
}
...
PATCH NTER P1:1
...
coordpdb tmp_P1.pdb P1
...
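
A corrected sketch (same hypothetical file and segment names), dropping the redundant terminal patch since "first NTER" already applies it:

topology bla.top
segment P1 {
  pdb tmp_P1.pdb
  first NTER
  last CTER
  auto angles dihedrals
}
# no separate "PATCH NTER P1:1" here; the segment statement already applied NTER
coordpdb tmp_P1.pdb P1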

A repeatable core dump (Solaris MPI)

I'm running NAMD on a 48-way Sparc SMP Solaris machine with MPI. The code crashes during startup when trying the alanin test case:

Info: NAMD 2.6b1 for Solaris-Sparc-MPI
...
...

Info: 42 IMPROPER
Info: 21 VDW
Info: 0 VDW_PAIRS
Info: ****************************
Info: STRUCTURE SUMMARY:
Info: 66 ATOMS
Info: 65 BONDS
Info: 96 ANGLES
Info: 31 DIHEDRALS
Info: 32 IMPROPERS
Info: 0 EXCLUSIONS
Info: 195 DEGREES OF FREEDOM
Info: 55 HYDROGEN GROUPS
Info: TOTAL MASS = 783.886 amu
Info: TOTAL CHARGE = 8.19564e-08 e
Info: *****************************
Info: Entering startup phase 0 with 7160 kB of memory in use.
Job cre.9750 on frontend: received signal SEGV.
 

Anyone know how to solve it?

FATAL ERROR: child atom x bonded only to child H atoms

One cause of this error (other than having a bad input structure) is that NAMD internally assumes that any atom with a mass below 3.5 is a hydrogen atom, which can be wrong. I ran into this problem when running a simulation using AMBER-formatted input files with dummy atoms of mass 3.

In NAMD 2.6, in Molecule.C, line 5539, changing:

if (atoms[i].mass <=3.5) {

to:

if (atoms[i].mass <=2.5) {

allowed my simulation to proceed. If you have non-hydrogen atoms with masses lower than 2.5, you may need to use an even lower value so that they are properly assigned. I don't know what happens if you use a value less than 1.0 here, but I suspect NAMD may misassign real hydrogen atoms, which would probably break the simulation.