charmrun socket error

From: David A. Horita (dhorita_at_wfubmc.edu)
Date: Thu Oct 09 2008 - 13:49:21 CDT

Hi,
I've recently run into a problem running NAMD_2.6_Linux-amd64 on our cluster (using PBS/qsub). I've
been getting

     ------------- Processor 5 Exiting: Caught Signal ------------
     Signal: bus error
     Suggestion: Check for misaligned reads or writes to memory.
     Charmrun: error on request socket--
     Socket closed before recv.

when I run a big (120,000-atom) job on more than one node.
The job runs fine on one processor, and also runs on multiple processors on the same node, but crashes
when running over multiple nodes/processors. Strangely, a slightly smaller system (96,000 atoms) runs
fine distributed over 40 CPUs, and the failing system is the same as that one, just with more water and
ions (PBC water box). My NAMD config file is essentially the same as well (number of steps, PME,
gridsize=1, etc.).
The log file doesn't help much; it gets to:

     Info: Entering startup phase 7 with 111708 kB of memory in use.
     Info: CREATING 15836 COMPUTE OBJECTS
     Info: NONBONDED TABLE R-SQUARED SPACING: 0.0625
     Info: NONBONDED TABLE SIZE: 769 POINTS
     Info: Entering startup phase 8 with 115568 kB of memory in use.
     Info: Finished startup with 115568 kB of memory in use.
     TCL: Running for 1000 steps

and stops. If I use a lot of nodes, I get the same error with more than one Processor Exiting:

     ------------- Processor 10 Exiting: Caught Signal ------------
     Signal: bus error
     Suggestion: Check for misaligned reads or writes to memory.
     ------------- Processor 9 Exiting: Caught Signal ------------
     Signal: bus error
     Suggestion: Check for misaligned reads or writes to memory.
     ------------- Processor 31 Exiting: Caught Signal ------------
     Signal: bus error
     Suggestion: Check for misaligned reads or writes to memory.
     ------------- Processor 7 Exiting: Caught Signal ------------
     Signal: bus error
     Suggestion: Check for misaligned reads or writes to memory.
     Charmrun: error on request socket--
     Socket closed before recv.

Any ideas on what causes this? The NAMD_2.6_Linux-amd64-TCP version runs without crashing, but it only
pushes the CPUs to 30-50% utilization.

Thanks,

Dave
