Re: charmrun socket error

From: Chris Harrison (charris5_at_gmail.com)
Date: Sat Oct 11 2008 - 17:02:49 CDT

David,

What is the topology of the cluster? How many cores per node, etc.? Are
you using all cores within a node to calculate chares/patches?

Chris

On Thu, Oct 9, 2008 at 1:49 PM, David A. Horita <dhorita_at_wfubmc.edu> wrote:

>
> Hi,
> I've recently run into a problem running NAMD_2.6_Linux-amd64 on our
> cluster (using pbs/qsub). I've been getting:
>
> ------------- Processor 5 Exiting: Caught Signal ------------
> Signal: bus error
> Suggestion: Check for misaligned reads or writes to memory.
> Charmrun: error on request socket--
> Socket closed before recv.
>
>
> when I run a big (120,000-atom) job on more than one node.
> The job runs fine on one processor, and also runs on multiple processors
> on the same node, but crashes when running over multiple
> nodes/processors. Strangely, a slightly smaller system (96,000 atoms)
> runs fine distributed over 40 CPUs, and the larger system is the same as
> that one, just with more water and ions (PBC water box). My NAMD config
> file is essentially the same as well (number of steps, pme, gridsize=1,
> etc.).
> The log file doesn't help much; it gets to:
>
> Info: Entering startup phase 7 with 111708 kB of memory in use.
> Info: CREATING 15836 COMPUTE OBJECTS
> Info: NONBONDED TABLE R-SQUARED SPACING: 0.0625
> Info: NONBONDED TABLE SIZE: 769 POINTS
> Info: Entering startup phase 8 with 115568 kB of memory in use.
> Info: Finished startup with 115568 kB of memory in use.
> TCL: Running for 1000 steps
>
> and stops. If I use a lot of nodes, I get the same error with more than
> one Processor Exiting:
>
>
> ------------- Processor 10 Exiting: Caught Signal ------------
> Signal: bus error
> Suggestion: Check for misaligned reads or writes to memory.
> ------------- Processor 9 Exiting: Caught Signal ------------
> Signal: bus error
> Suggestion: Check for misaligned reads or writes to memory.
> ------------- Processor 31 Exiting: Caught Signal ------------
> Signal: bus error
> Suggestion: Check for misaligned reads or writes to memory.
> ------------- Processor 7 Exiting: Caught Signal ------------
> Signal: bus error
> Suggestion: Check for misaligned reads or writes to memory.
> Charmrun: error on request socket--
> Socket closed before recv.
>
> Any ideas on what causes this? I can run the NAMD_2.6_Linux-amd64-TCP
> version, which doesn't crash, but it only pushes the CPUs to 30-50%.
>
> Thanks,
>
> Dave
>
>

-- 
Chris Harrison, Ph.D.
Theoretical and Computational Biophysics Group
NIH Resource for Macromolecular Modeling and Bioinformatics
Beckman Institute for Advanced Science and Technology
University of Illinois, 405 N. Mathews Ave., Urbana, IL 61801
char_at_ks.uiuc.edu                            Voice: 217-244-1733
http://www.ks.uiuc.edu/~char               Fax: 217-244-6078
