Re: socket closed error

From: Brian Bennion (brian_at_youkai.llnl.gov)
Date: Fri Mar 25 2005 - 16:25:48 CST

Interesting problem...
So the node crashed while running NAMD?
Can you still ping the problem node after namd2 crashes,
or do you have to power-cycle the node after it crashes?
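
A quick way to answer the ping question for every node at once might look like the sketch below. It assumes the nodelist uses the usual charmrun layout ("group main" followed by "host <name>" lines) and that ping accepts the Linux iputils flags -c (count) and -W (timeout in seconds). The function names list_hosts and check_hosts are made up for illustration.

```shell
#!/bin/sh
# Sketch: report which hosts in a charmrun nodelist still answer ping.
# Assumption: nodelist lines look like "host <name> [options]".

# Print the host names listed in the nodelist file given as $1.
list_hosts() {
    awk '$1 == "host" { print $2 }' "$1"
}

# Ping each host once with a 2 second timeout and print up or DOWN.
# Assumption: Linux iputils ping (-c count, -W timeout).
check_hosts() {
    for h in $(list_hosts "$1"); do
        if ping -c 1 -W 2 "$h" >/dev/null 2>&1; then
            echo "$h up"
        else
            echo "$h DOWN"
        fi
    done
}
```

Running "check_hosts ./.nodelist" right after namd2 dies would show whether the node is merely unresponsive to charmrun or is off the network entirely.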

Is this cluster new or has this behavior started recently?

Regards
Brian

On Fri, 25 Mar 2005, brady chang wrote:

> Hi all, I'm having a very peculiar problem with NAMD.
>
> I was wondering if anybody has seen this?
>
> Platform: Rocks 3.3,
> dual Xeon, ASUS PRDL533 motherboard.
>
> command:
> #!/bin/csh -f
>
> setenv CONV_RSH ssh
>
> ~~/apps/NAMD/NAMD_2.5_Linux-i686-TCP/charmrun
> ~~/apps/NAMD/NAMD_2.5_Linux-i686-TCP/namd2 +p26 ++verbose ++nodelist
> ./.nodelist md_1ns.inp >logmd
>
> After running for ~12 hours I get
>
> Charmrun: error on request socket--
> Socket closed before recv.
>
> and it brought the node down.
>
> I modified the command to exclude the downed node from my .nodelist.
> Then, after running for ~4 hours, I got the same error and it brought down
> another node.
> So I'm running it again, excluding the downed nodes.
>
> Temperature is normal and load is average. I'm not seeing anything that could
> cause the node to go down.
>
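
For the rerun that excludes downed nodes, the nodelist could be filtered rather than edited by hand. This is only a sketch under the same assumption about the nodelist layout ("host <name>" lines); exclude_hosts is a hypothetical helper name and the host names in the usage line are placeholders.

```shell
#!/bin/sh
# Sketch: print a nodelist with the named hosts removed.
# Usage: exclude_hosts NODELIST HOST...
exclude_hosts() {
    nodelist=$1
    shift
    # The hosts to drop are passed to awk as one space-separated string.
    awk -v drop="$*" '
        BEGIN {
            n = split(drop, d, " ")
            for (i = 1; i <= n; i++) skip[d[i]] = 1
        }
        $1 == "host" && ($2 in skip) { next }  # omit the downed hosts
        { print }                              # keep every other line
    ' "$nodelist"
}
```

For example, "exclude_hosts ./.nodelist compute-0-3 compute-0-7 > nodelist.new" writes a copy without those two hosts, which can then be passed to ++nodelist.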

************************************************
  Brian Bennion, Ph.D.
  Bioscience Directorate
  Lawrence Livermore National Laboratory
  P.O. Box 808, L-448    bennion1_at_llnl.gov
  7000 East Avenue       phone: (925) 422-5722
  Livermore, CA 94550    fax: (925) 424-6605
************************************************
