From: Brian Bennion (brian_at_youkai.llnl.gov)
Date: Fri Mar 25 2005 - 16:25:48 CST
interesting problem...
So the node crashed while running namd?
Can you still ping the problem node after namd2 crashes,
or do you have to power-cycle the node after it crashes?
Is this cluster new or has this behavior started recently?
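
To make the ping question quick to answer for every node at once, here is a minimal sketch. It assumes the usual charmrun nodelist layout (a "group main" line followed by "host <name>" lines) and the Linux iputils ping flags (-c for count, -W for timeout in seconds). The function names parse_nodelist, is_reachable, and check_nodes are made up for illustration and are not part of NAMD or charmrun.

```python
import subprocess


def parse_nodelist(text):
    """Return host names from charmrun nodelist text.

    Assumes lines of the form 'host <name> [options]'; any other
    line (such as 'group main' or a blank line) is ignored.
    """
    hosts = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "host":
            hosts.append(parts[1])
    return hosts


def is_reachable(host, timeout_s=2):
    """Send one ICMP echo request with the system ping (Linux flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def check_nodes(nodelist_text, probe=is_reachable):
    """Map each host in the nodelist to True (answers) or False (no answer)."""
    return {host: probe(host) for host in parse_nodelist(nodelist_text)}


# Example use on the head node (reads the same ./.nodelist passed to charmrun):
#   with open(".nodelist") as f:
#       for host, up in check_nodes(f.read()).items():
#           print(host, "up" if up else "DOWN")
```

A node that still answers ping after the job dies points more toward a lost connection or a killed process; a node that does not answer suggests the machine itself hung or rebooted.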
Regards
Brian
On Fri, 25 Mar 2005, brady chang wrote:
> Hi all, I'm having a very peculiar problem with NAMD.
>
> I was wondering if anybody has seen this?
>
> Platform Rocks 3.3:
> dual xeon; ASUS PRDL533 MOBO.
>
> command:
> #!/bin/csh -f
>
> setenv CONV_RSH ssh
>
> ~~/apps/NAMD/NAMD_2.5_Linux-i686-TCP/charmrun
> ~~/apps/NAMD/NAMD_2.5_Linux-i686-TCP/namd2 +p26 ++verbose ++nodelist
> ./.nodelist md_1ns.inp >logmd
>
> after running for ~12 hours I get
>
> Charmrun: error on request socket--
> Socket closed before recv.
>
> and it brought the node down.
>
> I modified my .nodelist to exclude the downed node.
> Then, after running for ~4 hours, I got the same error and it brought down
> another node.
> So I'm running it again excluding the downed nodes.
>
> Temperature is normal, load is average. I'm not seeing anything that could
> cause the node to go down.
>
************************************************
Brian Bennion, Ph.D.
Bioscience Directorate
Lawrence Livermore National Laboratory
P.O. Box 808, L-448 bennion1_at_llnl.gov
7000 East Avenue phone: (925) 422-5722
Livermore, CA 94550 fax: (925) 424-6605
************************************************