namd and node freeze

From: Brady Chang (bchang_at_atipa.com)
Date: Tue Apr 05 2005 - 03:39:26 CDT

Hi all, I was wondering if anybody has experienced or heard of the problem I'm running into.
Platform info:
Dual Intel Xeon with Intel Corp. 82540EM Gigabit Ethernet Controller (rev 2).
Ethernet controller: Intel Corp. 82544GC Gigabit Ethernet Controller (LOM) (rev 2).
Rocks 3.3
NAMD 2.5 and Charm++ 5.8, both compiled from source.
 
The compute nodes use the 64-bit 82544GC NIC and the head node uses the 32-bit 82540EM NIC.
 
I ran a NAMD job with 1,000,000 steps on the 16-node cluster (I can provide the input file upon request).
After about 7 hours of running the job, node c0-7 goes down.
The message on the head node is too generic to be useful:
Charmrun: error on request socket--
Socket closed before recv.

 
When I start the job again, excluding the downed node, it runs for 14 hours or so and then another node, c0-2, goes down.
 
After that, I can run the job past the crash point (still excluding the downed nodes).
 
Currently, I'm running the job on just nodes c0-2 and c0-7 (the two that crashed before), and it has been running for over 12 hours.
 
If the 2-node charmrun job runs for 24 hours without a crash, I think the problem might be related to network load: messaging among 16 nodes (32 processors) generates more traffic than messaging among 2 nodes (4 processors).
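
To put a rough number on the network-load idea, here is a small sketch that counts how many processor pairs would have to talk over the wire in each run. It assumes 2 processors per node and a worst case where every processor may message every other one; the function name is just for illustration, and this is an upper bound on communicating pairs, not a measurement of real NAMD traffic.

```python
def inter_node_pairs(nodes: int, cpus_per_node: int = 2) -> int:
    """Count processor pairs that sit on different nodes.

    Worst-case assumption: any processor may exchange messages with any
    other. Pairs on the same node are excluded because they do not need
    the Ethernet link.
    """
    total_cpus = nodes * cpus_per_node
    all_pairs = total_cpus * (total_cpus - 1) // 2
    same_node_pairs = nodes * (cpus_per_node * (cpus_per_node - 1) // 2)
    return all_pairs - same_node_pairs


full_run = inter_node_pairs(16)   # 16 nodes, 32 processors
small_run = inter_node_pairs(2)   # 2 nodes, 4 processors
print(full_run, small_run)        # 480 4
```

Under those assumptions the 16-node run has 480 cross-node pairs against 4 for the 2-node run, so the two cases stress the NICs very differently.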
 
Thank you in advance for your help.
 
Brady
 
