From: Brady Chang (bchang_at_atipa.com)
Date: Tue Apr 05 2005 - 03:39:26 CDT
Hi all, I was wondering whether anybody has experienced or heard of the problem I'm running into:
Dual Intel Xeon with Intel Corp. 82540EM Gigabit Ethernet Controller (rev 2).
Ethernet controller: Intel Corp. 82544GC Gigabit Ethernet Controller (LOM) (rev 2).
NAMD 2.5 and Charm++ 5.8, compiled from source.
The compute nodes use the 64-bit 82544GC NIC and the head node uses the 32-bit 82540EM NIC.
I ran a NAMD job with 1000000 steps.
I can provide the input file upon request.
The job ran on a 16-node cluster.
One node, c0-7, goes down after 7 hours of running the job.
The message on the head node is too generic:
Charmrun: error on request socket--
Socket closed before recv.
If I start the job again, excluding the downed node, it runs for 14 hours or so and then another node, c0-2, goes down.
After that I can run the job past the crash point (excluding the downed nodes).
Currently, I'm running the job on nodes c0-2 and c0-7 only (the ones that crashed before), and it has been running for over 12 hours.
If this 2-node charmrun job runs for 24 hours without a crash, I think the problem might be related to network load: messaging across 16 nodes (32p) generates more traffic than across 2 nodes (4p).
Thank you in advance for your help.