namd and node freeze

From: Brady Chang (bchang_at_atipa.com)
Date: Tue Apr 05 2005 - 03:39:26 CDT

Next message: Vani Krishna: "FEP output in NAMD"
Previous message: Nicholas M Glykos: "Re: RATTLE and DNA"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]

Hi all, I was wondering whether anybody has experienced or heard of the problem I'm running into.
Platform info:
Dual Intel Xeon with Intel Corp. 82540EM Gigabit Ethernet Controller (rev 2).
Ethernet controller: Intel Corp. 82544GC Gigabit Ethernet Controller (LOM) (rev 2).
Rocks 3.3
NAMD 2.5 with Charm++ 5.8, compiled from source.

The compute nodes are on the 64-bit NIC (82544GC) and the head node is on the 32-bit NIC (82540EM).

I ran a NAMD job with 1000000 steps on a 16-node cluster.
I can provide the input file upon request.
A node, c0-7, went down after 7 hours of running the job.
The message on the head node is too generic:
Charmrun: error on request socket--
Socket closed before recv.

When I started the job again with the downed node excluded, it ran for 14 hours or so and then another node, c0-2, went down.

Then I can run the job past the crash point (excluding the downed nodes).
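
For anyone trying to reproduce this, here is a minimal sketch of how the exclusion could be scripted. It is not the exact procedure used above. It assumes the compute nodes are named c0-0 through c0-15 (only c0-2 and c0-7 are named in this post) and that charmrun reads a nodelist file of the usual "group main" / "host <name>" form. The helper name build_nodelist is made up for illustration.

```python
def build_nodelist(all_nodes, excluded):
    """Return charmrun nodelist text listing every node not in `excluded`."""
    skip = set(excluded)
    lines = ["group main"]
    lines += [f"host {name}" for name in all_nodes if name not in skip]
    return "\n".join(lines) + "\n"


# Assumption: 16 compute nodes named c0-0 .. c0-15.
all_nodes = [f"c0-{i}" for i in range(16)]

# Leave out the two nodes that went down in the earlier runs.
nodelist_text = build_nodelist(all_nodes, excluded=["c0-7", "c0-2"])
print(nodelist_text, end="")
```

The printed text could be saved to a file and passed to charmrun through its ++nodelist option, with the +p count lowered to match the remaining nodes.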

Currently, I'm running the job on nodes c0-2 and c0-7 (the ones that crashed before), and it has been running for over 12 hours.

If the 2-node charmrun job runs for 24 hours without a crash, I'm thinking the problem might be related to network load: messaging across 16 nodes (32p) generates more traffic than across 2 nodes (4p).

Thank you in advance for your help.

Brady

This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 15:40:38 CST