RE: cluster node freezes while running namd 2.5/2.5b1

From: EPF (Esben Peter Friis) (epf_at_novozymes.com)
Date: Sun Oct 19 2003 - 14:00:00 CDT

Hi Richard

Two suggestions:

1) Some of these Intel NIC controllers have problems with high loads under Linux. Try putting a separate network card in a couple of the boxes and configure them to use that instead. 3Com cards are known to work well, and cards based on the Realtek 8139 chipset also work fine with Linux (although I've heard that their performance is poorer).

2) The AMD 760MPX chipset (apparently used by your motherboard) has a serious flaw: the 32-bit PCI bus cannot handle high loads (above roughly 30 MB/s). In principle this should only be a problem with gigabit NICs, but if you can, put all your PCI cards (including the network cards mentioned above) in the 64-bit slots; 32-bit cards will usually fit there without problems.
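As a rough check of why point 2 should only matter for gigabit NICs, here is a minimal Python sketch of the line-rate arithmetic. The 30 MB/s figure is the approximate threshold mentioned above; the function and constant names are made up for illustration, and the numbers are theoretical peaks, not measurements.

```python
# Approximate load above which the 32-bit PCI bus is said to misbehave
# (figure taken from the text above; treat it as a rough estimate).
PCI_LIMIT_MB_S = 30.0


def line_rate_mb_s(megabits_per_second):
    """Convert a nominal link speed in Mbit/s to MB/s (8 bits per byte)."""
    return megabits_per_second / 8.0


def can_exceed_limit(megabits_per_second, limit_mb_s=PCI_LIMIT_MB_S):
    """Return True if the link's theoretical one-way peak is above the limit."""
    return line_rate_mb_s(megabits_per_second) > limit_mb_s


for name, speed in [("10/100 Fast Ethernet", 100), ("Gigabit Ethernet", 1000)]:
    print(f"{name}: {line_rate_mb_s(speed):.1f} MB/s peak, "
          f"can exceed limit: {can_exceed_limit(speed)}")
```

A 100 Mbit/s link peaks at 12.5 MB/s in one direction (25 MB/s if both directions of a full-duplex link are counted), which stays below the threshold, while a gigabit link can in theory reach 125 MB/s.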

We have a 24-node cluster (48 x AthlonMP 2000+, Asus A7M266-D motherboards, Intel Gigabit NICs, Netgear 24-port gigabit switch). NAMD 2.5 works fine here, whereas older versions like 2.1 nearly always crash after fewer than 100000 simulation steps.

Best regards

Esben

-----Original Message-----
From: Richard Brown
To: namd-l_at_ks.uiuc.edu
Sent: 19-10-03 05:48
Subject: namd-l: cluster node freezes while running namd 2.5/2.5b1

I have been trying to figure this out for the past two
months with no luck.

I have an 8-node PC cluster that consists of 16 athlon
mp2200+, msi k7d master-l mb, intel i82557/i82558
10/100 on-board lan, 500mb kingston ddr266 pc2100
unbuffered, 3com superstack III baseline 24 port
10/100 switch.

The cluster was built using oscar2.1/redhat7.3 w/ the
kernel update 2.4.20-20. namd used includes 2.5b1 and
the latest 2.5, both linux binary distributions and
source code builds. The simulation tested is the apoa1
benchmark example.

namd/apoa1 only runs w/o problems on a single cluster
node, either with one or two cpus. Every time it runs
on two or more nodes, either using one or two cpus
from each node, namd/apoa1 stops somewhere in the
middle of run. One of the nodes freezes and does not
respond to ping, ssh or the directly attached
keyboard. Most of the time there were no error
messages. A few times I received an APIC error or a socket
receive failure. I tried plugging a ps/2 mouse into
the nodes, as some people suggested for a bug in the
motherboard, but it did not help.

I don't know how to proceed from here. Any suggestions
would be appreciated.

Thanks,
Richard


This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 15:37:04 CST