CHARM/NAMD network problems on amd64 Clustermatic

From: Rene Salmon (rsalmon_at_tulane.edu)
Date: Wed May 11 2005 - 13:07:25 CDT

Hi List,

We are having some strange network problems with CHARM/NAMD on an AMD64
Clustermatic 5 cluster.

On nodes that are not running NAMD jobs, we see this for the network stats:

># bpsh 10 netstat -i
Kernel Interface table
Iface   MTU Met       RX-OK    RX-ERR RX-DRP RX-OVR       TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0   1500   0   221380022         0      0      0   238341193      0      0      0 BMRU
eth1   1500   0           0         0      0      0           0      0      0      0 BMU
lo    16436   0          18         0      0      0          18      0      0      0 LRU

As you can see, there are no errors or dropped packets. But on nodes that
are running NAMD jobs we get this:

># bpsh 4 netstat -i
Kernel Interface table
Iface   MTU Met       RX-OK    RX-ERR RX-DRP RX-OVR       TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0   1500   0 23182578986 637615860      0      0 23839814024      1      0      0 BMRU
eth1   1500   0           0         0      0      0           0      0      0      0 BMU
lo    16436   0  2271084541         0      0      0  2271084541      0      0      0 LRU

This shows lots of errors and dropped packets. Is this normal? Somehow
this is slowing the network down and causing NFS problems.
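
To put a number on "lots", the receive error rate can be estimated from the
raw interface counters. Below is a minimal Python sketch, assuming the node
exposes the standard Linux /proc/net/dev file (for example via something like
`bpsh 4 cat /proc/net/dev`, which was not run here and is only an
assumption). The function names parse_proc_net_dev and rx_error_ratio are
made up for illustration, and errors divided by packets is only a rough
indicator, since whether errored frames are also counted as packets depends
on the driver.

```python
import os

# Column order of /proc/net/dev after the "iface:" prefix:
# eight receive counters followed by eight transmit counters.
RX_FIELDS = ("bytes", "packets", "errs", "drop",
             "fifo", "frame", "compressed", "multicast")
TX_FIELDS = ("bytes", "packets", "errs", "drop",
             "fifo", "colls", "carrier", "compressed")


def parse_proc_net_dev(text):
    """Parse the text of /proc/net/dev into {iface: {"rx": {...}, "tx": {...}}}."""
    stats = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 16:
            continue  # not a counter line in the expected layout
        values = [int(v) for v in fields[:16]]
        stats[name.strip()] = {
            "rx": dict(zip(RX_FIELDS, values[:8])),
            "tx": dict(zip(TX_FIELDS, values[8:])),
        }
    return stats


def rx_error_ratio(counters):
    """Receive errors per received packet (0.0 if nothing was received)."""
    packets = counters["rx"]["packets"]
    if packets == 0:
        return 0.0
    return counters["rx"]["errs"] / packets


if __name__ == "__main__":
    # Only meaningful on a Linux host; silently does nothing elsewhere.
    if os.path.exists("/proc/net/dev"):
        with open("/proc/net/dev") as fh:
            for iface, counters in parse_proc_net_dev(fh.read()).items():
                print(iface, "rx error ratio:", rx_error_ratio(counters))
```

If the eth0 line for node 4 is read as 23182578986 packets received OK and
637615860 receive errors, this works out to an error ratio of roughly 2.75%.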

Any ideas?

Thank you
Rene
