LDB problem on Dual Athlon64 X2

From: Sebastian Wasilewski (Sebastian.Wasilewski_at_fizyka.umk.pl)
Date: Wed May 17 2006 - 17:07:17 CDT

Dear all,

I have a serious problem, probably with the load balancer, in NAMD
2.6b1 (TCP and UDP). It runs on 8 workstations based on the dual-core
Athlon64 X2 under Fedora Core 4. NAMD jobs started on that "cluster"
crash after a few minutes of computing. I have tried to debug the NAMD
LDB, but without success. For example, when I run my job using the
command:

charmrun ++p 14 ++nodelist ~/.nodelist $NAMDBIN/namd2 +LBNoBackground
+LBSameCpus +LBDebug 10 +giga +LBUseCpuTime INFILE_1 >debug.out

the job crashes, printing these lines at the end of the output:

(...)
WRITING COORDINATES TO DCD FILE AT STEP 16000
TIMING: 16000 CPU: 136.965, 0.00905656/step Wall: 219.452,
0.0134323/step, 7.40269 hours remaining, 16744 kB of memory in use.
ENERGY: 16000 1549.7960 2816.7548 2201.7645
178.1802 -10088.7153 -1124.5583 128.0277
0.0000 3638.1699 -700.5804 214.1299 -676.0240
      -675.6231 215.3421

[CentralLB] Load balancing step 11 starting at 6810.307654 in PE0
[0] n_obj:388 migratable:368 ncom:0
LDB: LOAD: AVG 0.630876 MAX 0.877152 MSGS: TOTAL 88 MAXC 11 MAXP 7 None
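
As a rough aid for reading this output, here is a minimal sketch that
pulls the load summary out of a log such as debug.out. The function
name ldb_summary is hypothetical, and it assumes that AVG and MAX on
the "LDB: LOAD:" line are the average and maximum per-processor load.

```shell
# Print the average load, maximum load, and max/avg ratio from each
# "LDB: LOAD:" summary line in a NAMD log (file name passed as $1).
# Assumption: the line has the layout "LDB: LOAD: AVG <x> MAX <y> ...".
ldb_summary() {
    awk '$1 == "LDB:" && $2 == "LOAD:" && $3 == "AVG" && $5 == "MAX" && $4 > 0 {
        printf "avg=%s max=%s ratio=%.2f\n", $4, $6, $6 / $4
    }' "$1"
}
```

Called as "ldb_summary debug.out", it prints one line per load
balancing step, so a sudden jump in the ratio just before the hang
would be easy to spot.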

When this happens, only one NAMD process is running (at 100% CPU); all
the other processes are in the sleep state. I've tried increasing and
decreasing the number of processes, but without success.
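
The process states described above can be collected from every node
with something like the following sketch. It assumes a nodelist made
of "host NAME" lines, passwordless ssh to each node, and a
procps-style ps; the function names are hypothetical.

```shell
# List the host names from a charmrun nodelist file
# (assumed format: lines of the form "host NAME [options]").
nodelist_hosts() {
    awk '$1 == "host" { print $2 }' "$1"
}

# Show PID, state, and CPU share of every namd2 process on each host.
# Assumes passwordless ssh and a ps that supports -C and -o (procps).
namd_states() {
    for h in $(nodelist_hosts "$1"); do
        echo "== $h"
        ssh "$h" ps -C namd2 -o pid,stat,pcpu,comm
    done
}
```

Running "namd_states ~/.nodelist" while the job hangs should show
which node holds the one busy process (state R) and which hold the
sleeping ones (state S).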

I've updated the BIOS on each machine, and all updates from the Fedora
Project are installed.

Does anyone know what could be causing this?

Thanks a lot,
Sebastian

