From: Cesar Luis Avila (cavila_at_fbqf.unt.edu.ar)
Date: Sat Aug 26 2006 - 16:40:15 CDT
There was a problem with the load balancer in a previous version of NAMD
(2.6.B1), for which there was a workaround on NAMD's wiki. In 2.6.B2 it
seems to be solved. I have run NAMD_2.6b2_Linux-amd64-TCP on AMD64
dual-core nodes for a week now and haven't experienced that problem. I
am still experiencing some problems which I think might be related to
charm++ or perhaps to the kernel itself. I suspect there is a problem
with memory management when using both processors of each node. I saw
these problems even with the APOA1 simulation. Unfortunately, I don't know
how to track down the problem. For now I am running simulations using only
one processor on each node to test this hypothesis.
I am using Debian Cluster Components (DCC) with a custom-compiled kernel.
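
One low-effort way to start tracking this down is to scan each NAMD log for the load-balancer message quoted below and note the last ENERGY step reached before it appeared. The following is only a minimal sketch: it assumes the log contains lines of the form "Info: Adjusted background load on N nodes." and "ENERGY: <step> ..." as shown in this thread, and the function name scan_namd_log is made up for illustration.

```python
import re

# Message format taken from the log lines quoted in this thread.
LB_RE = re.compile(r"Info: Adjusted background load on (\d+) nodes\.")


def scan_namd_log(lines):
    """Return (last_step, events) for an iterable of NAMD log lines.

    last_step is the step number of the last ENERGY line seen (or None).
    events is a list of (step_at_that_time, node_count) tuples, one per
    "Adjusted background load" message.
    """
    last_step = None
    events = []
    for line in lines:
        text = line.strip()
        if text.startswith("ENERGY:"):
            fields = text.split()
            # The first field after "ENERGY:" is the timestep.
            if len(fields) > 1 and fields[1].isdigit():
                last_step = int(fields[1])
            continue
        match = LB_RE.search(text)
        if match:
            events.append((last_step, int(match.group(1))))
    return last_step, events


# Hypothetical sample lines, for illustration only (not real output).
sample = [
    "ENERGY: 100 0.0 0.0",
    "Info: Adjusted background load on 11 nodes.",
    "ENERGY: 200 0.0 0.0",
]
print(scan_namd_log(sample))
```

Comparing the last step reached across several runs (one and two processors per node) would at least show whether the stalls cluster around load-balancing steps.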
Leandro Martínez wrote:
> Just to clarify the problem a little bit more:
> I have now put the simulation to run on a single node (the
> master machine), which has two processors. It starts
> running fine, with two jobs, one on each processor, each
> using almost all of the CPU, as expected,
> but eventually it prints the message:
> Info: Adjusted background load on 1 nodes.
> And the simulation starts running on only one processor.
> Any clue on what may be going wrong?
> On 8/25/06, Leandro Martínez <leandromartinez98_at_gmail.com> wrote:
> Hi all,
> I'm running a simulation with NAMD_2.6b2_Linux-amd64-TCP on
> a cluster of nine Athlon64 nodes (each processor is dual-core,
> so there are actually 18 processors). I'm having some
> strange problems with simulations I have already run on several
> other machines, and I have not been able to find a solution.
> Basically, I start running the simulation and eventually it either
> stops without printing any error message or, apparently, starts
> running on only one processor. The only message I have
> observed to be different from our previous runs is this one:
> Info: Adjusted background load on 11 nodes.
> It is printed the first time load balancing is performed. The
> error does not necessarily occur right after that, but the
> message may be part of the problem, since the simulation was
> set to run on 18 processors (9 nodes).
> The only time I got an error message, it was the one below, which,
> as you may note, was printed after quite a long simulation time.
> The error is not easily reproducible: it always happens,
> but not at the same point of the simulation every time.
> Any help or ideas will be appreciated.
> ENERGY: 644800 804.7671 2363.3700 1332.0255
> 131.9843 -201929.9812 17508.6136 0.0000
> 0.0000 32575.8361 -147213.3846 297.3932
> -147116.7637 -147117.4476 296.8970
> Stack Traceback:
>  /lib64/libc.so.6 [0x360b32f7c0]
>  _ZN11WorkDistrib12enqueueBondsEP12LocalWorkMsg+0x16 [0x727b16]
>  CkDeliverMessageFree+0x21 [0x785aab]
>  _Z15_processHandlerPvP11CkCoreState+0x455 [0x7850b5]
>  CsdScheduleForever+0xa2 [0x7f1752]
>  CsdScheduler+0x1c [0x7f1350]
>  _Z10slave_initiPPc+0x10 [0x4bb034]
>  _ZN7BackEnd4initEiPPc+0x28f [0x4bb019]
>  main+0x47 [0x4b697f]
>  __libc_start_main+0xf4 [0x360b31d084]
>  _ZNSt8ios_base4InitD1Ev+0x42 [0x4b2c9a]
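
The mangled C++ names in the traceback above are easier to read once decoded. Where binutils is installed, c++filt is the complete tool for this. As a rough sketch only, the hypothetical helper below (qualified_name is an illustrative name, not part of NAMD or charm++) extracts just the qualified function name from the simple Itanium-ABI symbols that appear here, ignoring parameter types, templates, and constructor/destructor markers.

```python
def qualified_name(symbol):
    """Extract the qualified name from a simple Itanium-ABI mangled symbol.

    Handles only plain and nested names made of length-prefixed
    identifiers, plus the "St" abbreviation for std::. Anything after the
    name (parameter types, "D1" destructor markers, ...) is ignored.
    Symbols that are not mangled are returned without their offset.
    """
    symbol = symbol.split("+")[0]  # drop a trailing "+0x16"-style offset
    if not symbol.startswith("_Z"):
        return symbol
    pos = 2
    if pos < len(symbol) and symbol[pos] == "N":  # nested-name marker
        pos += 1
    parts = []
    if symbol.startswith("St", pos):  # abbreviation for the std namespace
        parts.append("std")
        pos += 2
    # Each identifier is written as <decimal length><characters>.
    while pos < len(symbol) and symbol[pos].isdigit():
        end = pos
        while end < len(symbol) and symbol[end].isdigit():
            end += 1
        length = int(symbol[pos:end])
        parts.append(symbol[end:end + length])
        pos = end + length
    return "::".join(parts) if parts else symbol


for frame in [
    "_ZN11WorkDistrib12enqueueBondsEP12LocalWorkMsg+0x16",
    "_Z15_processHandlerPvP11CkCoreState+0x455",
    "_ZN7BackEnd4initEiPPc+0x28f",
]:
    print(qualified_name(frame))
```

Read this way, the topmost NAMD frame in the traceback is WorkDistrib::enqueueBonds, reached through the charm++ scheduler (CsdScheduler, CsdScheduleForever).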