Re: Re: Info: Adjusted background load on 11 nodes.

From: Cesar Luis Avila (cavila_at_fbqf.unt.edu.ar)
Date: Sat Aug 26 2006 - 17:00:08 CDT

I have also noticed that running one thread per node using 6 nodes is
faster than running two threads per node using only 3 nodes.
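
For anyone who wants to check where the load balancer message shows up in their own runs, here is a minimal sketch in Python. It is not part of NAMD. The function name is made up for illustration, and the two patterns are based only on the log lines quoted further down in this thread, so they may need adjusting for other NAMD versions.

```python
import re

# Patterns based only on the two kinds of log lines quoted in this thread;
# other NAMD versions may format these lines differently (assumption).
LOAD_RE = re.compile(r"^Info: Adjusted background load on (\d+) nodes\.")
ENERGY_RE = re.compile(r"^ENERGY:\s+(\d+)\b")


def find_load_adjustments(lines):
    """Return a list of (last_energy_step, node_count) tuples.

    last_energy_step is the step number of the most recent ENERGY line
    seen before the message, or None if no ENERGY line came first.
    """
    last_step = None
    events = []
    for raw in lines:
        line = raw.strip()
        energy = ENERGY_RE.match(line)
        if energy:
            last_step = int(energy.group(1))
            continue
        load = LOAD_RE.match(line)
        if load:
            events.append((last_step, int(load.group(1))))
    return events
```

It accepts any iterable of lines, so an open log file can be passed directly, for example `find_load_adjustments(open("run.log"))`, where `run.log` is a placeholder name. Comparing the reported node count with the number of nodes actually requested may help narrow down when the run drops to a single processor.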

Cesar Luis Avila wrote:
> There was a problem with the load balancer in a previous version of NAMD
> (2.6.B1), for which there was a workaround on NAMD's wiki. In 2.6.B2 it
> seems to be solved. I have run NAMD_2.6b2_Linux-amd64-TCP on AMD64
> dual-core nodes for a week now and haven't experienced that problem. I
> am still experiencing some problems which I think might be related to
> charm++ or perhaps to the kernel itself. I suspect there is a problem
> with memory management when using both processors of each node. I saw
> these problems even on the APOA1 simulation. Unfortunately I don't know
> how to track down the problem. For now I am running simulations using
> only one processor on each node to test this hypothesis.
> I am using Debian Cluster Components (DCC) with a custom-compiled kernel
> 2.6.16.20 SMP.
>
> Regards
> Cesar
>
>
> Leandro Martínez wrote:
>>
>> Just to clarify the problem a little more:
>> I have now put the simulation to run on a single node (the
>> master machine), which has two processors. It starts
>> running fine, with two jobs, one on each processor, each
>> using almost all of the CPU, as expected,
>> but eventually it returns the message:
>>
>> Info: Adjusted background load on 1 nodes.
>>
>> And the simulation starts running on only one processor.
>> Any clue on what may be going wrong?
>> Thanks,
>> Leandro.
>>
>>
>>
>>
>>
>> On 8/25/06, Leandro Martínez <leandromartinez98_at_gmail.com> wrote:
>>
>>
>> Hi all,
>> I'm running a simulation with NAMD_2.6b2_Linux-amd64-TCP on
>> a cluster of nine Athlon64 nodes (each processor is dual-core,
>> so there are actually 18 processors). I'm having some
>> strange problems with simulations I have already run on several
>> other machines, and I have not been able to find a solution.
>> Basically, I start running the simulation and eventually it either
>> stops without printing any error message or apparently starts
>> running on only one processor. The only message I have
>> observed to be different from our previous runs is this one:
>>
>> Info: Adjusted background load on 11 nodes.
>>
>> That is printed the first time load balancing is performed. The
>> error does not necessarily occur right after that, but the
>> message may be part of the problem, since the simulation was
>> set to run on 18 processors (9 nodes).
>>
>> The only time I got an error message, it was the one below, which,
>> as you can see, was printed after quite a long simulation time.
>> The error is not easily reproducible: it always happens,
>> but not at the same point of the simulation every time.
>> Any help or idea will be appreciated.
>> Leandro.
>>
>>
>> ENERGY: 644800 804.7671 2363.3700 1332.0255
>> 131.9843 -201929.9812 17508.6136 0.0000
>> 0.0000 32575.8361 -147213.3846 297.3932
>> -147116.7637 -147117.4476 296.8970
>>
>> Stack Traceback:
>> [0] /lib64/libc.so.6 [0x360b32f7c0]
>> [1] _ZN17ComputeHomeTuplesI8BondElem4bond9BondValueE10loadTuplesEv+0x4cc [0x516204]
>> [2] _ZN17ComputeHomeTuplesI8BondElem4bond9BondValueE6doWorkEv+0x5c4 [0x519960]
>> [3] _ZN11WorkDistrib12enqueueBondsEP12LocalWorkMsg+0x16 [0x727b16]
>> [4] _ZN19CkIndex_WorkDistrib31_call_enqueueBonds_LocalWorkMsgEPvP11WorkDistrib+0xf [0x727afd]
>> [5] CkDeliverMessageFree+0x21 [0x785aab]
>> [6] _Z15_processHandlerPvP11CkCoreState+0x455 [0x7850b5]
>> [7] CsdScheduleForever+0xa2 [0x7f1752]
>> [8] CsdScheduler+0x1c [0x7f1350]
>> [9] _Z10slave_initiPPc+0x10 [0x4bb034]
>> [10] _ZN7BackEnd4initEiPPc+0x28f [0x4bb019]
>> [11] main+0x47 [0x4b697f]
>> [12] __libc_start_main+0xf4 [0x360b31d084]
>> [13] _ZNSt8ios_base4InitD1Ev+0x42 [0x4b2c9a]
>>
>>
>>
>>
>
>
