Re: Re: Info: Adjusted background load on 11 nodes.

From: Leandro Martínez (leandromartinez98_at_gmail.com)
Date: Mon Aug 28 2006 - 12:21:49 CDT

Hi Cesar,
Thanks for your reply. I'm using the latest executable version
of namd2, and I tried both the TCP and the other AMD64 executable. The
load-balancing error does not seem to be happening anymore, but the
scaling is not as good as it was in our Opteron cluster. On this
cluster, running two processes per node is no faster than running the
same number of processes with one on each node, as you have observed.
We are using Fedora 5 with kernel 2.6.18-rc2 #13 SMP.
We are still testing things, and this has been very hard
because the errors are not deterministic.
I eventually get errors like:

Charmrun: error on request socket--
Socket closed before recv.

For which I have no clue.
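
For reference, this is roughly how the two launch modes I compared
above look on our side (the host names, nodelist file, and config file
below are just placeholders, not our actual setup):

  # nodelist for the charm++ net/TCP build: charmrun cycles through
  # these hosts when assigning processes
  group main
    host node1
    host node2
    host node3
    host node4
    host node5
    host node6

  # 6 processes spread one per node
  ./charmrun +p6 ++nodelist nodelist ./namd2 sim.namd

  # 6 processes packed two per node: keep only node1-node3 in the
  # nodelist (or list each host twice) and launch with +p6 again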
Leandro.

On 8/26/06, Cesar Luis Avila <cavila_at_fbqf.unt.edu.ar> wrote:
>
> I have also noticed that running one thread per node using 6 nodes is
> faster than running two threads per node using only 3 nodes.
>
> > Cesar Luis Avila wrote:
> > There was a problem with the load balancer in a previous version of NAMD
> > (2.6b1) for which there was a workaround on NAMD's wiki. In 2.6b2 it
> > seems to be solved. I have run NAMD_2.6b2_Linux-amd64-TCP on AMD64
> > dual-core nodes for a week now and haven't experienced that problem. I
> > am still experiencing some problems which I think might be related to
> > charm++ or perhaps to the kernel itself. I suspect there is a problem
> > with memory management when using both processors of each node. I saw
> > these problems even on the APOA1 simulation. Unfortunately I don't know
> > how to track down the problem. For now I am running simulations using
> > only one processor on each node to test this hypothesis.
> > I am using Debian Cluster Components (DCC) with a custom-compiled kernel
> > 2.6.16.20 SMP.
> >
> > Regards
> > Cesar
> >
> >
> > Leandro Martínez wrote:
> >>
> >> Just to clarify the problem a little more.
> >> Now I have set the simulation to run on a single node (the
> >> master machine), which has two processors. It starts
> >> running fine, with two processes, one on each processor,
> >> each using almost all of the CPU, as expected,
> >> but eventually it printed the message:
> >>
> >> Info: Adjusted background load on 1 nodes.
> >>
> >> And the simulation then started running on only one processor.
> >> Any clue on what may be going wrong?
> >> Thanks,
> >> Leandro.
> >>
> >>
> >>
> >>
> >>
> >> On 8/25/06, Leandro Martínez <leandromartinez98_at_gmail.com> wrote:
> >>
> >>
> >> Hi all,
> >> I'm running a simulation with NAMD_2.6b2_Linux-amd64-TCP on
> >> a cluster of nine Athlon64 nodes (each node has a dual-core
> >> processor, so there are actually 18 processors). I'm having some
> >> strange problems with simulations I have already run on several
> >> other machines, and I have not been able to find a solution.
> >> Basically, I start the simulation and eventually it either
> >> stops without printing any error message or apparently starts
> >> running on only one processor. The only message I have
> >> observed that differs from our previous runs is this one:
> >>
> >> Info: Adjusted background load on 11 nodes.
> >>
> >> That is printed the first time load balancing is performed. The
> >> error does not necessarily occur right after that, but the message
> >> itself may be part of the problem, since the simulation was set to
> >> run on 18 processors (9 nodes), not the 11 nodes it mentions.
> >>
> >> The only time I got an error message it was the one below, which,
> >> as you may note, was printed after quite a long simulation time.
> >> The error is not easily reproducible: it always happens,
> >> but not at the same point of the simulation every time.
> >> Any help or ideas will be appreciated.
> >> Leandro.
> >>
> >>
> >> ENERGY: 644800 804.7671 2363.3700 1332.0255 131.9843 -201929.9812 17508.6136 0.0000 0.0000 32575.8361 -147213.3846 297.3932 -147116.7637 -147117.4476 296.8970
> >>
> >> Stack Traceback:
> >> [0] /lib64/libc.so.6 [0x360b32f7c0]
> >> [1] _ZN17ComputeHomeTuplesI8BondElem4bond9BondValueE10loadTuplesEv+0x4cc [0x516204]
> >> [2] _ZN17ComputeHomeTuplesI8BondElem4bond9BondValueE6doWorkEv+0x5c4 [0x519960]
> >> [3] _ZN11WorkDistrib12enqueueBondsEP12LocalWorkMsg+0x16 [0x727b16]
> >> [4] _ZN19CkIndex_WorkDistrib31_call_enqueueBonds_LocalWorkMsgEPvP11WorkDistrib+0xf [0x727afd]
> >> [5] CkDeliverMessageFree+0x21 [0x785aab]
> >> [6] _Z15_processHandlerPvP11CkCoreState+0x455 [0x7850b5]
> >> [7] CsdScheduleForever+0xa2 [0x7f1752]
> >> [8] CsdScheduler+0x1c [0x7f1350]
> >> [9] _Z10slave_initiPPc+0x10 [0x4bb034]
> >> [10] _ZN7BackEnd4initEiPPc+0x28f [0x4bb019]
> >> [11] main+0x47 [0x4b697f]
> >> [12] __libc_start_main+0xf4 [0x360b31d084]
> >> [13] _ZNSt8ios_base4InitD1Ev+0x42 [0x4b2c9a]
> >>
> >>
> >>
> >>
> >
> >
>
>
>

This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 15:42:30 CST