Re: Re: Info: Adjusted background load on 11 nodes.

From: Leandro Martínez (leandromartinez98_at_gmail.com)
Date: Tue Aug 29 2006 - 07:13:51 CDT

Thanks Dow, we will try that.
Another problem we are having is that the simulations eventually
stop. They stop in the following sense: a run was using 18 processors
on 9 nodes and suddenly, without printing any error, it started
running on a single processor on the master node. Actually it does
not seem to be running at all, because the simulation does not
advance anymore, but there is one namd2 process using 100% of one
CPU.
Any hint on what could be the cause of this problem?
Thanks,
Leandro.

-------------------------------------------------------------------
Leandro Martinez
Institute of Chemistry,
State University of Campinas, Brazil
http://www.ime.unicamp.br/~martinez/packmol
-------------------------------------------------------------------
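
P.S.: In case it is useful to anyone debugging something similar, the
rough Python sketch below shows the kind of check that can be run from
the master node when the run stalls, to see whether the namd2 processes
on the other nodes are still alive and whether there is unexpected
background load. The hostnames are only placeholders, passwordless ssh
is assumed, and this is just an illustration, not part of NAMD.

# Sketch: from the master node, check every compute node for running
# namd2 processes and for its load average (assumes passwordless ssh;
# hostnames below are placeholders).
import subprocess

NODES = ["node01", "node02", "node03"]  # placeholder hostnames

def remote(host, command):
    """Run a command on a node over ssh and return its output as text."""
    try:
        out = subprocess.check_output(["ssh", host, command],
                                      stderr=subprocess.STDOUT)
        return out.decode().strip()
    except subprocess.CalledProcessError as err:
        # pgrep exits nonzero when nothing matches; keep its output anyway
        return err.output.decode().strip()

for node in NODES:
    namd_count = remote(node, "pgrep -c namd2")   # number of namd2 processes
    loadavg = remote(node, "cat /proc/loadavg")   # 1/5/15-minute load averages
    print(f"{node:8s}  namd2 processes: {namd_count:3s}  load: {loadavg}")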

On 8/28/06, Dow Hurst <Dow.Hurst_at_mindspring.com> wrote:
> Leandro,
> Your latest error is listed on the charm++ site as an error on the
> network stack. Typically an ethernet card driver is not playing well.
> We had similar problems and had to upgrade the driver software for
> our ethernet cards.
> Best wishes,
> Dow
>
> Leandro Martínez wrote:
> >
> > Hi Cesar,
> > Thanks for your reply. I'm using the latest executable version
> > of namd2, and I tried both the TCP and the other AMD64 executable.
> > The load-balancing error does not appear to be happening anymore,
> > but the scaling is not as good as it was on our Opteron cluster.
> > On our cluster, as you have observed, running two processes on the
> > same node is not faster than running the same number of processes
> > one on each node.
> > We are using Fedora 5 with kernel 2.6.18-rc2 #13 SMP.
> > We are still testing things, and it is proving very hard because
> > the errors are not deterministic.
> > I eventually get errors like:
> >
> > Charmrun: error on request socket--
> > Socket closed before recv.
> >
> > I have no clue what is causing these.
> > Leandro.
> >
> >
> >
> > On 8/26/06, Cesar Luis Avila <cavila_at_fbqf.unt.edu.ar> wrote:
> >
> > I have also noticed that running one thread per node using 6 nodes is
> > faster than running two threads per node using only 3 nodes.
> >
> > Cesar Luis Avila wrote:
> > > There was a problem with the load balancer on a previous version
> > > of NAMD (2.6b1) for which there was a workaround on NAMD's wiki.
> > > On 2.6b2 it seems to be solved. I have run NAMD_2.6b2_Linux-amd64-TCP
> > > on AMD64 dual-core nodes for a week now and haven't experienced that
> > > problem. I am still experiencing some problems which I think might
> > > be related to charm++ or perhaps to the kernel itself. I suspect
> > > there is a problem with memory management when using both processors
> > > of each node. I saw these problems even on the APOA1 simulation.
> > > Unfortunately I don't know how to track down the problem. For now I
> > > am running simulations using only one processor on each node to test
> > > this hypothesis.
> > > I am using Debian Cluster Components (DCC) with a custom-compiled
> > > kernel, 2.6.16.20 SMP.
> > >
> > > Regards
> > > Cesar
> > >
> > >
> > > Leandro Martínez wrote:
> > >>
> > >> Just to clarify the problem a little bit more.
> > >> Now I have set the simulation to run on a single node (the
> > >> master machine), which has two processors. It starts running
> > >> fine, two jobs, each one on one processor and using almost all
> > >> of the CPU, as expected, but eventually it prints the message:
> > >>
> > >> Info: Adjusted background load on 1 nodes.
> > >>
> > >> and the simulation starts running on only one processor.
> > >> Any clue about what may be going wrong?
> > >> Thanks,
> > >> Leandro.
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> On 8/25/06, Leandro Martínez <leandromartinez98_at_gmail.com> wrote:
> > >>
> > >>
> > >> Hi all,
> > >> I'm running a simulation with NAMD_2.6b2_Linux-amd64-TCP on
> > >> a cluster of nine Athlon64 nodes (each node has a dual-core
> > >> processor, so there are actually 18 processors). I'm having some
> > >> strange problems with simulations I have already run on several
> > >> other machines, and I have not been able to find a solution.
> > >> Basically, I start the simulation and eventually it either stops
> > >> without printing any error message or it apparently starts running
> > >> on only one processor. The only message I have observed to be
> > >> different from our previous runs is this one:
> > >>
> > >> Info: Adjusted background load on 11 nodes.
> > >>
> > >> which is printed the first time load balancing is performed. The
> > >> error does not necessarily occur right after that, but the message
> > >> may be part of the problem, since the simulation was set to run on
> > >> 18 processors (9 nodes).
> > >>
> > >> The only time I got an error message it was the one below, which,
> > >> as you may note, was printed after quite a long simulation time.
> > >> The error is not easily reproducible: it always happens, but not
> > >> every time at the same point of the simulation.
> > >> Any help or ideas will be appreciated.
> > >> Leandro.
> > >>
> > >>
> > >> ENERGY: 644800 804.7671 2363.3700 1332.0255 131.9843
> > >> -201929.9812 17508.6136 0.0000 0.0000 32575.8361
> > >> -147213.3846 297.3932 -147116.7637 -147117.4476 296.8970
> > >>
> > >> Stack Traceback:
> > >> [0] /lib64/libc.so.6 [0x360b32f7c0]
> > >> [1] _ZN17ComputeHomeTuplesI8BondElem4bond9BondValueE10loadTuplesEv+0x4cc [0x516204]
> > >> [2] _ZN17ComputeHomeTuplesI8BondElem4bond9BondValueE6doWorkEv+0x5c4 [0x519960]
> > >> [3] _ZN11WorkDistrib12enqueueBondsEP12LocalWorkMsg+0x16 [0x727b16]
> > >> [4] _ZN19CkIndex_WorkDistrib31_call_enqueueBonds_LocalWorkMsgEPvP11WorkDistrib+0xf [0x727afd]
> > >> [5] CkDeliverMessageFree+0x21 [0x785aab]
> > >> [6] _Z15_processHandlerPvP11CkCoreState+0x455 [0x7850b5]
> > >> [7] CsdScheduleForever+0xa2 [0x7f1752]
> > >> [8] CsdScheduler+0x1c [0x7f1350]
> > >> [9] _Z10slave_initiPPc+0x10 [0x4bb034]
> > >> [10] _ZN7BackEnd4initEiPPc+0x28f [0x4bb019]
> > >> [11] main+0x47 [0x4b697f]
> > >> [12] __libc_start_main+0xf4 [0x360b31d084]
> > >> [13] _ZNSt8ios_base4InitD1Ev+0x42 [0x4b2c9a]
> > >>
> > >>
> > >>
> > >>
> > >
> > >
> >
> >
> >
>
>
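
One more note, on the "Charmrun: error on request socket-- Socket closed
before recv." message quoted above: since Dow suggests the ethernet
driver may be at fault, a quick way to look for trouble is to watch the
error and drop counters of each node's network interfaces. The sketch
below is only an illustration under the same assumptions as above
(placeholder hostnames, passwordless ssh, standard Linux /proc/net/dev
layout).

# Sketch: report RX/TX error and drop counters from /proc/net/dev on
# every node (assumes passwordless ssh; hostnames are placeholders).
import subprocess

NODES = ["node01", "node02", "node03"]  # placeholder hostnames

def read_net_dev(host):
    """Return the contents of /proc/net/dev on a node."""
    return subprocess.check_output(["ssh", host, "cat /proc/net/dev"]).decode()

for node in NODES:
    # The first two lines of /proc/net/dev are column headers.
    for line in read_net_dev(node).splitlines()[2:]:
        iface, stats = line.split(":", 1)
        f = stats.split()
        rx_errs, rx_drop = f[2], f[3]     # receive errors / drops
        tx_errs, tx_drop = f[10], f[11]   # transmit errors / drops
        print(f"{node} {iface.strip():>6s}  rx errs/drop: {rx_errs}/{rx_drop}"
              f"  tx errs/drop: {tx_errs}/{tx_drop}")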

This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 15:42:31 CST