Re: load balancer, athlon 64 dual core

From: Leandro Martínez (leandromartinez98_at_gmail.com)
Date: Thu Sep 14 2006 - 14:51:07 CDT

Well, Jim, the messages printed before it hangs are below, but I don't
see anything strange. Actually, the only unusual message I get
that does not appear for the same simulation running on another cluster
is

Info: Adjusted background load on 9 nodes.

Sometimes the number of nodes reported is wrong (there are indeed 9 nodes).
But this message appears at the beginning of the simulation, not
just before the hang.
Thanks,
Leandro.

Info: Initial time: 18 CPUs 0.135476 s/step 1.04534 days/ns 65456 kB memory
Info: Initial time: 18 CPUs 0.135342 s/step 1.04431 days/ns 65456 kB memory
Info: Initial time: 18 CPUs 0.135558 s/step 1.04597 days/ns 65456 kB memory
Info: Benchmark time: 18 CPUs 0.137251 s/step 1.05904 days/ns 65456 kB memory
Info: Benchmark time: 18 CPUs 0.137215 s/step 1.05876 days/ns 65456 kB memory
Info: Benchmark time: 18 CPUs 0.140959 s/step 1.08765 days/ns 65456 kB memory
RESCALING VELOCITIES AT STEP 396000 FROM AVERAGE TEMPERATURE OF 397.151 TO 397.523 KELVIN.
TIMING: 396000 CPU: 48896.9, 0.135488/step Wall: 52008, 0.145937/step, 0 hours remaining, 65456 kB of memory in use.
ENERGY: 396000 24240.6031 15456.2567 1488.8839 166.7119 -222846.9949 18767.1476 0.0000 0.0000 70453.3697 -92274.0220 397.1608 -90761.3592 -90769.4150 397.1501 -191.1329 -371.5830 636056.0000 -298.5692 -299.0344

TCL: Setting parameter rescaleTemp to 397.648743711
TCL: Running for 1000 steps
ENERGY: 396000 24240.6031 15456.2567 1488.8839 166.7119 -222846.9949 18767.1476 0.0000 0.0000 70453.3697 -92274.0220 397.1608 -90761.3592 -90768.9115 397.1608 -191.1329 -371.5830 636056.0000 -191.1329 -371.5830

LDB: LOAD: AVG 13.6148 MAX 19.9645 MSGS: TOTAL 229 MAXC 14 MAXP 6 None

-----------Hangs up...
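
For anyone who wants to compare runs, here is a minimal sketch (a
hypothetical helper, not part of NAMD) that pulls the AVG and MAX values
out of the LDB line above and reports their ratio. It assumes AVG and MAX
are the average and maximum per-processor loads, which is my reading of
the line and not something confirmed in this thread. The name
ldb_imbalance and the regular expression are made up for illustration.

```python
import re

# Hypothetical helper, not part of NAMD: matches the AVG and MAX fields
# of an "LDB: LOAD:" line as printed in the log above.
LDB_RE = re.compile(r"LDB: LOAD: AVG (?P<avg>[\d.]+) MAX (?P<max>[\d.]+)")


def ldb_imbalance(line):
    """Return (avg, max, max/avg) for an LDB LOAD line, or None if no match."""
    m = LDB_RE.search(line)
    if m is None:
        return None
    avg = float(m.group("avg"))
    mx = float(m.group("max"))
    return avg, mx, mx / avg


# The last LDB line printed before the hang, copied from the log above.
line = "LDB: LOAD: AVG 13.6148 MAX 19.9645 MSGS: TOTAL 229 MAXC 14 MAXP 6 None"
avg, mx, ratio = ldb_imbalance(line)
print(f"avg={avg} max={mx} max/avg={ratio:.2f}")
```

Under that assumption, a ratio well above 1 would mean the most loaded
processor carries noticeably more work than the average one.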

On 9/13/06, Jim Phillips <jim_at_ks.uiuc.edu> wrote:
>
>
> It's definitely getting stuck in some kind of load balancer loop. Is
> there any output just before the hang? You should at least get one LDB
> line that might be useful.
>
> -Jim
>
>
> On Wed, 13 Sep 2006, Leandro Martínez wrote:
>
> > Hi all,
> > I'm still having problems running namd2 on our Athlon 64 Dual Core
> > machines. The simulation runs well up to a point where all processes
> > except one stop, leaving a single process running on a single CPU.
> > The simulation does not crash, but it does not continue either, and
> > this single process appears to run forever doing something I cannot
> > identify.
> >
> > Now, as Jim suggested, I have attached gdb to this process. I have
> > never used it before, but I was able to get the information below. Any
> > help is appreciated. I believe the bolded output below is the one
> > referring to the namd2 process.
> >
> > ------------ OUTPUT FROM GDB: --------------------------
> >
> > Attaching to program: /usr/bin/namd2, process 19438
> > Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
> > Loaded symbols for /lib64/libdl.so.2
> > Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
> > Loaded symbols for /lib64/libm.so.6
> > Reading symbols from /usr/lib64/libstdc++.so.6...(no debugging symbols found)...done.
> > Loaded symbols for /usr/lib64/libstdc++.so.6
> > Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
> > Loaded symbols for /lib64/libc.so.6
> > Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
> > Loaded symbols for /lib64/ld-linux-x86-64.so.2
> > Reading symbols from /lib64/libgcc_s.so.1...(no debugging symbols found)...done.
> > Loaded symbols for /lib64/libgcc_s.so.1
> > Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.
> > Loaded symbols for /lib64/libnss_files.so.2
> > 0x0000000000714caa in Set::find ()
> > (gdb) next
> > Single stepping until exit from function _ZN3Set4findEP10InfoRecord,
> > which has no line number information.
> > 0x00000000006f407f in Rebalancer::numAvailable ()
> > (gdb) next
> > Single stepping until exit from function _ZN10Rebalancer12numAvailableEP11computeInfoP13processorInfoPiS4_S4_,
> > which has no line number information.
> > 0x00000000006f3f34 in Rebalancer::refine_togrid ()
> > (gdb) next
> > Single stepping until exit from function _ZN10Rebalancer13refine_togridERA3_A3_A2_NS_6pcpairEdP13processorInfoP11computeInfo,
> > which has no line number information.
> > 0x00000000006f23b5 in Rebalancer::refine ()
> > (gdb) next
> > Single stepping until exit from function _ZN10Rebalancer6refineEv,
> > which has no line number information.
> > -----------------------------------------------------------------------
> > From this point on nothing happens.
> >
> > Thank you very much,
> > Leandro.
> >
> >
> >
> >
> > --------------------------------------------------------------------
> > Leandro Martinez
> > Institute of Chemistry
> > State University of Campinas, Brazil
> > http://www.ime.unicamp.br/~martinez/packmol
> > --------------------------------------------------------------------
> >
>
