Re: load balancer, athlon 64 dual core

From: Jim Phillips (jim_at_ks.uiuc.edu)
Date: Thu Sep 14 2006 - 23:15:51 CDT

Thanks. The fact that the "LDB: ... None" line was printed indicates that
load balancing started but did not complete. I can't tell anything else
from this output.
The background load adjustment message is normal.
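
A rough way to check a log for this pattern, sketched in Python: find the
last LDB line and see whether any output follows it. The function name
last_ldb_status and the sample list are illustrative assumptions, not part
of NAMD; the sample LDB line is copied from the log quoted below. An empty
"trailing" list on a job that is still running is only consistent with a
stall inside load balancing; it does not prove one.

```python
def last_ldb_status(lines):
    """Return (last LDB line, non-blank lines printed after it).

    Returns (None, []) when the log contains no LDB line at all.
    """
    last_idx = None
    for i, line in enumerate(lines):
        # LDB summary lines start with "LDB:" in the log quoted in this thread.
        if line.lstrip().startswith("LDB:"):
            last_idx = i
    if last_idx is None:
        return None, []
    trailing = [l.rstrip("\n") for l in lines[last_idx + 1:] if l.strip()]
    return lines[last_idx].strip(), trailing


# Hypothetical log tail; with a real log use open(path).readlines() instead.
sample = [
    "TCL: Running for 1000 steps",
    "LDB: LOAD: AVG 13.6148 MAX 19.9645 MSGS: TOTAL 229 MAXC 14 MAXP 6 None",
    "",
]

ldb, trailing = last_ldb_status(sample)
if ldb is not None and not trailing:
    print("No output after the last LDB line:", ldb)
```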

-Jim

On Thu, 14 Sep 2006, Leandro Martínez wrote:

> Well, Jim, the messages printed before it hangs are below, but I don't
> see anything strange. Actually, the only strange message I get
> that does not appear for the same simulation running on another cluster
> is
>
> Info: Adjusted background load on 9 nodes.
>
> Sometimes the number of nodes is wrong (there are 9 nodes indeed).
> But this message appears at the beginning of the simulation, not
> before the hang up.
> Thanks,
> Leandro.
>
>
>
>
>
> Info: Initial time: 18 CPUs 0.135476 s/step 1.04534 days/ns 65456 kB memory
> Info: Initial time: 18 CPUs 0.135342 s/step 1.04431 days/ns 65456 kB memory
> Info: Initial time: 18 CPUs 0.135558 s/step 1.04597 days/ns 65456 kB memory
> Info: Benchmark time: 18 CPUs 0.137251 s/step 1.05904 days/ns 65456 kB memory
> Info: Benchmark time: 18 CPUs 0.137215 s/step 1.05876 days/ns 65456 kB memory
> Info: Benchmark time: 18 CPUs 0.140959 s/step 1.08765 days/ns 65456 kB memory
> RESCALING VELOCITIES AT STEP 396000 FROM AVERAGE TEMPERATURE OF 397.151 TO 397.523 KELVIN.
> TIMING: 396000 CPU: 48896.9, 0.135488/step Wall: 52008, 0.145937/step, 0 hours remaining, 65456 kB of memory in use.
> ENERGY: 396000 24240.6031 15456.2567 1488.8839 166.7119
> -222846.9949 18767.1476 0.0000 0.0000 70453.3697
> -92274.0220 397.1608 -90761.3592 -90769.4150 397.1501
> -191.1329 -371.5830 636056.0000 -298.5692 -299.0344
>
> TCL: Setting parameter rescaleTemp to 397.648743711
> TCL: Running for 1000 steps
> ENERGY: 396000 24240.6031 15456.2567 1488.8839 166.7119
> -222846.9949 18767.1476 0.0000 0.0000 70453.3697
> -92274.0220 397.1608 -90761.3592 -90768.9115 397.1608
> -191.1329 -371.5830 636056.0000 -191.1329 -371.5830
>
> LDB: LOAD: AVG 13.6148 MAX 19.9645 MSGS: TOTAL 229 MAXC 14 MAXP 6 None
>
> -----------Hangs up...
>
>
>
>
> On 9/13/06, Jim Phillips <jim_at_ks.uiuc.edu> wrote:
>>
>>
>> It's definitely getting stuck in some kind of load balancer loop. Is
>> there any output just before the hang? You should at least get one LDB
>> line that might be useful.
>>
>> -Jim
>>
>>
>> On Wed, 13 Sep 2006, Leandro Martínez wrote:
>>
>> > Hi all,
>> > I'm still having problems running namd2 on our Athlon 64 Dual Core
>> > machines. The problem is that the simulation runs well up to a point
>> > where all processes except one stop, and I am left with a single
>> > process running on a single CPU. The simulation does not crash, but it
>> > does not continue either, and this single process appears to run
>> > forever doing something I cannot identify.
>> >
>> > Now, as Jim suggested, I have attached gdb to this process. I have
>> > never used it before, but I could get the information below. Any help
>> > is appreciated. I believe the bolded output below is the part that
>> > refers to the namd2 process.
>> >
>> > ------------ OUTPUT FROM GDB: --------------------------
>> >
>> > Attaching to program: /usr/bin/namd2, process 19438
>> > Reading symbols from /lib64/libdl.so.2...(no debugging symbols
>> > found)...done.
>> > Loaded symbols for /lib64/libdl.so.2
>> > Reading symbols from /lib64/libm.so.6...(no debugging symbols
>> > found)...done.
>> > Loaded symbols for /lib64/libm.so.6
>> > Reading symbols from /usr/lib64/libstdc++.so.6...(no debugging symbols
>> > found)...done.
>> > Loaded symbols for /usr/lib64/libstdc++.so.6
>> > Reading symbols from /lib64/libc.so.6...
>> > (no debugging symbols found)...done.
>> > Loaded symbols for /lib64/libc.so.6
>> > Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols
>> > found)...done.
>> > Loaded symbols for /lib64/ld-linux-x86-64.so.2
>> > Reading symbols from /lib64/libgcc_s.so.1...(no debugging symbols
>> > found)...done.
>> > Loaded symbols for /lib64/libgcc_s.so.1
>> > Reading symbols from /lib64/libnss_files.so.2...
>> > (no debugging symbols found)...done.
>> > Loaded symbols for /lib64/libnss_files.so.2
>> > 0x0000000000714caa in Set::find ()
>> > (gdb) next
>> > Single stepping until exit from function _ZN3Set4findEP10InfoRecord,
>> > which has no line number information.
>> > 0x00000000006f407f in Rebalancer::numAvailable ()
>> > (gdb) next
>> > Single stepping until exit from function
>> > _ZN10Rebalancer12numAvailableEP11computeInfoP13processorInfoPiS4_S4_,
>> > which has no line number information.
>> > 0x00000000006f3f34 in Rebalancer::refine_togrid ()
>> > (gdb) next
>> > Single stepping until exit from function
>> > _ZN10Rebalancer13refine_togridERA3_A3_A2_NS_6pcpairEdP13processorInfoP11computeInfo,
>> > which has no line number information.
>> > 0x00000000006f23b5 in Rebalancer::refine ()
>> > (gdb) next
>> > Single stepping until exit from function _ZN10Rebalancer6refineEv,
>> > which has no line number information.
>> > -----------------------------------------------------------------------
>> > From this point on nothing happens.
>> >
>> > Thank you very much,
>> > Leandro.
>> >
>> >
>> >
>> >
>> > --------------------------------------------------------------------
>> > Leandro Martinez
>> > Institute of Chemistry
>> > State University of Campinas, Brazil
>> > http://www.ime.unicamp.br/~martinez/packmol
>> > --------------------------------------------------------------------
>> >
>>
>

This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 15:42:35 CST