Re: NAMD "freezes" up

From: Eric R Johnson (bioinformaticist_at_mn.rr.com)
Date: Tue May 11 2004 - 08:35:51 CDT

Thanks for your input, Oscar. However, I have lm_sensors installed and
have not seen the CPU temperature go above 55C. All of the voltages and
temperatures seem to be well within range.
Thanks again,
Eric

Oscar Moran wrote:

> I had the same problem. I discovered that it was caused by the CPU
> overheating, because the fan did not work properly. Even if it sounds
> stupid, I suggest checking that first. It took me one month to
> detect the problem.
>
> oscar
>
> At 04.20 11/05/2004 -0500, you wrote:
>
>> Hello,
>>
>> I have a problem that I am hoping someone on the list has seen
>> before. I am currently running NAMD 2.5 (which I compiled myself
>> using gcc) on a Scyld (Version 28) cluster with dual Athlon MP 2600+
>> processors on Tyan Tiger S2466 motherboards using 2GB of memory
>> (registered ECC). Originally, the cluster had 4 nodes with gigabit
>> ethernet (Intel cards and Linksys switch) and local hard drives. On
>> these nodes, NAMD has been running perfectly. I recently added 2
>> more nodes, which are identical to the original ones (plugged into
>> the same switch), except that they are diskless and I am using the
>> on-board LAN. If I run a NAMD job on the new nodes, the jobs will
>> randomly "freeze up". This always occurs when NAMD is attempting to
>> write a file to the RAID located on the master node. For example,
>>
>> ENERGY: 23400 0.0000 0.0000 0.0000
>> 0.0000 -181725.1627 15283.0043 0.0000
>> 0.0000 19106.0771 -147336.0813 250.5568 -147295.2153
>> -147296.2616 252.0297 -126.2688 -94.8730
>> 402402.0000 70.7268 70.7469
>>
>> ENERGY: 23500 0.0000 0.0000 0.0000
>> 0.0000 -181764.1586 15145.8883 0.0000
>> 0.0000 19280.5044 -147337.7658 252.8442 -147296.9578
>> -147297.6876 251.5467 -378.1111 -348.5406
>> 402402.0000 -90.2697 -90.2879
>>
>> WRITING EXTENDED SYSTEM TO RESTART FILE AT STEP 23500
>> WRITING COORDINATES TO DCD FILE AT STEP 23500
>> WRITING COORDINATES TO RESTART FILE AT STEP 23500
>> FINISHED WRITING RESTART COORDINATES
>> WRITING VELOCITIES TO RESTART FILE AT STEP 23500
>>
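A minimal sketch, in Python, of how the point of the freeze could be pulled out of a NAMD log like the one above. It assumes the log is the plain stdout of namd2 (no quote markers) and that step and I/O lines look like the excerpt; the names `last_activity` and `SAMPLE` are made up for illustration and are not part of NAMD.

```python
import re

# Matches the start of a NAMD energy line, e.g. "ENERGY: 23500 0.0000 ..."
ENERGY_RE = re.compile(r"^ENERGY:\s+(\d+)")


def last_activity(log_text):
    """Return (last_energy_step, last_io_message) seen in the log text.

    Either element is None if no matching line was found.
    """
    last_step = None
    last_io = None
    for raw in log_text.splitlines():
        line = raw.strip()
        match = ENERGY_RE.match(line)
        if match:
            last_step = int(match.group(1))
        elif line.startswith("WRITING ") or line.startswith("FINISHED WRITING"):
            # Remember the most recent output message; after a hang this is
            # the write that never completed (or the last one that did).
            last_io = line
    return last_step, last_io


# Abbreviated copy of the log excerpt from the post.
SAMPLE = """\
ENERGY: 23500 0.0000 0.0000 0.0000
WRITING EXTENDED SYSTEM TO RESTART FILE AT STEP 23500
WRITING COORDINATES TO DCD FILE AT STEP 23500
WRITING COORDINATES TO RESTART FILE AT STEP 23500
FINISHED WRITING RESTART COORDINATES
WRITING VELOCITIES TO RESTART FILE AT STEP 23500
"""

print(last_activity(SAMPLE))
```

Running this over the output of several frozen jobs would show whether the last message is always a "WRITING ..." line with no matching "FINISHED" line.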
>> In this case, it is writing the velocity restart file, although I
>> have seen the problem occur while writing the DCD file as well. On the
>> exact same system, it has "died" anywhere between 15,000 and 85,000
>> steps. After the freeze occurs, I do a tcpdump on the node in
>> question and I get the following:
>>
>> 04:02:25.344307 .3 > .-1: (frag 2336:920_at_7400)
>> 04:02:25.344682 .3 > .-1: (frag 2337:920_at_7400)
>> 04:02:25.344686 .3 > .-1: (frag 2338:920_at_7400)
>> 04:02:25.344687 .3 > .-1: (frag 2339:920_at_7400)
>> 04:02:25.344688 .3 > .-1: (frag 2340:920_at_7400)
>> 04:02:25.344689 .3 > .-1: (frag 2341:920_at_7400)
>> 04:02:25.345077 .3 > .-1: (frag 2342:920_at_7400)
>> 04:02:25.345081 .3 > .-1: (frag 2343:920_at_7400)
>> 04:02:25.345082 .3 > .-1: (frag 2344:920_at_7400)
>> 04:02:25.345085 .3 > .-1: (frag 2345:920_at_7400)
>> 04:02:25.345088 .3 > .-1: (frag 2346:920_at_7400)
>> 04:02:25.345464 .3 > .-1: (frag 2347:920_at_7400)
>> 04:02:25.345467 .3 > .-1: (frag 2348:920_at_7400)
>> 04:02:25.345468 .3 > .-1: (frag 2349:920_at_7400)
>> 04:02:25.345469 .3 > .-1: (frag 2350:920_at_7400)
>> 04:02:25.345470 .3 > .-1: (frag 2351:920_at_7400)
>> 04:02:25.345867 .3 > .-1: (frag 2352:920_at_7400)
>> 04:02:25.345870 .3 > .-1: (frag 2353:920_at_7400)
>> 04:02:25.345872 .3 > .-1: (frag 2354:920_at_7400)
>> 04:02:25.345874 .3 > .-1: (frag 2355:920_at_7400)
>> 04:02:25.345877 .3 > .-1: (frag 2356:920_at_7400)
>> 04:02:25.346249 .3 > .-1: (frag 2357:920_at_7400)
>> 04:02:25.346253 .3 > .-1: (frag 2358:920_at_7400)
>> 04:02:25.346253 .3 > .-1: (frag 2359:920_at_7400)
>> 04:02:25.346254 .3 > .-1: (frag 2360:920_at_7400)
>> 04:02:25.346255 .3 > .-1: (frag 2361:920_at_7400)
>> 04:02:25.346645 .3 > .-1: (frag 2362:920_at_7400)
>> 04:02:25.346649 .3 > .-1: (frag 2363:920_at_7400)
>> 04:02:25.346651 .3 > .-1: (frag 2364:920_at_7400)
>> 04:02:25.346653 .3 > .-1: (frag 2365:920_at_7400)
>> 04:02:25.346655 .3 > .-1: (frag 2366:920_at_7400)
>> 04:02:25.347030 .3 > .-1: (frag 2367:920_at_7400)
>> 04:02:25.347034 .3 > .-1: (frag 2368:920_at_7400)
>> 04:02:25.347035 .3 > .-1: (frag 2369:920_at_7400)
>> 04:02:25.347036 .3 > .-1: (frag 2370:920_at_7400)
>>
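A minimal sketch, in Python, for summarizing a dump like the one above. It assumes the classic tcpdump fragment notation `(frag id:size@offset)`, with a trailing `+` when more fragments follow, and it also accepts the `_at_` that this archive substitutes for `@`. The names `parse_frags`, `summarize`, and `SAMPLE` are hypothetical helpers, not part of tcpdump.

```python
import re

# "(frag 2336:920@7400)" or, as rendered in this archive, "(frag 2336:920_at_7400)".
# A trailing "+" marks a fragment that is not the last one of its datagram.
FRAG_RE = re.compile(r"\(frag (\d+):(\d+)(?:@|_at_)(\d+)(\+?)\)")


def parse_frags(text):
    """Extract every fragment record found in tcpdump output text."""
    frags = []
    for m in FRAG_RE.finditer(text):
        size, offset = int(m.group(2)), int(m.group(3))
        frags.append({
            "id": int(m.group(1)),
            "size": size,
            "offset": offset,
            "more": m.group(4) == "+",
            "end": offset + size,  # byte position where this fragment ends
        })
    return frags


def summarize(frags):
    """Condense a list of fragment records into a few quick facts."""
    ids = [f["id"] for f in frags]
    return {
        "count": len(frags),
        "distinct_ids": len(set(ids)),
        "consecutive_ids": all(b - a == 1 for a, b in zip(ids, ids[1:])),
        "datagram_ends": sorted({f["end"] for f in frags}),
        "all_final": all(not f["more"] for f in frags),
    }


# First three lines of the dump from the post.
SAMPLE = """\
04:02:25.344307 .3 > .-1: (frag 2336:920_at_7400)
04:02:25.344682 .3 > .-1: (frag 2337:920_at_7400)
04:02:25.344686 .3 > .-1: (frag 2338:920_at_7400)
"""

print(summarize(parse_frags(SAMPLE)))
```

Under that reading of the notation, every captured line is the final 920-byte fragment of a different datagram, each ending at byte 8320 (7400 + 920), with the IP IDs increasing by one.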
>> I have tried using both the "net-linux scyld" and the "net-linux
>> scyld tcp" versions of Charm++ and NAMD, both of which yield the same
>> results. I have been running it in the following manner:
>> charmrun ++p 2 ++skipmaster ++verbose ++startpe 4 ++endpe 4 ++ppn 2
>> namd2 Simulation.namd >& Simulation.namd.out
>>
>> I get the exact same problem on both of the diskless nodes, whether
>> running on 1 processor or 2. Also, I have not had any problems
>> running other programs on these nodes (small Gaussian03 jobs,
>> Autodock3, etc.).
>>
>> Any thoughts on the subject would be greatly appreciated.
>> Thanks,
>> Eric
>>
>>
>> --
>> ********************************************************************
>> Eric R A Johnson
>> University Of Minnesota tel: (612) 529 0699
>> Dept. of Laboratory Medicine & Pathology
>> 7-230 BSBE e-mail: john4482_at_umn.edu
>> 312 Church Street web: www.eric-r-johnson.com
>> Minneapolis, MN 55455
>> USA
>
>
> Oscar Moran Via DeMarini, 6
> Istituto di Biofisica I-16149
> Genova, Italy
> Consiglio Nazionale delle Ricerche Tel.
> +39-0106475558
> PLEASE NOTE THAT MY EMAIL ADDRESS HAS CHANGED TO: moran_at_ge.ibf.cnr.it
>
>

-- 
********************************************************************
  Eric R A Johnson
  University Of Minnesota                      tel: (612) 529 0699
  Dept. of Laboratory Medicine & Pathology   
  7-230 BSBE                              e-mail: john4482_at_umn.edu
  312 Church Street                    web: www.eric-r-johnson.com
  Minneapolis, MN 55455    
  USA                              
