Re: NAMD "freezes" up (further clarification)

From: Eric R Johnson (john4482_at_umn.edu)
Date: Tue May 11 2004 - 11:10:44 CDT

Brian,
I have run NAMD on a single machine (on a single processor, in fact) and get
the same problem on both new nodes. The on-board LAN is 100 Mb/s. If no one
has seen a software-related cause for this (or problems with the on-board
LAN of the Tyan Tiger S2466N-4M motherboard), then I guess I will have to
start "swapping parts" until I find the answer.
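
Before I start pulling hardware, the checks I plan to run on the on-board
NIC look roughly like the following (a rough sketch; the interface name and
the mount point of the master-node RAID are assumptions on my part):

  # Confirm the negotiated speed/duplex on the on-board LAN
  # (assuming it shows up as eth0 on the diskless node)
  /sbin/ethtool eth0

  # Check for RX/TX errors or dropped frames after a stall
  /sbin/ifconfig eth0

  # Try to reproduce the hang outside NAMD by writing a large file
  # over the same link to the RAID on the master node
  # (replace /home/scratch with the actual NFS mount point)
  dd if=/dev/zero of=/home/scratch/ddtest.bin bs=1024k count=1024

If the dd also stalls, that would point at the on-board LAN or the NFS path
rather than at NAMD itself.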
Thanks,
Eric

Brian Bennion wrote:

>Hello Eric,
>
>Have you tried running NAMD on just one of the new nodes?
>The onboard LAN is gigabit, right?
>If you feel confident, you might try swapping the Intel card out of one
>of the older machines, putting it into the new node, and running a
>single job....
>
>Regards
>Brian
>
>
>On Tue, 11 May 2004, Eric R Johnson wrote:
>
>
>
>>I would like to add the following information to the question below.
>>After noting the problems listed below, I ran cpuburn-in on the new nodes
>>for approximately a day with no problems. I also ran Memtest86 for at
>>least half a day, again with no problems.
>>Thanks,
>>Eric
>>
>>Eric R Johnson wrote:
>>
>>
>>
>>>Hello,
>>>
>>>I have a problem that I am hoping someone on the list has seen
>>>before. I am currently running NAMD 2.5 (which I compiled myself
>>>using gcc) on a Scyld (Version 28) cluster with dual Athlon MP 2600+
>>>processors on Tyan Tiger S2466 motherboards using 2GB of memory
>>>(registered ECC). Originally, the cluster had 4 nodes with gigabit
>>>ethernet (Intel cards and Linksys switch) and local hard drives. On
>>>these nodes, NAMD has been running perfectly. I recently added 2 more
>>>nodes, which are identical to the original ones (plugged into the same
>>>switch), except that they are diskless and I am using the on-board
>>>LAN. If I run a NAMD job on the new nodes, the jobs will randomly
>>>"freeze up". This always occurs when NAMD is attempting to write a
>>>file to the RAID located on the master node. For example,
>>>
>>>ENERGY: 23400  0.0000  0.0000  0.0000  0.0000  -181725.1627  15283.0043  0.0000  0.0000  19106.0771  -147336.0813  250.5568  -147295.2153  -147296.2616  252.0297  -126.2688  -94.8730  402402.0000  70.7268  70.7469
>>>
>>>ENERGY: 23500  0.0000  0.0000  0.0000  0.0000  -181764.1586  15145.8883  0.0000  0.0000  19280.5044  -147337.7658  252.8442  -147296.9578  -147297.6876  251.5467  -378.1111  -348.5406  402402.0000  -90.2697  -90.2879
>>>
>>>WRITING EXTENDED SYSTEM TO RESTART FILE AT STEP 23500
>>>WRITING COORDINATES TO DCD FILE AT STEP 23500
>>>WRITING COORDINATES TO RESTART FILE AT STEP 23500
>>>FINISHED WRITING RESTART COORDINATES
>>>WRITING VELOCITIES TO RESTART FILE AT STEP 23500
>>>
>>>In this case, it froze while writing the velocity restart file, although
>>>I have seen the problem occur during the DCD write as well. On the exact
>>>same system, it has "died" anywhere between 15,000 and 85,000 steps into
>>>the run. After the freeze occurs, I run a tcpdump on the node in question
>>>and get the following:
>>>
>>>04:02:25.344307 .3 > .-1: (frag 2336:920_at_7400)
>>>04:02:25.344682 .3 > .-1: (frag 2337:920_at_7400)
>>>04:02:25.344686 .3 > .-1: (frag 2338:920_at_7400)
>>>04:02:25.344687 .3 > .-1: (frag 2339:920_at_7400)
>>>04:02:25.344688 .3 > .-1: (frag 2340:920_at_7400)
>>>04:02:25.344689 .3 > .-1: (frag 2341:920_at_7400)
>>>04:02:25.345077 .3 > .-1: (frag 2342:920_at_7400)
>>>04:02:25.345081 .3 > .-1: (frag 2343:920_at_7400)
>>>04:02:25.345082 .3 > .-1: (frag 2344:920_at_7400)
>>>04:02:25.345085 .3 > .-1: (frag 2345:920_at_7400)
>>>04:02:25.345088 .3 > .-1: (frag 2346:920_at_7400)
>>>04:02:25.345464 .3 > .-1: (frag 2347:920_at_7400)
>>>04:02:25.345467 .3 > .-1: (frag 2348:920_at_7400)
>>>04:02:25.345468 .3 > .-1: (frag 2349:920_at_7400)
>>>04:02:25.345469 .3 > .-1: (frag 2350:920_at_7400)
>>>04:02:25.345470 .3 > .-1: (frag 2351:920_at_7400)
>>>04:02:25.345867 .3 > .-1: (frag 2352:920_at_7400)
>>>04:02:25.345870 .3 > .-1: (frag 2353:920_at_7400)
>>>04:02:25.345872 .3 > .-1: (frag 2354:920_at_7400)
>>>04:02:25.345874 .3 > .-1: (frag 2355:920_at_7400)
>>>04:02:25.345877 .3 > .-1: (frag 2356:920_at_7400)
>>>04:02:25.346249 .3 > .-1: (frag 2357:920_at_7400)
>>>04:02:25.346253 .3 > .-1: (frag 2358:920_at_7400)
>>>04:02:25.346253 .3 > .-1: (frag 2359:920_at_7400)
>>>04:02:25.346254 .3 > .-1: (frag 2360:920_at_7400)
>>>04:02:25.346255 .3 > .-1: (frag 2361:920_at_7400)
>>>04:02:25.346645 .3 > .-1: (frag 2362:920_at_7400)
>>>04:02:25.346649 .3 > .-1: (frag 2363:920_at_7400)
>>>04:02:25.346651 .3 > .-1: (frag 2364:920_at_7400)
>>>04:02:25.346653 .3 > .-1: (frag 2365:920_at_7400)
>>>04:02:25.346655 .3 > .-1: (frag 2366:920_at_7400)
>>>04:02:25.347030 .3 > .-1: (frag 2367:920_at_7400)
>>>04:02:25.347034 .3 > .-1: (frag 2368:920_at_7400)
>>>04:02:25.347035 .3 > .-1: (frag 2369:920_at_7400)
>>>04:02:25.347036 .3 > .-1: (frag 2370:920_at_7400)
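>>>
>>>Capturing on the master-node side as well should show whether these
>>>fragments ever arrive there; something roughly like the following
>>>(the interface name and node address are placeholders, not the real
>>>values) limits the capture to just the IP fragments:
>>>
>>>  # on the master node: watch only fragments to/from the diskless node
>>>  tcpdump -n -i eth0 'host <diskless-node-ip> and (ip[6:2] & 0x3fff != 0)'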
>>>
>>>I have tried using both the "net-linux scyld" and the "net-linux scyld
>>>tcp" versions of Charm++ and NAMD, both of which yield the same
>>>results. I have been running it in the following manner:
>>>charmrun ++p 2 ++skipmaster ++verbose ++startpe 4 ++endpe 4 ++ppn 2
>>>namd2 Simulation.namd >& Simulation.namd.out
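>>>
>>>For comparison, the same job pinned to one of the original disk-equipped
>>>nodes is run the same way, only with a different node range (this assumes
>>>++startpe/++endpe take the Scyld node number, with the old nodes numbered
>>>0 through 3):
>>>
>>>charmrun ++p 2 ++skipmaster ++verbose ++startpe 0 ++endpe 0 ++ppn 2
>>>namd2 Simulation.namd >& Simulation.namd.out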
>>>
>>>I get the exact same problem on both of the diskless nodes, whether
>>>running on 1 processor or 2. Also, I have not had any problems
>>>running other programs on these nodes (small Gaussian03 jobs,
>>>Autodock3, etc.).
>>>
>>>Any thoughts on the subject would be greatly appreciated.
>>>Thanks,
>>>Eric
>>>
>>>
>
>*****************************************************************
>**Brian Bennion, Ph.D. **
>**Computational and Systems Biology Division **
>**Biology and Biotechnology Research Program **
>**Lawrence Livermore National Laboratory **
>**P.O. Box 808, L-448 bennion1_at_llnl.gov **
>**7000 East Avenue phone: (925) 422-5722 **
>**Livermore, CA 94550 fax: (925) 424-6605 **
>*****************************************************************
>
>
>
>

-- 
********************************************************************
  Eric R A Johnson
  University Of Minnesota                      tel: (612) 529 0699
  Dept. of Laboratory Medicine & Pathology   
  7-230 BSBE                              e-mail: john4482_at_umn.edu
  312 Church Street                    web: www.eric-r-johnson.com
  Minneapolis, MN 55455    
  USA                              
