Re: NAMD "freezes" up (further clarification)

From: Eric R Johnson (bioinformaticist@mn.rr.com)
Date: Tue May 11 2004 - 12:34:32 CDT

Jerry,

Jerry Ebalunode wrote:

>I noticed that you are using
>this command to run NAMD:
> charmrun ++p 2 ++skipmaster ++verbose ++startpe 4 ++endpe 4 ++ppn 2 ......
>From experience, I have had higher success rates completing NAMD jobs on
>clusters with two procs/node when using only one processor per node for
>NAMD. You seem to be requesting two processes per node. Have you been
>able to duplicate this problem using 1 processor per node?
>
>
In this particular case, I used two processors per node, but running it on a
single processor, like so:

charmrun ++p 1 ++skipmaster ++verbose ++startpe 4 ++endpe 4 ++ppn 1
namd2 Simulation.namd >& Simulation.namd.out

yields the same results.

>I see you run autodock3 on your cluster. Is it by chance run in parallel, or
>just serial?
>
This is serial, but as I wrote above, I see the same problem running NAMD
on just a single processor.
Thanks,
Eric

>On Tuesday 11 May 2004 10:10 am, you wrote:
>
>
>>Hello Eric,
>>
>>Have you tried running NAMD on just one of the new nodes?
>>The onboard LAN is gigabit, right?
>>If you feel confident, you might try swapping out one of the Intel cards from
>>one of the older machines, putting it into the new node, and running a
>>single job....
>>
>>Regards
>>Brian
>>
>>On Tue, 11 May 2004, Eric R Johnson wrote:
>>
>>
>>>I would like to add the following information to the question below.
>>>After noting the problems listed below, I ran cpuburn-in on the new nodes
>>>for approximately one day with no problems. I also ran Memtest86 on them
>>>for at least half a day, again with no problems.
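>>>
>>>(For anyone wanting to repeat the hardware checks on a Scyld cluster: a
>>>stress test can be started on a compute node with bpsh, along these
>>>lines, where the node number and binary path are assumptions:
>>>
>>>bpsh 4 ./cpuburn-in &
>>>
>>>Memtest86 is booted standalone from its own floppy/CD image rather than
>>>run from Linux.)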
>>>Thanks,
>>>Eric
>>>
>>>Eric R Johnson wrote:
>>>
>>>
>>>>Hello,
>>>>
>>>>I have a problem that I am hoping someone on the list has seen
>>>>before. I am currently running NAMD 2.5 (which I compiled myself
>>>>using gcc) on a Scyld (Version 28) cluster with dual Athlon MP 2600+
>>>>processors on Tyan Tiger S2466 motherboards using 2GB of memory
>>>>(registered ECC). Originally, the cluster had 4 nodes with gigabit
>>>>Ethernet (Intel cards and a Linksys switch) and local hard drives. On
>>>>these nodes, NAMD has been running perfectly. I recently added 2 more
>>>>nodes, which are identical to the original ones (plugged into the same
>>>>switch), except that they are diskless and I am using the on-board
>>>>LAN. If I run a NAMD job on the new nodes, the jobs will randomly
>>>>"freeze up". This always occurs when NAMD is attempting to write a
>>>>file to the RAID located on the master node. For example,
>>>>
>>>>ENERGY: 23400 0.0000 0.0000 0.0000 0.0000 -181725.1627 15283.0043 0.0000 0.0000 19106.0771 -147336.0813 250.5568 -147295.2153 -147296.2616 252.0297 -126.2688 -94.8730 402402.0000 70.7268 70.7469
>>>>
>>>>ENERGY: 23500 0.0000 0.0000 0.0000 0.0000 -181764.1586 15145.8883 0.0000 0.0000 19280.5044 -147337.7658 252.8442 -147296.9578 -147297.6876 251.5467 -378.1111 -348.5406 402402.0000 -90.2697 -90.2879
>>>>
>>>>WRITING EXTENDED SYSTEM TO RESTART FILE AT STEP 23500
>>>>WRITING COORDINATES TO DCD FILE AT STEP 23500
>>>>WRITING COORDINATES TO RESTART FILE AT STEP 23500
>>>>FINISHED WRITING RESTART COORDINATES
>>>>WRITING VELOCITIES TO RESTART FILE AT STEP 23500
>>>>
>>>>In this case, it froze while writing the velocity restart file, although
>>>>I have seen the problem occur during the DCD write as well. On the exact
>>>>same system, it has "died" anywhere between steps 15,000 and 85,000.
>>>>After the freeze occurs, I run tcpdump on the node in question and get
>>>>the following:
>>>>
>>>>04:02:25.344307 .3 > .-1: (frag 2336:920@7400)
>>>>04:02:25.344682 .3 > .-1: (frag 2337:920@7400)
>>>>04:02:25.344686 .3 > .-1: (frag 2338:920@7400)
>>>>04:02:25.344687 .3 > .-1: (frag 2339:920@7400)
>>>>04:02:25.344688 .3 > .-1: (frag 2340:920@7400)
>>>>04:02:25.344689 .3 > .-1: (frag 2341:920@7400)
>>>>04:02:25.345077 .3 > .-1: (frag 2342:920@7400)
>>>>04:02:25.345081 .3 > .-1: (frag 2343:920@7400)
>>>>04:02:25.345082 .3 > .-1: (frag 2344:920@7400)
>>>>04:02:25.345085 .3 > .-1: (frag 2345:920@7400)
>>>>04:02:25.345088 .3 > .-1: (frag 2346:920@7400)
>>>>04:02:25.345464 .3 > .-1: (frag 2347:920@7400)
>>>>04:02:25.345467 .3 > .-1: (frag 2348:920@7400)
>>>>04:02:25.345468 .3 > .-1: (frag 2349:920@7400)
>>>>04:02:25.345469 .3 > .-1: (frag 2350:920@7400)
>>>>04:02:25.345470 .3 > .-1: (frag 2351:920@7400)
>>>>04:02:25.345867 .3 > .-1: (frag 2352:920@7400)
>>>>04:02:25.345870 .3 > .-1: (frag 2353:920@7400)
>>>>04:02:25.345872 .3 > .-1: (frag 2354:920@7400)
>>>>04:02:25.345874 .3 > .-1: (frag 2355:920@7400)
>>>>04:02:25.345877 .3 > .-1: (frag 2356:920@7400)
>>>>04:02:25.346249 .3 > .-1: (frag 2357:920@7400)
>>>>04:02:25.346253 .3 > .-1: (frag 2358:920@7400)
>>>>04:02:25.346253 .3 > .-1: (frag 2359:920@7400)
>>>>04:02:25.346254 .3 > .-1: (frag 2360:920@7400)
>>>>04:02:25.346255 .3 > .-1: (frag 2361:920@7400)
>>>>04:02:25.346645 .3 > .-1: (frag 2362:920@7400)
>>>>04:02:25.346649 .3 > .-1: (frag 2363:920@7400)
>>>>04:02:25.346651 .3 > .-1: (frag 2364:920@7400)
>>>>04:02:25.346653 .3 > .-1: (frag 2365:920@7400)
>>>>04:02:25.346655 .3 > .-1: (frag 2366:920@7400)
>>>>04:02:25.347030 .3 > .-1: (frag 2367:920@7400)
>>>>04:02:25.347034 .3 > .-1: (frag 2368:920@7400)
>>>>04:02:25.347035 .3 > .-1: (frag 2369:920@7400)
>>>>04:02:25.347036 .3 > .-1: (frag 2370:920@7400)
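>>>>
>>>>(The capture above was a plain tcpdump run on the node in question;
>>>>something along these lines, where the interface name and the use of
>>>>bpsh are assumptions:
>>>>
>>>>bpsh 4 tcpdump -n -i eth0
>>>>)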
>>>>
>>>>I have tried using both the "net-linux scyld" and the "net-linux scyld
>>>>tcp" versions of Charm++ and NAMD, both of which yield the same
>>>>results. I have been running it in the following manner:
>>>>charmrun ++p 2 ++skipmaster ++verbose ++startpe 4 ++endpe 4 ++ppn 2
>>>>namd2 Simulation.namd >& Simulation.namd.out
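>>>>
>>>>(For completeness: the two Charm++ versions correspond to build
>>>>invocations roughly like the following; the exact options passed may
>>>>have differed:
>>>>
>>>>./build charm++ net-linux scyld
>>>>./build charm++ net-linux scyld tcp
>>>>
>>>>with NAMD then compiled against each in turn.)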
>>>>
>>>>I get the exact same problem on both of the diskless nodes, whether
>>>>running on 1 processor or 2. Also, I have not had any problems
>>>>running other programs on these nodes (small Gaussian03 jobs,
>>>>Autodock3, etc.).
>>>>
>>>>Any thoughts on the subject would be greatly appreciated.
>>>>Thanks,
>>>>Eric
>>>>
>>>
>>*****************************************************************
>>**Brian Bennion, Ph.D. **
>>**Computational and Systems Biology Division **
>>**Biology and Biotechnology Research Program **
>>**Lawrence Livermore National Laboratory **
>>**P.O. Box 808, L-448 bennion1_at_llnl.gov **
>>**7000 East Avenue phone: (925) 422-5722 **
>>**Livermore, CA 94550 fax: (925) 424-6605 **
>>*****************************************************************
>>
>

-- 
********************************************************************
  Eric R A Johnson
  University Of Minnesota                      tel: (612) 529 0699
  Dept. of Laboratory Medicine & Pathology   
  7-230 BSBE                              e-mail: john4482@umn.edu
  312 Church Street                    web: www.eric-r-johnson.com
  Minneapolis, MN 55455    
  USA                              
