Re: NAMD "freezes" up (further clarification)

From: Brian Bennion (brian_at_youkai.llnl.gov)
Date: Tue May 11 2004 - 10:10:46 CDT

Hello Eric,

Have you tried running NAMD on just one of the new nodes?
The onboard LAN is gigabit, right?
If you feel confident, you might try swapping an Intel card out of one
of the older machines, putting it into a new node, and running a
single job...
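
For comparison, you could also pin the same job to one of the original
nodes and see whether it survives there. Reusing the charmrun line from
your message below (and assuming the old nodes are pe 0-3 under Scyld's
numbering, so adjust ++startpe/++endpe to match your setup):

charmrun ++p 2 ++skipmaster ++verbose ++startpe 0 ++endpe 0 ++ppn 2 \
    namd2 Simulation.namd >& Simulation.namd.out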

Regards
Brian

On Tue, 11 May 2004, Eric R Johnson wrote:

> I would like to add the following information to the question below.
> After noting the problems listed below, I ran cpuburn-in on the new
> nodes for approximately 1 day with no problems. I also ran Memtest86
> on them for at least half a day, again with no problems.
> Thanks,
> Eric
>
> Eric R Johnson wrote:
>
> > Hello,
> >
> > I have a problem that I am hoping someone on the list has seen
> > before. I am currently running NAMD 2.5 (which I compiled myself
> > using gcc) on a Scyld (Version 28) cluster with dual Athlon MP 2600+
> > processors on Tyan Tiger S2466 motherboards using 2GB of memory
> > (registered ECC). Originally, the cluster had 4 nodes with gigabit
> > ethernet (Intel cards and Linksys switch) and local hard drives. On
> > these nodes, NAMD has been running perfectly. I recently added 2 more
> > nodes, which are identical to the original ones (plugged into the same
> > switch), except that they are diskless and I am using the on-board
> > LAN. If I run a NAMD job on the new nodes, the job will randomly
> > "freeze up". This always occurs when NAMD is attempting to write a
> > file to the RAID located on the master node. For example,
> >
> > ENERGY: 23400 0.0000 0.0000 0.0000 0.0000 -181725.1627 15283.0043 0.0000 0.0000 19106.0771 -147336.0813 250.5568 -147295.2153 -147296.2616 252.0297 -126.2688 -94.8730 402402.0000 70.7268 70.7469
> >
> > ENERGY: 23500 0.0000 0.0000 0.0000 0.0000 -181764.1586 15145.8883 0.0000 0.0000 19280.5044 -147337.7658 252.8442 -147296.9578 -147297.6876 251.5467 -378.1111 -348.5406 402402.0000 -90.2697 -90.2879
> >
> > WRITING EXTENDED SYSTEM TO RESTART FILE AT STEP 23500
> > WRITING COORDINATES TO DCD FILE AT STEP 23500
> > WRITING COORDINATES TO RESTART FILE AT STEP 23500
> > FINISHED WRITING RESTART COORDINATES
> > WRITING VELOCITIES TO RESTART FILE AT STEP 23500
> >
> > In this case, it froze while writing the velocity restart file,
> > although I have seen the problem occur during the DCD write as well.
> > On the exact same system, it has "died" anywhere between 15,000 and
> > 85,000 steps. After a freeze occurs, I run tcpdump on the node in
> > question and see the following:
> >
> > 04:02:25.344307 .3 > .-1: (frag 2336:920_at_7400)
> > 04:02:25.344682 .3 > .-1: (frag 2337:920_at_7400)
> > 04:02:25.344686 .3 > .-1: (frag 2338:920_at_7400)
> > 04:02:25.344687 .3 > .-1: (frag 2339:920_at_7400)
> > 04:02:25.344688 .3 > .-1: (frag 2340:920_at_7400)
> > 04:02:25.344689 .3 > .-1: (frag 2341:920_at_7400)
> > 04:02:25.345077 .3 > .-1: (frag 2342:920_at_7400)
> > 04:02:25.345081 .3 > .-1: (frag 2343:920_at_7400)
> > 04:02:25.345082 .3 > .-1: (frag 2344:920_at_7400)
> > 04:02:25.345085 .3 > .-1: (frag 2345:920_at_7400)
> > 04:02:25.345088 .3 > .-1: (frag 2346:920_at_7400)
> > 04:02:25.345464 .3 > .-1: (frag 2347:920_at_7400)
> > 04:02:25.345467 .3 > .-1: (frag 2348:920_at_7400)
> > 04:02:25.345468 .3 > .-1: (frag 2349:920_at_7400)
> > 04:02:25.345469 .3 > .-1: (frag 2350:920_at_7400)
> > 04:02:25.345470 .3 > .-1: (frag 2351:920_at_7400)
> > 04:02:25.345867 .3 > .-1: (frag 2352:920_at_7400)
> > 04:02:25.345870 .3 > .-1: (frag 2353:920_at_7400)
> > 04:02:25.345872 .3 > .-1: (frag 2354:920_at_7400)
> > 04:02:25.345874 .3 > .-1: (frag 2355:920_at_7400)
> > 04:02:25.345877 .3 > .-1: (frag 2356:920_at_7400)
> > 04:02:25.346249 .3 > .-1: (frag 2357:920_at_7400)
> > 04:02:25.346253 .3 > .-1: (frag 2358:920_at_7400)
> > 04:02:25.346253 .3 > .-1: (frag 2359:920_at_7400)
> > 04:02:25.346254 .3 > .-1: (frag 2360:920_at_7400)
> > 04:02:25.346255 .3 > .-1: (frag 2361:920_at_7400)
> > 04:02:25.346645 .3 > .-1: (frag 2362:920_at_7400)
> > 04:02:25.346649 .3 > .-1: (frag 2363:920_at_7400)
> > 04:02:25.346651 .3 > .-1: (frag 2364:920_at_7400)
> > 04:02:25.346653 .3 > .-1: (frag 2365:920_at_7400)
> > 04:02:25.346655 .3 > .-1: (frag 2366:920_at_7400)
> > 04:02:25.347030 .3 > .-1: (frag 2367:920_at_7400)
> > 04:02:25.347034 .3 > .-1: (frag 2368:920_at_7400)
> > 04:02:25.347035 .3 > .-1: (frag 2369:920_at_7400)
> > 04:02:25.347036 .3 > .-1: (frag 2370:920_at_7400)
> >
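> > For reference, a capture restricted to fragmented IP traffic like the
> > above can be reproduced with something along these lines (the
> > interface name is a placeholder; the filter matches any packet with
> > the MF flag set or a non-zero fragment offset):
> >
> > tcpdump -n -i eth0 'ip[6:2] & 0x3fff != 0'
> >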
> > I have tried using both the "net-linux scyld" and the "net-linux scyld
> > tcp" versions of Charm++ and NAMD, both of which yield the same
> > results. I have been running it in the following manner:
> > charmrun ++p 2 ++skipmaster ++verbose ++startpe 4 ++endpe 4 ++ppn 2
> > namd2 Simulation.namd >& Simulation.namd.out
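> >
> > (My reading of those flags, with the Scyld-specific ones a best
> > guess: ++p 2 asks for two processors in total, ++ppn 2 for two
> > processes per node, ++skipmaster keeps work off the master node, and
> > ++startpe 4 ++endpe 4 restricts the job to node 4, one of the new
> > diskless nodes.)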
> >
> > I get the exact same problem on both of the diskless nodes, whether
> > running on 1 processor or 2. Also, I have not had any problems
> > running other programs on these nodes (small Gaussian03 jobs,
> > Autodock3, etc.).
> >
> > Any thoughts on the subject would be greatly appreciated.
> > Thanks,
> > Eric
> >
> >
>
> --
> ********************************************************************
> Eric R A Johnson
> University Of Minnesota tel: (612) 529 0699
> Dept. of Laboratory Medicine & Pathology
> 7-230 BSBE e-mail: john4482_at_umn.edu
> 312 Church Street web: www.eric-r-johnson.com
> Minneapolis, MN 55455
> USA
>
>

*****************************************************************
**Brian Bennion, Ph.D. **
**Computational and Systems Biology Division **
**Biology and Biotechnology Research Program **
**Lawrence Livermore National Laboratory **
**P.O. Box 808, L-448 bennion1_at_llnl.gov **
**7000 East Avenue phone: (925) 422-5722 **
**Livermore, CA 94550 fax: (925) 424-6605 **
*****************************************************************
