From: Brian Bennion (brian@youkai.llnl.gov)
Date: Tue May 11 2004 - 10:10:46 CDT
Hello Eric,
Have you tried running NAMD on just one of the new nodes?
The onboard LAN is gigabit, right?
If you feel confident, you might try swapping an Intel card out of one
of the older machines, putting it into the new node, and running a
single job....
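Also, a note on the tcpdump output you posted: tcpdump prints IP fragments as (frag id:size@offset). A throwaway parse of a few of those lines (sketch below; the parsing and variable names are mine, nothing NAMD-specific) shows consecutive IP IDs, each carrying a 920-byte fragment at offset 7400 -- i.e. a steady stream of large, fragmented UDP datagrams, which would point at the NIC/driver handling of fragmentation rather than at NAMD itself.

```python
import re

# A few lines from the posted trace (tcpdump fragment notation: id:size@offset).
trace = """\
04:02:25.344307 .3 > .-1: (frag 2336:920@7400)
04:02:25.344682 .3 > .-1: (frag 2337:920@7400)
04:02:25.344686 .3 > .-1: (frag 2338:920@7400)
"""

# Pull out (ip_id, fragment_size, fragment_offset) from each line.
frags = [tuple(map(int, m.groups()))
         for m in re.finditer(r"frag (\d+):(\d+)@(\d+)", trace)]

ids = [ip_id for ip_id, _, _ in frags]
# Consecutive IP IDs: each line is a fragment of a *different* datagram.
assert ids == list(range(ids[0], ids[0] + len(ids)))
# Identical size and offset every time: same-shaped large datagram, over and over.
assert all(size == 920 and off == 7400 for _, size, off in frags)
print("consecutive IP IDs, constant 920-byte fragment at offset 7400")
```

If the onboard NIC (or its driver) mishandles that fragment stream, the write to the master's RAID would stall exactly the way Eric describes.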
Regards
Brian
On Tue, 11 May 2004, Eric R Johnson wrote:
> I would like to add the following information to the question below.
> After noting the problems listed below, I ran cpuburn-in on the new nodes
> for approximately 1 day with no problems.  I also ran Memtest86 on them
> for at least half a day, again with no problems.
> Thanks,
> Eric
>
> Eric R Johnson wrote:
>
> > Hello,
> >
> > I have a problem that I am hoping someone on the list has seen
> > before.  I am currently running NAMD 2.5 (which I compiled myself
> > using gcc) on a Scyld (Version 28) cluster with dual Athlon MP 2600+
> > processors on Tyan Tiger S2466 motherboards using 2GB of memory
> > (registered ECC).  Originally, the cluster had 4 nodes with gigabit
> > ethernet (Intel cards and Linksys switch) and local hard drives.  On
> > these nodes, NAMD has been running perfectly.  I recently added 2 more
> > nodes, which are identical to the original ones (plugged into the same
> > switch), except that they are diskless and I am using the on-board
> > LAN.  If I run a NAMD job on the new nodes, the jobs will randomly
> > "freeze up".  This always occurs when NAMD is attempting to write a
> > file to the RAID located on the master node.  For example,
> >
> > ENERGY:   23400         0.0000         0.0000         0.0000         0.0000        -181725.1627     15283.0043         0.0000         0.0000     19106.0771        -147336.0813       250.5568   -147295.2153   -147296.2616       252.0297           -126.2688       -94.8730    402402.0000        70.7268        70.7469
> >
> > ENERGY:   23500         0.0000         0.0000         0.0000         0.0000        -181764.1586     15145.8883         0.0000         0.0000     19280.5044        -147337.7658       252.8442   -147296.9578   -147297.6876       251.5467           -378.1111      -348.5406    402402.0000       -90.2697       -90.2879
> >
> > WRITING EXTENDED SYSTEM TO RESTART FILE AT STEP 23500
> > WRITING COORDINATES TO DCD FILE AT STEP 23500
> > WRITING COORDINATES TO RESTART FILE AT STEP 23500
> > FINISHED WRITING RESTART COORDINATES
> > WRITING VELOCITIES TO RESTART FILE AT STEP 23500
> >
> > In this case it froze while writing the velocity restart file, although I
> > have seen the problem occur during the DCD write as well.  On the exact
> > same system, it has "died" anywhere between 15,000 and 85,000 steps.
> > After the freeze occurs, running tcpdump on the node in question gives
> > the following:
> >
> > 04:02:25.344307 .3 > .-1: (frag 2336:920@7400)
> > 04:02:25.344682 .3 > .-1: (frag 2337:920@7400)
> > 04:02:25.344686 .3 > .-1: (frag 2338:920@7400)
> > 04:02:25.344687 .3 > .-1: (frag 2339:920@7400)
> > 04:02:25.344688 .3 > .-1: (frag 2340:920@7400)
> > 04:02:25.344689 .3 > .-1: (frag 2341:920@7400)
> > 04:02:25.345077 .3 > .-1: (frag 2342:920@7400)
> > 04:02:25.345081 .3 > .-1: (frag 2343:920@7400)
> > 04:02:25.345082 .3 > .-1: (frag 2344:920@7400)
> > 04:02:25.345085 .3 > .-1: (frag 2345:920@7400)
> > 04:02:25.345088 .3 > .-1: (frag 2346:920@7400)
> > 04:02:25.345464 .3 > .-1: (frag 2347:920@7400)
> > 04:02:25.345467 .3 > .-1: (frag 2348:920@7400)
> > 04:02:25.345468 .3 > .-1: (frag 2349:920@7400)
> > 04:02:25.345469 .3 > .-1: (frag 2350:920@7400)
> > 04:02:25.345470 .3 > .-1: (frag 2351:920@7400)
> > 04:02:25.345867 .3 > .-1: (frag 2352:920@7400)
> > 04:02:25.345870 .3 > .-1: (frag 2353:920@7400)
> > 04:02:25.345872 .3 > .-1: (frag 2354:920@7400)
> > 04:02:25.345874 .3 > .-1: (frag 2355:920@7400)
> > 04:02:25.345877 .3 > .-1: (frag 2356:920@7400)
> > 04:02:25.346249 .3 > .-1: (frag 2357:920@7400)
> > 04:02:25.346253 .3 > .-1: (frag 2358:920@7400)
> > 04:02:25.346253 .3 > .-1: (frag 2359:920@7400)
> > 04:02:25.346254 .3 > .-1: (frag 2360:920@7400)
> > 04:02:25.346255 .3 > .-1: (frag 2361:920@7400)
> > 04:02:25.346645 .3 > .-1: (frag 2362:920@7400)
> > 04:02:25.346649 .3 > .-1: (frag 2363:920@7400)
> > 04:02:25.346651 .3 > .-1: (frag 2364:920@7400)
> > 04:02:25.346653 .3 > .-1: (frag 2365:920@7400)
> > 04:02:25.346655 .3 > .-1: (frag 2366:920@7400)
> > 04:02:25.347030 .3 > .-1: (frag 2367:920@7400)
> > 04:02:25.347034 .3 > .-1: (frag 2368:920@7400)
> > 04:02:25.347035 .3 > .-1: (frag 2369:920@7400)
> > 04:02:25.347036 .3 > .-1: (frag 2370:920@7400)
> >
> > I have tried using both the "net-linux scyld" and the "net-linux scyld
> > tcp" builds of Charm++ and NAMD, both of which yield the same
> > results.  I have been running it in the following manner:
> > charmrun ++p 2 ++skipmaster ++verbose ++startpe 4 ++endpe 4 ++ppn 2
> > namd2 Simulation.namd >& Simulation.namd.out
> >
> > I get the exact same problem on both of the diskless nodes, whether
> > running on 1 processor or 2.  Also, I have not had any problems
> > running other programs on these nodes (small Gaussian03 jobs,
> > Autodock3, etc.).
> >
> > Any thoughts on the subject would be greatly appreciated.
> > Thanks,
> > Eric
> >
> >
> >--
> >********************************************************************
> >  Eric R A Johnson
> >  University Of Minnesota                      tel: (612) 529 0699
> >  Dept. of Laboratory Medicine & Pathology
> >  7-230 BSBE                              e-mail: john4482@umn.edu
> >  312 Church Street                    web: www.eric-r-johnson.com
> >  Minneapolis, MN 55455
> >  USA
> >
*****************************************************************
**Brian Bennion, Ph.D.                                         **
**Computational and Systems Biology Division                   **
**Biology and Biotechnology Research Program                   **
**Lawrence Livermore National Laboratory                       **
**P.O. Box 808, L-448    bennion1@llnl.gov                     **
**7000 East Avenue       phone: (925) 422-5722                 **
**Livermore, CA  94550   fax:   (925) 424-6605                 **
*****************************************************************
This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 15:38:39 CST