NAMD "freezes" up (further clarification)

From: Eric R Johnson (john4482_at_umn.edu)
Date: Tue May 11 2004 - 08:49:33 CDT

I would like to add the following information to the question below.
After noting the problems described there, I ran cpuburn-in on the new
nodes for approximately one day with no problems. I also ran Memtest86 on
them for at least half a day, again with no problems.
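For reference, the burn-in was along these lines (the exact binary name and
arguments depend on the cpuburn-in version, so treat this as a sketch rather
than my literal command line); Memtest86 was booted standalone, so there is
no command line to show for it:

     # one instance per CPU on a dual-Athlon node (illustrative invocation),
     # left running for roughly a day and then killed
     ./cpuburn-in &
     ./cpuburn-in &
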
Thanks,
Eric

Eric R Johnson wrote:

> Hello,
>
> I have a problem that I am hoping someone on the list has seen
> before. I am currently running NAMD 2.5 (which I compiled myself
> using gcc) on a Scyld (Version 28) cluster with dual Athlon MP 2600+
> processors on Tyan Tiger S2466 motherboards using 2GB of memory
> (registered ECC). Originally, the cluster had 4 nodes with gigabit
> ethernet (Intel cards and Linksys switch) and local hard drives. On
> these nodes, NAMD has been running perfectly. I recently added 2 more
> nodes, which are identical to the original ones (plugged into the same
> switch), except that they are diskless and I am using the on-board
> LAN. If I run a NAMD job on the new nodes, the job will randomly
> "freeze up". This always occurs when NAMD is attempting to write a
> file to the RAID located on the master node. For example,
>
> ENERGY: 23400 0.0000 0.0000 0.0000 0.0000 -181725.1627 15283.0043 0.0000 0.0000 19106.0771 -147336.0813 250.5568 -147295.2153 -147296.2616 252.0297 -126.2688 -94.8730 402402.0000 70.7268 70.7469
>
> ENERGY: 23500 0.0000 0.0000 0.0000 0.0000 -181764.1586 15145.8883 0.0000 0.0000 19280.5044 -147337.7658 252.8442 -147296.9578 -147297.6876 251.5467 -378.1111 -348.5406 402402.0000 -90.2697 -90.2879
>
> WRITING EXTENDED SYSTEM TO RESTART FILE AT STEP 23500
> WRITING COORDINATES TO DCD FILE AT STEP 23500
> WRITING COORDINATES TO RESTART FILE AT STEP 23500
> FINISHED WRITING RESTART COORDINATES
> WRITING VELOCITIES TO RESTART FILE AT STEP 23500
>
> In this case, it froze while writing the velocity restart file, although
> I have also seen the problem occur while writing the DCD file. On the
> exact same system, it has "died" anywhere between 15,000 and 85,000 steps
> into the run. After the freeze occurs, I run tcpdump on the node in
> question and see the following:
>
> 04:02:25.344307 .3 > .-1: (frag 2336:920@7400)
> 04:02:25.344682 .3 > .-1: (frag 2337:920@7400)
> 04:02:25.344686 .3 > .-1: (frag 2338:920@7400)
> 04:02:25.344687 .3 > .-1: (frag 2339:920@7400)
> 04:02:25.344688 .3 > .-1: (frag 2340:920@7400)
> 04:02:25.344689 .3 > .-1: (frag 2341:920@7400)
> 04:02:25.345077 .3 > .-1: (frag 2342:920@7400)
> 04:02:25.345081 .3 > .-1: (frag 2343:920@7400)
> 04:02:25.345082 .3 > .-1: (frag 2344:920@7400)
> 04:02:25.345085 .3 > .-1: (frag 2345:920@7400)
> 04:02:25.345088 .3 > .-1: (frag 2346:920@7400)
> 04:02:25.345464 .3 > .-1: (frag 2347:920@7400)
> 04:02:25.345467 .3 > .-1: (frag 2348:920@7400)
> 04:02:25.345468 .3 > .-1: (frag 2349:920@7400)
> 04:02:25.345469 .3 > .-1: (frag 2350:920@7400)
> 04:02:25.345470 .3 > .-1: (frag 2351:920@7400)
> 04:02:25.345867 .3 > .-1: (frag 2352:920@7400)
> 04:02:25.345870 .3 > .-1: (frag 2353:920@7400)
> 04:02:25.345872 .3 > .-1: (frag 2354:920@7400)
> 04:02:25.345874 .3 > .-1: (frag 2355:920@7400)
> 04:02:25.345877 .3 > .-1: (frag 2356:920@7400)
> 04:02:25.346249 .3 > .-1: (frag 2357:920@7400)
> 04:02:25.346253 .3 > .-1: (frag 2358:920@7400)
> 04:02:25.346253 .3 > .-1: (frag 2359:920@7400)
> 04:02:25.346254 .3 > .-1: (frag 2360:920@7400)
> 04:02:25.346255 .3 > .-1: (frag 2361:920@7400)
> 04:02:25.346645 .3 > .-1: (frag 2362:920@7400)
> 04:02:25.346649 .3 > .-1: (frag 2363:920@7400)
> 04:02:25.346651 .3 > .-1: (frag 2364:920@7400)
> 04:02:25.346653 .3 > .-1: (frag 2365:920@7400)
> 04:02:25.346655 .3 > .-1: (frag 2366:920@7400)
> 04:02:25.347030 .3 > .-1: (frag 2367:920@7400)
> 04:02:25.347034 .3 > .-1: (frag 2368:920@7400)
> 04:02:25.347035 .3 > .-1: (frag 2369:920@7400)
> 04:02:25.347036 .3 > .-1: (frag 2370:920@7400)
>
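> (For reference, the capture above was taken with a plain tcpdump on the
> node's cluster interface, along the lines of
>
>      tcpdump -i eth0
>
> where eth0 stands in for whichever interface the on-board LAN uses. Each
> "(frag id:size@offset)" entry is an IP fragment of a larger datagram, so
> at this point the node is still pushing a stream of large packets toward
> .-1, which I take to be the master node, even though the job itself has
> stopped making progress.)
>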
> I have tried using both the "net-linux scyld" and the "net-linux scyld
> tcp" versions of Charm++ and NAMD, both of which yield the same
> results. I have been running it in the following manner:
> charmrun ++p 2 ++skipmaster ++verbose ++startpe 4 ++endpe 4 ++ppn 2
> namd2 Simulation.namd >& Simulation.namd.out
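>
> For clarity, my reading of those flags (from the Scyld build of charmrun;
> the glosses are my own understanding, not documentation quotes):
>
>      ++p 2          two worker processes in total
>      ++skipmaster   do not place any processes on the master node
>      ++verbose      print verbose startup information
>      ++startpe 4    first compute node to use
>      ++endpe 4      last compute node to use (so only node 4 here)
>      ++ppn 2        two processes on that node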
>
> I get the exact same problem on both of the diskless nodes, whether
> running on 1 processor or 2. Also, I have not had any problems
> running other programs on these nodes (small Gaussian03 jobs,
> Autodock3, etc.).
>
> Any thoughts on the subject would be greatly appreciated.
> Thanks,
> Eric
>

-- 
********************************************************************
  Eric R A Johnson
  University Of Minnesota                      tel: (612) 529 0699
  Dept. of Laboratory Medicine & Pathology   
  7-230 BSBE                              e-mail: john4482_at_umn.edu
  312 Church Street                    web: www.eric-r-johnson.com
  Minneapolis, MN 55455    
  USA                              
