NAMD "freezes" up

From: Eric R Johnson (john4482_at_umn.edu)
Date: Tue May 11 2004 - 04:20:58 CDT

Hello,

I have a problem that I am hoping someone on the list has seen before.
I am currently running NAMD 2.5 (which I compiled myself using gcc) on a
Scyld (Version 28) cluster with dual Athlon MP 2600+ processors on Tyan
Tiger S2466 motherboards using 2GB of memory (registered ECC).
Originally, the cluster had 4 nodes with gigabit ethernet (Intel cards
and Linksys switch) and local hard drives. On these nodes, NAMD has
been running perfectly. I recently added 2 more nodes, which are
identical to the original ones (plugged into the same switch), except
that they are diskless and I am using the on-board LAN. If I run a NAMD
job on the new nodes, it will randomly "freeze up". This always
occurs while NAMD is attempting to write a file to the RAID located on
the master node. For example,

ENERGY: 23400 0.0000 0.0000 0.0000
0.0000 -181725.1627 15283.0043 0.0000
0.0000 19106.0771 -147336.0813 250.5568
-147295.2153 -147296.2616 252.0297 -126.2688
-94.8730 402402.0000 70.7268 70.7469

ENERGY: 23500 0.0000 0.0000 0.0000
0.0000 -181764.1586 15145.8883 0.0000
0.0000 19280.5044 -147337.7658 252.8442
-147296.9578 -147297.6876 251.5467 -378.1111
-348.5406 402402.0000 -90.2697 -90.2879

WRITING EXTENDED SYSTEM TO RESTART FILE AT STEP 23500
WRITING COORDINATES TO DCD FILE AT STEP 23500
WRITING COORDINATES TO RESTART FILE AT STEP 23500
FINISHED WRITING RESTART COORDINATES
WRITING VELOCITIES TO RESTART FILE AT STEP 23500
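
For anyone comparing several stalled runs, here is a minimal Python sketch
that reports where a log like the one above stopped. It assumes only the
line prefixes visible in this excerpt ("ENERGY:", "WRITING ", "FINISHED
WRITING "); the function name last_output_stage is hypothetical and not
part of NAMD.

```python
def last_output_stage(log_text):
    """Return (last output message, last ENERGY step) seen in a NAMD log.

    Only line prefixes are inspected, so the result tells you which write
    was announced last, not whether that write actually completed.
    """
    last_message = None
    last_energy_step = None
    for raw in log_text.splitlines():
        line = raw.strip()
        if line.startswith("ENERGY:"):
            fields = line.split()
            # The second field of an ENERGY line is the timestep.
            if len(fields) > 1 and fields[1].isdigit():
                last_energy_step = int(fields[1])
        elif line.startswith(("WRITING ", "FINISHED WRITING ")):
            last_message = line
    return last_message, last_energy_step
```

Applied to the excerpt above, the last message is the velocity restart
line and the last ENERGY step is 23500.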

In this case, it is writing the velocity restart file, although I have
seen the problem occur while writing the DCD file as well. On the exact
same system, it has "died" anywhere between 15,000 and 85,000 steps.
After the freeze occurs, I run tcpdump on the node in question and I get
the following:

04:02:25.344307 .3 > .-1: (frag 2336:920@7400)
04:02:25.344682 .3 > .-1: (frag 2337:920@7400)
04:02:25.344686 .3 > .-1: (frag 2338:920@7400)
04:02:25.344687 .3 > .-1: (frag 2339:920@7400)
04:02:25.344688 .3 > .-1: (frag 2340:920@7400)
04:02:25.344689 .3 > .-1: (frag 2341:920@7400)
04:02:25.345077 .3 > .-1: (frag 2342:920@7400)
04:02:25.345081 .3 > .-1: (frag 2343:920@7400)
04:02:25.345082 .3 > .-1: (frag 2344:920@7400)
04:02:25.345085 .3 > .-1: (frag 2345:920@7400)
04:02:25.345088 .3 > .-1: (frag 2346:920@7400)
04:02:25.345464 .3 > .-1: (frag 2347:920@7400)
04:02:25.345467 .3 > .-1: (frag 2348:920@7400)
04:02:25.345468 .3 > .-1: (frag 2349:920@7400)
04:02:25.345469 .3 > .-1: (frag 2350:920@7400)
04:02:25.345470 .3 > .-1: (frag 2351:920@7400)
04:02:25.345867 .3 > .-1: (frag 2352:920@7400)
04:02:25.345870 .3 > .-1: (frag 2353:920@7400)
04:02:25.345872 .3 > .-1: (frag 2354:920@7400)
04:02:25.345874 .3 > .-1: (frag 2355:920@7400)
04:02:25.345877 .3 > .-1: (frag 2356:920@7400)
04:02:25.346249 .3 > .-1: (frag 2357:920@7400)
04:02:25.346253 .3 > .-1: (frag 2358:920@7400)
04:02:25.346253 .3 > .-1: (frag 2359:920@7400)
04:02:25.346254 .3 > .-1: (frag 2360:920@7400)
04:02:25.346255 .3 > .-1: (frag 2361:920@7400)
04:02:25.346645 .3 > .-1: (frag 2362:920@7400)
04:02:25.346649 .3 > .-1: (frag 2363:920@7400)
04:02:25.346651 .3 > .-1: (frag 2364:920@7400)
04:02:25.346653 .3 > .-1: (frag 2365:920@7400)
04:02:25.346655 .3 > .-1: (frag 2366:920@7400)
04:02:25.347030 .3 > .-1: (frag 2367:920@7400)
04:02:25.347034 .3 > .-1: (frag 2368:920@7400)
04:02:25.347035 .3 > .-1: (frag 2369:920@7400)
04:02:25.347036 .3 > .-1: (frag 2370:920@7400)
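
To help read the trace: tcpdump prints a non-initial IP fragment as
(frag id:size@offset), with a trailing "+" when more fragments follow.
The Python sketch below decodes such lines and does the arithmetic. It is
only an illustration: the function names are hypothetical, the 1500-byte
MTU and 20-byte IP header are assumptions about this network, and the
pattern accepts either "@" or the "_at_" substitution that mail archives
sometimes apply.

```python
import re

# Matches e.g. "(frag 2336:920@7400)" or "(frag 2336:920@7400+)".
FRAG_RE = re.compile(r"\(frag (\d+):(\d+)(?:@|_at_)(\d+)(\+?)\)")

def parse_fragment(line):
    """Extract id, size, offset and the more-fragments flag from one line."""
    m = FRAG_RE.search(line)
    if m is None:
        return None
    ident, size, offset, more = m.groups()
    return {
        "id": int(ident),
        "size": int(size),
        "offset": int(offset),
        "more": more == "+",
    }

def datagram_payload(last_fragment):
    """IP payload length of the whole datagram, given its final fragment."""
    return last_fragment["offset"] + last_fragment["size"]

def fragment_count(total_payload, mtu=1500, ip_header=20):
    """Number of fragments needed, assuming every fragment but the last is full.

    Fragment data must be a multiple of 8 bytes, hence the rounding down.
    """
    per_fragment = (mtu - ip_header) // 8 * 8
    return -(-total_payload // per_fragment)  # ceiling division
```

Every line in the trace has no "+" and sits at offset 7400 with 920
bytes, so each is the final piece of a datagram carrying 7400 + 920 =
8320 bytes of IP payload. Under the assumed 1500-byte MTU that is six
fragments per datagram, of which only the last appears in this output.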

I have tried using both the "net-linux scyld" and the "net-linux scyld
tcp" versions of Charm++ and NAMD, both of which yield the same
results. I have been running it in the following manner:
charmrun ++p 2 ++skipmaster ++verbose ++startpe 4 ++endpe 4 ++ppn 2
namd2 Simulation.namd >& Simulation.namd.out

I get the exact same problem on both of the diskless nodes, whether
running on 1 processor or 2. Also, I have not had any problems running
other programs on these nodes (small Gaussian03 jobs, Autodock3, etc.).

Any thoughts on the subject would be greatly appreciated.
Thanks,
Eric
 

-- 
********************************************************************
  Eric R A Johnson
  University Of Minnesota                      tel: (612) 529 0699
  Dept. of Laboratory Medicine & Pathology   
  7-230 BSBE                              e-mail: john4482_at_umn.edu
  312 Church Street                    web: www.eric-r-johnson.com
  Minneapolis, MN 55455    
  USA                              
