From: Eric R Johnson (john4482@umn.edu)
Date: Tue May 11 2004 - 04:20:58 CDT
Hello,
I have a problem that I am hoping someone on the list has seen before.  
I am currently running NAMD 2.5 (which I compiled myself using gcc) on a 
Scyld (Version 28) cluster with dual Athlon MP 2600+ processors on Tyan 
Tiger S2466 motherboards using 2GB of memory (registered ECC).  
Originally, the cluster had 4 nodes with gigabit ethernet (Intel cards 
and Linksys switch) and local hard drives.  On these nodes, NAMD has 
been running perfectly.  I recently added 2 more nodes, which are 
identical to the original ones (plugged into the same switch), except 
that they are diskless and I am using the on-board LAN.  If I run a NAMD
job on the new nodes, the job randomly "freezes up".  This always
occurs when NAMD is attempting to write a file to the RAID located on 
the master node.  For example,
ENERGY:   23400         0.0000         0.0000         0.0000         0.0000        -181725.1627     15283.0043         0.0000         0.0000     19106.0771        -147336.0813       250.5568   -147295.2153   -147296.2616       252.0297           -126.2688       -94.8730    402402.0000        70.7268        70.7469
ENERGY:   23500         0.0000         0.0000         0.0000         0.0000        -181764.1586     15145.8883         0.0000         0.0000     19280.5044        -147337.7658       252.8442   -147296.9578   -147297.6876       251.5467           -378.1111      -348.5406    402402.0000       -90.2697       -90.2879
WRITING EXTENDED SYSTEM TO RESTART FILE AT STEP 23500
WRITING COORDINATES TO DCD FILE AT STEP 23500
WRITING COORDINATES TO RESTART FILE AT STEP 23500
FINISHED WRITING RESTART COORDINATES
WRITING VELOCITIES TO RESTART FILE AT STEP 23500
In this case, it froze while writing the velocity restart file, although I
have seen the problem occur while writing the DCD file as well.  On the
exact same system, it has "died" anywhere between 15,000 and 85,000 steps
into the run.  After the freeze occurs, I run tcpdump on the node in
question and see the following:
04:02:25.344307 .3 > .-1: (frag 2336:920@7400)
04:02:25.344682 .3 > .-1: (frag 2337:920@7400)
04:02:25.344686 .3 > .-1: (frag 2338:920@7400)
04:02:25.344687 .3 > .-1: (frag 2339:920@7400)
04:02:25.344688 .3 > .-1: (frag 2340:920@7400)
04:02:25.344689 .3 > .-1: (frag 2341:920@7400)
04:02:25.345077 .3 > .-1: (frag 2342:920@7400)
04:02:25.345081 .3 > .-1: (frag 2343:920@7400)
04:02:25.345082 .3 > .-1: (frag 2344:920@7400)
04:02:25.345085 .3 > .-1: (frag 2345:920@7400)
04:02:25.345088 .3 > .-1: (frag 2346:920@7400)
04:02:25.345464 .3 > .-1: (frag 2347:920@7400)
04:02:25.345467 .3 > .-1: (frag 2348:920@7400)
04:02:25.345468 .3 > .-1: (frag 2349:920@7400)
04:02:25.345469 .3 > .-1: (frag 2350:920@7400)
04:02:25.345470 .3 > .-1: (frag 2351:920@7400)
04:02:25.345867 .3 > .-1: (frag 2352:920@7400)
04:02:25.345870 .3 > .-1: (frag 2353:920@7400)
04:02:25.345872 .3 > .-1: (frag 2354:920@7400)
04:02:25.345874 .3 > .-1: (frag 2355:920@7400)
04:02:25.345877 .3 > .-1: (frag 2356:920@7400)
04:02:25.346249 .3 > .-1: (frag 2357:920@7400)
04:02:25.346253 .3 > .-1: (frag 2358:920@7400)
04:02:25.346253 .3 > .-1: (frag 2359:920@7400)
04:02:25.346254 .3 > .-1: (frag 2360:920@7400)
04:02:25.346255 .3 > .-1: (frag 2361:920@7400)
04:02:25.346645 .3 > .-1: (frag 2362:920@7400)
04:02:25.346649 .3 > .-1: (frag 2363:920@7400)
04:02:25.346651 .3 > .-1: (frag 2364:920@7400)
04:02:25.346653 .3 > .-1: (frag 2365:920@7400)
04:02:25.346655 .3 > .-1: (frag 2366:920@7400)
04:02:25.347030 .3 > .-1: (frag 2367:920@7400)
04:02:25.347034 .3 > .-1: (frag 2368:920@7400)
04:02:25.347035 .3 > .-1: (frag 2369:920@7400)
04:02:25.347036 .3 > .-1: (frag 2370:920@7400)
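In case it helps anyone reproduce this, a capture like the one above can be
limited to just the fragment traffic with the standard tcpdump fragment
filter (the interface name here is only a guess for my setup):

tcpdump -n -i eth0 'ip[6:2] & 0x3fff != 0'

The expression matches any packet with a nonzero fragment offset or the
more-fragments bit set, which is exactly the (frag id:size@offset) lines
shown above.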
I have tried using both the "net-linux scyld" and the "net-linux scyld 
tcp" versions of Charm++ and NAMD, both of which yield the same 
results.  I have been running it in the following manner:
charmrun ++p 2 ++skipmaster ++verbose ++startpe 4 ++endpe 4 ++ppn 2 
namd2 Simulation.namd >& Simulation.namd.out
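(The ++startpe 4 ++endpe 4 pair is what pins the job to node 4, one of the
new diskless nodes.  For comparison, the same run pinned to one of the
original nodes, e.g. node 0 if that numbering is right, would be:

charmrun ++p 2 ++skipmaster ++verbose ++startpe 0 ++endpe 0 ++ppn 2 namd2 Simulation.namd >& Simulation.namd.out

and runs like that on the original nodes have always completed cleanly.)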
I get the exact same problem on both of the diskless nodes, whether 
running on 1 processor or 2.  Also, I have not had any problems running 
other programs on these nodes (small Gaussian03 jobs, Autodock3, etc.).
Any thoughts on the subject would be greatly appreciated.
Thanks,
Eric
 
--
********************************************************************
Eric R A Johnson
Dept. of Laboratory Medicine & Pathology, University of Minnesota
7-230 BSBE, 312 Church Street
Minneapolis, MN 55455 USA
tel: (612) 529 0699
e-mail: john4482@umn.edu
web: www.eric-r-johnson.com
********************************************************************