From: Eric R Johnson (bioinformaticist_at_mn.rr.com)
Date: Tue May 11 2004 - 12:34:32 CDT
Jerry,
Jerry Ebalunode wrote:
>I noticed that you are using
>this command to run NAMD
> charmrun ++p 2 ++skipmaster ++verbose ++startpe 4 ++endpe 4 ++ppn 2 ......
>From experience, I have had higher success rates completing NAMD jobs on 
>clusters with two procs/node when using only one processor at a time per 
>node for NAMD.  You seem to be requesting two processes per node.  Have you 
>been able to duplicate this problem using 1 processor per node? 
>  
>
In this particular case, I used two processors per node, but using a 
single processor like:
charmrun ++p 1 ++skipmaster ++verbose ++startpe 4 ++endpe 4 ++ppn 1
yields the same results.
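For reference, the full single-processor command is the same namd2 invocation 
shown in my original message below, just limited to one processor:

charmrun ++p 1 ++skipmaster ++verbose ++startpe 4 ++endpe 4 ++ppn 1 \
    namd2 Simulation.namd >& Simulation.namd.out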
>I see you run autodock3 on your cluster. Is it by chance run in parallel or 
>just serial?
>
This is serial, but as I wrote above, I also have problems using NAMD 
with just a single processor.
Thanks,
Eric
>On Tuesday 11 May 2004 10:10 am, you wrote:
>  
>
>>Hello Eric,
>>
>>Have you tried running NAMD on just one of the new nodes?
>>The onboard LAN is gigabit, right?
>>If you feel confident, you might try swapping an Intel card out of one
>>of the older machines, putting it into the new node, and running a
>>single job....
>>
>>Regards
>>Brian
>>
>>On Tue, 11 May 2004, Eric R Johnson wrote:
>>    
>>
>>>I would like to add the following information to the question below.
>>>After noting the problems listed below, I ran cpuburn-in on the new nodes
>>>for approximately 1 day with no problems.  I also ran Memtest86 on them
>>>for at least half a day, again with no problems.
>>>Thanks,
>>>Eric
>>>
>>>Eric R Johnson wrote:
>>>      
>>>
>>>>Hello,
>>>>
>>>>I have a problem that I am hoping someone on the list has seen
>>>>before.  I am currently running NAMD 2.5 (which I compiled myself
>>>>using gcc) on a Scyld (Version 28) cluster with dual Athlon MP 2600+
>>>>processors on Tyan Tiger S2466 motherboards using 2GB of memory
>>>>(registered ECC).  Originally, the cluster had 4 nodes with gigabit
>>>>ethernet (Intel cards and Linksys switch) and local hard drives.  On
>>>>these nodes, NAMD has been running perfectly.  I recently added 2 more
>>>>nodes, which are identical to the original ones (plugged into the same
>>>>switch), except that they are diskless and I am using the on-board
>>>>LAN.  If I run a NAMD job on the new nodes, the job will randomly
>>>>"freeze up".  This always occurs when NAMD is attempting to write a
>>>>file to the RAID located on the master node.  For example,
>>>>
>>>>ENERGY:   23400         0.0000         0.0000         0.0000         0.0000   -181725.1627     15283.0043         0.0000         0.0000     19106.0771   -147336.0813       250.5568   -147295.2153   -147296.2616       252.0297      -126.2688       -94.8730    402402.0000        70.7268        70.7469
>>>>
>>>>ENERGY:   23500         0.0000         0.0000         0.0000         0.0000   -181764.1586     15145.8883         0.0000         0.0000     19280.5044   -147337.7658       252.8442   -147296.9578   -147297.6876       251.5467      -378.1111      -348.5406    402402.0000       -90.2697       -90.2879
>>>>
>>>>WRITING EXTENDED SYSTEM TO RESTART FILE AT STEP 23500
>>>>WRITING COORDINATES TO DCD FILE AT STEP 23500
>>>>WRITING COORDINATES TO RESTART FILE AT STEP 23500
>>>>FINISHED WRITING RESTART COORDINATES
>>>>WRITING VELOCITIES TO RESTART FILE AT STEP 23500
>>>>
>>>>In this case, it froze while writing the velocity restart file, although
>>>>I have seen the problem occur while writing the DCD file as well.  On
>>>>the exact same system, it has "died" anywhere between 15,000 and 85,000
>>>>steps.  After the freeze occurs, I run tcpdump on the node in question
>>>>and get the following:
>>>>
>>>>04:02:25.344307 .3 > .-1: (frag 2336:920@7400)
>>>>04:02:25.344682 .3 > .-1: (frag 2337:920@7400)
>>>>04:02:25.344686 .3 > .-1: (frag 2338:920@7400)
>>>>04:02:25.344687 .3 > .-1: (frag 2339:920@7400)
>>>>04:02:25.344688 .3 > .-1: (frag 2340:920@7400)
>>>>04:02:25.344689 .3 > .-1: (frag 2341:920@7400)
>>>>04:02:25.345077 .3 > .-1: (frag 2342:920@7400)
>>>>04:02:25.345081 .3 > .-1: (frag 2343:920@7400)
>>>>04:02:25.345082 .3 > .-1: (frag 2344:920@7400)
>>>>04:02:25.345085 .3 > .-1: (frag 2345:920@7400)
>>>>04:02:25.345088 .3 > .-1: (frag 2346:920@7400)
>>>>04:02:25.345464 .3 > .-1: (frag 2347:920@7400)
>>>>04:02:25.345467 .3 > .-1: (frag 2348:920@7400)
>>>>04:02:25.345468 .3 > .-1: (frag 2349:920@7400)
>>>>04:02:25.345469 .3 > .-1: (frag 2350:920@7400)
>>>>04:02:25.345470 .3 > .-1: (frag 2351:920@7400)
>>>>04:02:25.345867 .3 > .-1: (frag 2352:920@7400)
>>>>04:02:25.345870 .3 > .-1: (frag 2353:920@7400)
>>>>04:02:25.345872 .3 > .-1: (frag 2354:920@7400)
>>>>04:02:25.345874 .3 > .-1: (frag 2355:920@7400)
>>>>04:02:25.345877 .3 > .-1: (frag 2356:920@7400)
>>>>04:02:25.346249 .3 > .-1: (frag 2357:920@7400)
>>>>04:02:25.346253 .3 > .-1: (frag 2358:920@7400)
>>>>04:02:25.346253 .3 > .-1: (frag 2359:920@7400)
>>>>04:02:25.346254 .3 > .-1: (frag 2360:920@7400)
>>>>04:02:25.346255 .3 > .-1: (frag 2361:920@7400)
>>>>04:02:25.346645 .3 > .-1: (frag 2362:920@7400)
>>>>04:02:25.346649 .3 > .-1: (frag 2363:920@7400)
>>>>04:02:25.346651 .3 > .-1: (frag 2364:920@7400)
>>>>04:02:25.346653 .3 > .-1: (frag 2365:920@7400)
>>>>04:02:25.346655 .3 > .-1: (frag 2366:920@7400)
>>>>04:02:25.347030 .3 > .-1: (frag 2367:920@7400)
>>>>04:02:25.347034 .3 > .-1: (frag 2368:920@7400)
>>>>04:02:25.347035 .3 > .-1: (frag 2369:920@7400)
>>>>04:02:25.347036 .3 > .-1: (frag 2370:920@7400)
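>>>>
>>>>(As a side note, a capture limited to just the IP fragments can be made
>>>>with a standard BPF filter; the interface name eth0 for the onboard LAN
>>>>is an assumption here:
>>>>tcpdump -n -i eth0 'ip[6:2] & 0x3fff != 0'
>>>>The expression matches any packet whose MF flag or fragment offset field
>>>>is nonzero.)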
>>>>
>>>>I have tried using both the "net-linux scyld" and the "net-linux scyld
>>>>tcp" versions of Charm++ and NAMD, both of which yield the same
>>>>results.  I have been running it in the following manner:
>>>>charmrun ++p 2 ++skipmaster ++verbose ++startpe 4 ++endpe 4 ++ppn 2
>>>>namd2 Simulation.namd >& Simulation.namd.out
>>>>
>>>>I get the exact same problem on both of the diskless nodes, whether
>>>>running on 1 processor or 2.  Also, I have not had any problems
>>>>running other programs on these nodes (small Gaussian03 jobs,
>>>>Autodock3, etc.).
>>>>
>>>>Any thoughts on the subject would be greatly appreciated.
>>>>Thanks,
>>>>Eric
>>>>
>>>>
>>>>--
>>>>********************************************************************
>>>> Eric R A Johnson
>>>> University Of Minnesota                      tel: (612) 529 0699
>>>> Dept. of Laboratory Medicine & Pathology
>>>> 7-230 BSBE                              e-mail: john4482_at_umn.edu
>>>> 312 Church Street                    web: www.eric-r-johnson.com
>>>> Minneapolis, MN 55455
>>>> USA
>>>>        
>>>>
>>>--
>>>********************************************************************
>>>  Eric R A Johnson
>>>  University Of Minnesota                      tel: (612) 529 0699
>>>  Dept. of Laboratory Medicine & Pathology
>>>  7-230 BSBE                              e-mail: john4482_at_umn.edu
>>>  312 Church Street                    web: www.eric-r-johnson.com
>>>  Minneapolis, MN 55455
>>>  USA
>>>      
>>>
>>*****************************************************************
>>**Brian Bennion, Ph.D.                                         **
>>**Computational and Systems Biology Division                   **
>>**Biology and Biotechnology Research Program                   **
>>**Lawrence Livermore National Laboratory                       **
>>**P.O. Box 808, L-448    bennion1_at_llnl.gov                     **
>>**7000 East Avenue       phone: (925) 422-5722                 **
>>**Livermore, CA  94550   fax:   (925) 424-6605                 **
>>*****************************************************************
>>    
>>
>
>  
>
--
********************************************************************
  Eric R A Johnson
  University Of Minnesota                      tel: (612) 529 0699
  Dept. of Laboratory Medicine & Pathology
  7-230 BSBE                              e-mail: john4482_at_umn.edu
  312 Church Street                    web: www.eric-r-johnson.com
  Minneapolis, MN 55455
  USA