Re: NAMD 2.7b1 on Ranger/MPI Problems?

From: Jeff Wereszczynski (jmweresz_at_umich.edu)
Date: Fri Apr 17 2009 - 16:59:42 CDT

Hi all,
Just to follow up on this: I believe I have solved the problem. On a
recommendation from the people at TACC, I recompiled NAMD with the
'mvapich-old' module (instead of 'mvapich'), and it now appears to work.
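
For anyone hitting the same error, the module section of the job script
quoted below would change roughly as follows. This is only a sketch: it
assumes the module is named 'mvapich-old' exactly as TACC gave it, and the
MPI_MODULE variable and the check for the 'module' command are illustrative
additions, not part of the original script.

```shell
#!/bin/bash
# Sketch of the module setup after the fix (not the exact script used).
# Assumption: 'mvapich-old' is the module name as given by TACC.
MPI_MODULE="mvapich-old"   # previously: mvapich

# The 'module' command only exists where Environment Modules is set up,
# so skip the setup cleanly if it is missing.
if command -v module >/dev/null 2>&1; then
    module unload mvapich2
    module unload mvapich
    module swap pgi intel
    module load "$MPI_MODULE"
    module list            # check which MPI stack is actually loaded
else
    echo "module command not found; skipping module setup" >&2
fi
```

Loading the module at run time is only half of it: the namd2 binary has to
be rebuilt against the same MPI stack, which is the step that made the
difference here.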

Jeff

On Fri, Apr 17, 2009 at 12:12 AM, Haosheng Cui <haosheng_at_hec.utah.edu> wrote:

>
> Hello all:
>
> I have the same problem. The job usually dies after 1000 steps. If I
> restart the job several (3-5) times, it may get through once and run
> successfully for 24 hours. It seems to happen only with big systems (mine
> is ~800k atoms). The problem has occurred since the beginning of 2009. I
> tried the same job on Kraken, where it works fine most of the time. I have
> already asked TACC for help, but it does not seem to be helping.
>
> Thanks,
> Haosheng
>
>
> Quoting Jeff Wereszczynski <jmweresz_at_umich.edu>:
>
>
>> Hi All,
>> I have a system of ~490k atoms I am trying to run on Ranger, however it will
>> run for only 500-2000 steps before dying. Nothing of interest is printed in
>> the log file:
>>
>> WRITING EXTENDED SYSTEM TO RESTART FILE AT STEP 2000
>> Signal 15 received.
>> Signal 15 received.
>> Signal 15 received.
>> TACC: MPI job exited with code: 1
>> TACC: Shutting down parallel environment.
>> TACC: Shutdown complete. Exiting.
>>
>>
>> Whereas in the job output/error file I get this:
>>
>> TACC: Done.
>> 193 - MPI_IPROBE : Communicator argument is not a valid communicator
>> Special bit pattern b5000000 in communicator is incorrect. May indicate an
>> out-of-order argument or a freed communicator
>> [193] [] Aborting Program!
>> Exit code -3 signaled from i182-206.ranger.tacc.utexas.edu
>> Killing remote processes...Abort signaled by rank 193: Aborting program !
>> MPI process terminated unexpectedly
>> DONE
>>
>> Here is my job script:
>>
>> #!/bin/bash
>> #$ -V # Inherit the submission environment
>> #$ -N namd # Job Name
>> #$ -j y # combine stderr & stdout into stdout
>> #$ -o namd # Name of the output file (eg. myMPI.oJobID)
>> #$ -pe 16way 256 # Requests 16 tasks/node, 256 cores total
>> #$ -q normal # Queue name
>> #$ -l h_rt=2:00:00 # Run time (hh:mm:ss) - 2 hours
>>
>> module unload mvapich2
>> module unload mvapich
>> module swap pgi intel
>> module load mvapich
>>
>> export VIADEV_SMP_EAGERSIZE=64
>> export VIADEV_SMPI_LENGTH_QUEUE=256
>> export VIADEV_ADAPTIVE_RDMA_LIMIT=0
>> export VIADEV_RENDEZVOUS_THRESHOLD=50000
>>
>> ibrun tacc_affinity /share/home/00288/tg455591/NAMD_2.7b1_Linux-x86_64/namd2 namd.inp > namd.log
>>
>> Any ideas what I might be doing wrong? I would guess from the error message
>> that it's some sort of MPI problem. I've tried varying the number of
>> processors (from 64 to 1104), removing the "export ..." lines that control
>> MPI parameters, and taking out the tacc_affinity part, but nothing seems to
>> help. I've never had these problems with smaller systems. Has anyone else
>> had these sorts of issues? Any suggestions on how to fix them?
>>
>> Thanks,
>> Jeff
>>
>>
>
>
