From: Haosheng Cui (haosheng_at_hec.utah.edu)
Date: Fri Apr 17 2009 - 02:12:54 CDT
Hello all:
I have the same problem. The job usually dies after about 1000 steps. If
I restart it several (3-5) times, it may get through and run
successfully for 24 hours. It seems to happen only with big systems
(mine is ~800k atoms). The problem started at the beginning of 2009.
I tried the same job on Kraken, and there it works fine most of the time. I have
already asked TACC for help, but that hasn't helped so far.
Thanks,
Haosheng
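[Editor's note: since the failure mode described above is transient (the job dies early but succeeds after a few manual restarts), a simple resubmission wrapper can automate the retries. This is a minimal, hypothetical sketch, not part of the original posts; `run_namd_job` is a stand-in for whatever actually launches the job (e.g. the `ibrun` line in the script below).]

```shell
#!/bin/sh
# Hedged sketch: retry a flaky command up to MAX_TRIES times.
# "run_namd_job" below is a hypothetical placeholder for the real launch command.
MAX_TRIES=5

retry() {
    tries=0
    while [ "$tries" -lt "$MAX_TRIES" ]; do
        if "$@"; then
            return 0                        # command succeeded; stop retrying
        fi
        tries=$((tries + 1))
        echo "attempt $tries failed; retrying..." >&2
    done
    return 1                                # gave up after MAX_TRIES failures
}

# Example invocation (placeholder command):
# retry run_namd_job
```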
Quoting Jeff Wereszczynski <jmweresz_at_umich.edu>:
> Hi All,
> I have a system of ~490k atoms I am trying to run on Ranger, however it will
> run for only 500-2000 steps before dying. Nothing of interest is printed in
> the log file:
>
> WRITING EXTENDED SYSTEM TO RESTART FILE AT STEP 2000
> Signal 15 received.
> Signal 15 received.
> Signal 15 received.
> TACC: MPI job exited with code: 1
> TACC: Shutting down parallel environment.
> TACC: Shutdown complete. Exiting.
>
>
> Whereas in the job output/error file I get this:
>
> TACC: Done.
> 193 - MPI_IPROBE : Communicator argument is not a valid communicator
> Special bit pattern b5000000 in communicator is incorrect. May indicate an
> out-of-order argument or a freed communicator
> [193] [] Aborting Program!
> Exit code -3 signaled from i182-206.ranger.tacc.utexas.edu
> Killing remote processes...Abort signaled by rank 193: Aborting program !
> MPI process terminated unexpectedly
> DONE
>
> Here is my job script:
>
> #!/bin/bash
> #$ -V # Inherit the submission environment
> #$ -N namd # Job Name
> #$ -j y # combine stderr & stdout into stdout
> #$ -o namd # Name of the output file (eg. myMPI.oJobID)
> #$ -pe 16way 256 # Requests 16 cores/node, 256 cores total
> #$ -q normal # Queue name
> #$ -l h_rt=2:00:00 # Run time (hh:mm:ss) - 2 hours
>
> module unload mvapich2
> module unload mvapich
> module swap pgi intel
> module load mvapich
>
> export VIADEV_SMP_EAGERSIZE=64
> export VIADEV_SMPI_LENGTH_QUEUE=256
> export VIADEV_ADAPTIVE_RDMA_LIMIT=0
> export VIADEV_RENDEZVOUS_THRESHOLD=50000
>
> ibrun tacc_affinity /share/home/00288/tg455591/NAMD_2.7b1_Linux-x86_64/namd2 namd.inp >namd.log
>
> Any ideas what I might be doing wrong? From the error message I would guess
> it's some sort of MPI problem. I've tried varying the number of processors
> (from 64 to 1104), commenting out the "export ..." lines that control the MPI
> parameters, and removing the tacc_affinity part, but nothing seems to help.
> I've never had these problems with smaller systems. Has anyone else seen
> this sort of issue? Any suggestions on how to fix it?
>
> Thanks,
> Jeff
>
This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 15:52:37 CST