Re: NAMD on KRAKEN

From: PAUL NEWMAN (paulclizana_at_gmail.com)
Date: Fri Oct 28 2011 - 07:55:01 CDT

Thanks so much for your replies. Yes, I did that: I put the commands for
changing the variables just before aprun. However, it now looks like the
simulation cannot start at all.
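
For reference, the ordering described above can be sketched as follows. This is only a sketch in csh syntax, reusing the variable values and the aprun line from the quoted message further down. It assumes that $CONFFILE and $LOGFILE are defined earlier in the job script, and that the backslash before $PBS_NNODES in the quoted line is an escaping artifact that can be dropped. The values are the ones tried in this thread, not recommended settings.

```csh
# Sketch only: set the MPICH variables first so that aprun inherits them.
# Values are copied from the quoted message, not tuned recommendations.
setenv MPICH_PTL_SEND_CREDITS -1
setenv MPICH_MAX_SHORT_MSG_SIZE 8000
setenv MPICH_PTL_UNEX_EVENTS 100M
setenv MPICH_UNEX_BUFFER_SIZE 500M

# Launch NAMD only after the environment is in place.
# Assumes $CONFFILE and $LOGFILE were set earlier in the script.
aprun -n $PBS_NNODES -cc cpu /lustre/scratch/jphillip/NAMD_2.7_CRAY-XT-Kraken/namd2 $CONFFILE >& $LOGFILE
```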

The error is

_pmii_daemon(SIGCHLD): PE 1 exit signal Segmentation fault
_pmii_daemon(SIGCHLD): PE 54 exit signal Segmentation fault
_pmii_daemon(SIGCHLD): PE 66 exit signal Segmentation fault
_pmii_daemon(SIGCHLD): PE 198 exit signal Segmentation fault
_pmii_daemon(SIGCHLD): PE 294 exit signal Segmentation fault
_pmii_daemon(SIGCHLD): PE 314 exit signal Segmentation fault
_pmii_daemon(SIGCHLD): PE 348 exit signal Segmentation fault
_pmii_daemon(SIGCHLD): PE 402 exit signal Segmentation fault
_pmii_daemon(SIGCHLD): PE 450 exit signal Segmentation fault
_pmii_daemon(SIGCHLD): PE 498 exit signal Segmentation fault
_pmii_daemon(SIGCHLD): PE 548 exit signal Segmentation fault
_pmii_daemon(SIGCHLD): PE 630 exit signal Segmentation fault
_pmii_daemon(SIGCHLD): PE 656 exit signal Segmentation fault
_pmii_daemon(SIGCHLD): PE 690 exit signal Segmentation fault
_pmii_daemon(SIGCHLD): PE 764 exit signal Segmentation fault
_pmii_daemon(SIGCHLD): PE 788 exit signal Segmentation fault
_pmii_daemon(SIGCHLD): PE 837 exit signal Segmentation fault
_pmii_daemon(SIGCHLD): PE 882 exit signal Segmentation fault
_pmii_daemon(SIGCHLD): PE 984 exit signal Segmentation fault
_pmii_daemon(SIGCHLD): PE 1056 exit signal Segmentation fault
_pmii_daemon(SIGCHLD): PE 1113 exit signal Segmentation fault
_pmii_daemon(SIGCHLD): PE 1158 exit signal Segmentation fault
_pmii_daemon(SIGCHLD): PE 1188 exit signal Segmentation fault
_pmii_daemon(SIGCHLD): PE 1170 exit signal Segmentation fault
[NID 04190] 2011-10-25 00:57:45 Apid 7522778: initiated application termination
_pmii_daemon(SIGCHLD): PE 1073 exit signal Segmentation fault
_pmii_daemon(SIGCHLD): PE 1035 exit signal Segmentation fault
_pmii_daemon(SIGCHLD): PE 979 exit signal Segmentation fault
_pmii_daemon(SIGCHLD): PE 841 exit signal Segmentation fault
_pmii_daemon(SIGCHLD): PE 799 exit signal Segmentation fault
_pmii_daemon(SIGCHLD): PE 738 exit signal Segmentation fault
_pmii_daemon(SIGCHLD): PE 702 exit signal Segmentation fault
_pmii_daemon(SIGCHLD): PE 661 exit signal Segmentation fault
_pmii_daemon(SIGCHLD): PE 619 exit signal Segmentation fault
_pmii_daemon(SIGCHLD): PE 558 exit signal Segmentation fault
_pmii_daemon(SIGCHLD): PE 522 exit signal Segmentation fault
_pmii_daemon(SIGCHLD): PE 463 exit signal Segmentation fault
_pmii_daemon(SIGCHLD): PE 429 exit signal Segmentation fault
......
......
Application 7522778 exit codes: 139
Application 7522778 exit signals: Killed
Application 7522778 resources: utime 43, stime 676

Any suggestions would be highly appreciated.

On Fri, Oct 28, 2011 at 5:43 AM, Axel Kohlmeyer <akohlmey_at_gmail.com> wrote:

> On Oct 28, 2011, at 3:05 AM, PAUL NEWMAN <paulclizana_at_gmail.com> wrote:
>
> > Dear NAMD users,
> >
> > I am running a Free Energy calculation on KRAKEN and I got the following
> > error. (Sorry, I don't know if it is appropriate to post here.)
> >
> >
> >
> #############################################################################################################################
> > ENERGY: 400 5033.9992 13345.0199 7704.4458 0.0000 -1781977.4449 211831.2980 0.0000 0.0000 301202.1718 -1242860.5102 309.7946 -1544062.6820 -1242026.2916 310.0217 36.8877 67.6739 4928733.5188 28.9856 28.9725
> >
> > [0] MPICH has run out of unexpected buffer space.
> > Try increasing the value of env var MPICH_UNEX_BUFFER_SIZE (cur value is 62914560),
> > and/or reducing the size of MPICH_MAX_SHORT_MSG_SIZE (cur value is 128000).
> > aborting job:
> > out of unexpected buffer space
> > FreeEnergy: 500 1.000 Stop 1.00000 0.00000 ( 40.701, 63.777, 113.334) ( 40.718, 63.932, 113.375) 0.161 |
> > SMD 500 44.5049 73.3361 141.589 0 0 0.298436
> > [NID 10569] 2011-10-27 20:24:16 Apid 7553654: initiated application termination
> > Application 7553654 exit codes: 255
> > Application 7553654 exit signals: Killed
> > Application 7553654 resources: utime 33496, stime 298
> >
> ############################################################################################################################
> >
> >
> > I also added the following lines to the run script after the aprun
> > command, but I still got the same error.
> >
> > aprun -n \$PBS_NNODES -cc cpu /lustre/scratch/jphillip/NAMD_2.7_CRAY-XT-Kraken/namd2 $CONFFILE >& $LOGFILE
> >
> > setenv MPICH_PTL_SEND_CREDITS -1
> > setenv MPICH_MAX_SHORT_MSG_SIZE 8000
> > setenv MPICH_PTL_UNEX_EVENTS 100M
> > setenv MPICH_UNEX_BUFFER_SIZE 500M
> >
> > It seems that this is not changing the default values. Any help would be
> > highly appreciated.
> >
>
>
> Changing environment variables _after_ the aprun command is pretty
> useless. By the time they are set, aprun has already launched the job
> with the old values. You have to move those commands above the aprun line.
>
> Axel
>
>
> > Thanks
> >
> > --
> > Cheers,
> >
> > Paul
> >
> >
>

-- 
Cheers,
Paul
