Re: NAMD on KRAKEN

From: Axel Kohlmeyer (akohlmey_at_gmail.com)
Date: Fri Oct 28 2011 - 08:03:35 CDT

On Fri, Oct 28, 2011 at 8:55 AM, PAUL NEWMAN <paulclizana_at_gmail.com> wrote:

> Thanks so much for your replies. Yes, I did that. I put the commands for
> changing the variables just before aprun. However, it looks like the
> simulation cannot be started.
>
> The error is
>
> _pmii_daemon(SIGCHLD): PE 1 exit signal Segmentation fault
> _pmii_daemon(SIGCHLD): PE 54 exit signal Segmentation fault
> _pmii_daemon(SIGCHLD): PE 66 exit signal Segmentation fault
> _pmii_daemon(SIGCHLD): PE 198 exit signal Segmentation fault
> _pmii_daemon(SIGCHLD): PE 294 exit signal Segmentation fault
> _pmii_daemon(SIGCHLD): PE 314 exit signal Segmentation fault
> _pmii_daemon(SIGCHLD): PE 348 exit signal Segmentation fault
> _pmii_daemon(SIGCHLD): PE 402 exit signal Segmentation fault
> _pmii_daemon(SIGCHLD): PE 450 exit signal Segmentation fault
> _pmii_daemon(SIGCHLD): PE 498 exit signal Segmentation fault
> _pmii_daemon(SIGCHLD): PE 548 exit signal Segmentation fault
> _pmii_daemon(SIGCHLD): PE 630 exit signal Segmentation fault
> _pmii_daemon(SIGCHLD): PE 656 exit signal Segmentation fault
> _pmii_daemon(SIGCHLD): PE 690 exit signal Segmentation fault
> _pmii_daemon(SIGCHLD): PE 764 exit signal Segmentation fault
> _pmii_daemon(SIGCHLD): PE 788 exit signal Segmentation fault
> _pmii_daemon(SIGCHLD): PE 837 exit signal Segmentation fault
> _pmii_daemon(SIGCHLD): PE 882 exit signal Segmentation fault
> _pmii_daemon(SIGCHLD): PE 984 exit signal Segmentation fault
> _pmii_daemon(SIGCHLD): PE 1056 exit signal Segmentation fault
> _pmii_daemon(SIGCHLD): PE 1113 exit signal Segmentation fault
> _pmii_daemon(SIGCHLD): PE 1158 exit signal Segmentation fault
> _pmii_daemon(SIGCHLD): PE 1188 exit signal Segmentation fault
> _pmii_daemon(SIGCHLD): PE 1170 exit signal Segmentation fault
> [NID 04190] 2011-10-25 00:57:45 Apid 7522778: initiated application termination
> _pmii_daemon(SIGCHLD): PE 1073 exit signal Segmentation fault
> _pmii_daemon(SIGCHLD): PE 1035 exit signal Segmentation fault
> _pmii_daemon(SIGCHLD): PE 979 exit signal Segmentation fault
> _pmii_daemon(SIGCHLD): PE 841 exit signal Segmentation fault
> _pmii_daemon(SIGCHLD): PE 799 exit signal Segmentation fault
> _pmii_daemon(SIGCHLD): PE 738 exit signal Segmentation fault
> _pmii_daemon(SIGCHLD): PE 702 exit signal Segmentation fault
> _pmii_daemon(SIGCHLD): PE 661 exit signal Segmentation fault
> _pmii_daemon(SIGCHLD): PE 619 exit signal Segmentation fault
> _pmii_daemon(SIGCHLD): PE 558 exit signal Segmentation fault
> _pmii_daemon(SIGCHLD): PE 522 exit signal Segmentation fault
> _pmii_daemon(SIGCHLD): PE 463 exit signal Segmentation fault
> _pmii_daemon(SIGCHLD): PE 429 exit signal Segmentation fault
> .......
> .......
> Application 7522778 exit codes: 139
> Application 7522778 exit signals: Killed
> Application 7522778 resources: utime 43, stime 676
>
> Any suggestions would be highly appreciated.

please contact XSEDE user support.
this appears to be machine specific,
and they are supposed to be the
experts in running correctly on their
machines. no point in guessing around.

axel.

>
>
>
> On Fri, Oct 28, 2011 at 5:43 AM, Axel Kohlmeyer <akohlmey_at_gmail.com> wrote:
>
>> On Oct 28, 2011, at 3:05 AM, PAUL NEWMAN <paulclizana_at_gmail.com> wrote:
>>
>> > Dear NAMD users,
>> >
>> > I am running a Free Energy calculation on KRAKEN and I got the following
>> > error. (Sorry, I don't know if it is appropriate to post here.)
>> >
>> >
>> >
>> > #############################################################################################################################
>> > ENERGY: 400 5033.9992 13345.0199 7704.4458 0.0000 -1781977.4449 211831.2980 0.0000 0.0000 301202.1718 -1242860.5102 309.7946 -1544062.6820 -1242026.2916 310.0217 36.8877 67.6739 4928733.5188 28.9856 28.9725
>> >
>> > [0] MPICH has run out of unexpected buffer space.
>> > Try increasing the value of env var MPICH_UNEX_BUFFER_SIZE (cur value is 62914560),
>> > and/or reducing the size of MPICH_MAX_SHORT_MSG_SIZE (cur value is 128000).
>> > aborting job:
>> > out of unexpected buffer space
>> > FreeEnergy: 500 1.000 Stop 1.00000 0.00000 ( 40.701, 63.777, 113.334) ( 40.718, 63.932, 113.375) 0.161 |
>> > SMD 500 44.5049 73.3361 141.589 0 0 0.298436
>> > [NID 10569] 2011-10-27 20:24:16 Apid 7553654: initiated application termination
>> > Application 7553654 exit codes: 255
>> > Application 7553654 exit signals: Killed
>> > Application 7553654 resources: utime 33496, stime 298
>> >
>> > ############################################################################################################################
>> >
>> >
>> > I also added the following lines to the run script, after the aprun
>> > command, but I still got the same error.
>> >
>> > aprun -n \$PBS_NNODES -cc cpu /lustre/scratch/jphillip/NAMD_2.7_CRAY-XT-Kraken/namd2 $CONFFILE >& $LOGFILE
>> >
>> > setenv MPICH_PTL_SEND_CREDITS -1
>> > setenv MPICH_MAX_SHORT_MSG_SIZE 8000
>> > setenv MPICH_PTL_UNEX_EVENTS 100M
>> > setenv MPICH_UNEX_BUFFER_SIZE 500M
>> >
>> > It seems that the default values are not being changed. Any help would be
>> > highly appreciated.
>> >
>>
>>
>> Changing environment variables _after_ the aprun command is pretty
>> useless. They won't affect it. You have to move those commands up.
>>
>> Axel
>>
>>
>> > Thanks
>> >
>> > --
>> > Cheers,
>> >
>> > Paul
>> >
>> >
>>
>
>
>
> --
> Cheers,
>
> Paul
>
>
>
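
For reference, the ordering point from the earlier reply quoted above can
be sketched as follows. This is only a sketch under stated assumptions: it
assumes a csh/tcsh job script (the original post uses setenv), and it
reuses the variable values and the aprun line from the original post
unchanged. It says nothing about whether those values are right for this
job; the follow-up run quoted above still ended in segmentation faults.

```csh
# Set the MPICH variables first, so they are already in the environment
# when aprun starts. Values are the ones from the original post.
setenv MPICH_PTL_SEND_CREDITS -1
setenv MPICH_MAX_SHORT_MSG_SIZE 8000
setenv MPICH_PTL_UNEX_EVENTS 100M
setenv MPICH_UNEX_BUFFER_SIZE 500M

# Launch line copied verbatim from the original post, including the "\$".
# Assumption: the backslash comes from a wrapper that generates the job
# script; adjust it to match how your own script is produced.
aprun -n \$PBS_NNODES -cc cpu /lustre/scratch/jphillip/NAMD_2.7_CRAY-XT-Kraken/namd2 $CONFFILE >& $LOGFILE
```

One way to confirm the settings are in place is to add a line such as
"printenv MPICH_UNEX_BUFFER_SIZE" to the script just before aprun. The
"cur value is ..." text in the MPICH abort message quoted above also shows
the value that was actually in effect.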

-- 
Dr. Axel Kohlmeyer
akohlmey_at_gmail.com  http://goo.gl/1wk0
Institute for Computational Molecular Science
Temple University, Philadelphia PA, USA.
