Re: NAMD2.7 scaling problems

From: DimitryASuplatov (genesup_at_gmail.com)
Date: Wed Oct 28 2009 - 08:30:44 CDT

Thank you very much for this reply.

I've tested my installation of NAMD 2.7 on an 80,000-atom system and it
ran fine on 256 CPUs, so perhaps the problem was the small system size.
I will also try your recommendations.

Thank you.

Bjoern Olausson wrote:
> On Wednesday 28 October 2009 10:06:13 DimitryASuplatov wrote:
>
>> Hello,
>>
>> I am trying to run a 39,728-atom system using NAMD 2.7b1.
>>
>> It runs without a problem on 1,4,8,16,32,64 CPUs, but when I try to
>> launch it on 128 and 256 CPUs it throws an error
>>
>> MPI process terminated unexpectedly
>> Exit code -5 signaled from node-04-05
>> Killing remote processes...Signal 15 received.
>> Signal 15 received.
>> Signal 15 received.
>> Signal 15 received.
>> DONE
>>
>> which is most likely related to mpirun rather than to the namd executable.
>>
>> In the same situation, NAMD 2.6 runs perfectly on 256 CPUs.
>>
>> NAMD 2.7 was compiled as suggested in the note.txt file, using the tips
>> from the NamdWiki related to MPICH and InfiniBand.
>>
>> Does this happen because my system is too small to scale to 256 CPUs? But
>> then why does NAMD 2.6 work fine? I thought 2.7 had better scalability.
>> Could it be caused by an incorrect installation? What could be the
>> problem?
>>
>> Thank you very much for your time.
>>
>> SDA
>>
>>
>
> What version of charm++ are you using? Which MPI implementation are you using?
> Which compiler are you using? What kind of InfiniBand do you have...
>
> Just one suggestion:
>
> Try charm++ and NAMD from CVS, and use the charm++ native ibverbs layer
> instead of MPI.
>
> For example compiling charm++:
> ---------------------------------------------------------------
> # Not sure if this export is necessary
> # Use SSH instead of RSH
> export CONV_RSH="ssh"
> ./build charm++ net-linux-x86_64 icc ibverbs -j8 -O
> ---------------------------------------------------------------
> Compiling NAMD2:
> ---------------------------------------------------------------
> ./config Linux-x86_64-icc.net-linux-x86_64-ibverbs-icc \
> --arch-suffix net-linux-x86_64-ibverbs-icc \
> --charm-arch net-linux-x86_64-ibverbs-icc \
> --charm-base /home/blub/src/NAMD_CVS_Source/charm
>
> cd Linux-x86_64-icc.net-linux-x86_64-ibverbs-icc ; make -j8
> ---------------------------------------------------------------
>
> Based on my tests, this version scales nearly linearly up to 64 cores and
> yields 20% more performance than NAMD run with MVAPICH 1.2 over InfiniBand.
>
> Be sure to use at least the latest CVS snapshot of charm++, because previous
> charm++ releases have a memory leak that causes them to die after ~4000 steps
> when using IB.
>
> If you use ICC to compile charm++, also be warned that ICC 11.1.046 has a
> bug which corrupts IB builds of charm++ built from sources prior to 23
> October 2009. ICC 10 will work fine.
>
> Using the ibverbs version might also solve your problem ;-)
>
> If you don't want to, or cannot, use charm++ ibverbs, compile charm++ and NAMD
> with debugger support (add "-g" to the compile flags) and attach gdb to the
> last process to get more information about the crash.
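> A rough sketch of what that could look like; the extra build flags and the
> pgrep-based way of finding the process are assumptions, not taken from the
> build instructions above:
>
> ---------------------------------------------------------------
> # Rebuild charm++ with debug symbols (extra flags are passed to the compiler)
> ./build charm++ net-linux-x86_64 icc ibverbs -j8 -g -O0
>
> # On the node running the last namd2 process, attach gdb to it
> # (pgrep -n picks the most recently started matching process)
> gdb -p $(pgrep -n namd2)
> # Inside gdb, after the crash or hang:
> #   (gdb) bt             # print a backtrace
> #   (gdb) info threads   # list all threads
> ---------------------------------------------------------------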
>
>
>
> You can run the ibverbs version as follows:
>
> -----------------------------------------------------------------------
> #!/bin/bash
> #
> #
> charmrun="/path/to/charmrun"
> namd2="/path/to/namd2"
>
> # Directory containing your machinefile (named "machines")
> TMPDIR="/path/to/machinefile"
> # Number of cores to use
> NSLOTS="32"
>
> # Options to overcome RSH resource limits
> RSHopts="++batch 32 ++timeout 300"
> # When using SSH use ++scalable-start to overcome resource limits
> #RSHopts="++remote-shell ssh"
> #RSHopts="++scalable-start"
>
>
> # Below 228 cores I don't need to take care of RSH resource limits
> if [[ ${NSLOTS} -le 228 ]] ; then
>     RSHopts=""
> fi
>
> ${charmrun} ${RSHopts} ${namd2} ++nodelist ${TMPDIR}/machines \
> +p ${NSLOTS} +setcpuaffinity npgt_firstrun.conf > npgt_firstrun.out
> ------------------------------------------------------------------------
>
> Kind regards
> Bjoern Olausson
>
>
> ------------------------------------------------------------------------
>
