Re: NAMD slows at startup phase 1 smp problem

From: Jim Phillips (jim_at_ks.uiuc.edu)
Date: Wed Jan 14 2015 - 17:16:19 CST

Hi Ryan,

First, if at all possible avoid MPI-smp in favor of ibverbs-smp, assuming
you do have InfiniBand. Then "charmrun ++mpiexec" will use mpiexec to
launch across nodes. This works with most MPI versions, and you can
specify a runscript to fix the rest. If you don't have InfiniBand (or a
Cray, or just maybe 10Gbit Ethernet) then multi-node runs are going to be
slow, period.

Second, for any smp build you need to specify ++ppn <threads_per_process>
or it will default to 1, which rather defeats the point of smp builds,
plus you have one communication thread per process hanging around. (I
think Charm++ is figuring 16 worker + 16 communication = 32 threads per
physical node.) For a 16-core node you will want at most ++ppn 15 to
leave a core free for the communication thread, and then for 8 nodes
+p120. You can have multiple GPUs per process, but you do not want more
processes than GPUs since CUDA does not share GPUs between processes well.
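
To make the arithmetic above concrete, here is a rough sketch of how the
++ppn and +p values could be derived. The function names and the
processes_per_node parameter are hypothetical, not part of NAMD or Charm++.
It assumes each process gets an equal share of a node's cores and gives up
one core of that share to its communication thread. It only builds the two
flags; where they go on your launch line depends on your build and launcher.

```python
def smp_launch_flags(nodes, cores_per_node, processes_per_node=1):
    """Return (ppn, total_workers) for an smp run.

    Assumption: cores are split evenly among the processes on a node, and
    each process leaves one core free for its communication thread.
    """
    cores_per_process = cores_per_node // processes_per_node
    if cores_per_process < 2:
        raise ValueError(
            "need at least 2 cores per process (1 worker + 1 communication thread)"
        )
    ppn = cores_per_process - 1                       # worker threads per process
    total_workers = nodes * processes_per_node * ppn  # value to pass as +p
    return ppn, total_workers


def format_flags(ppn, total_workers):
    """Format the two values as the ++ppn and +p flags."""
    return f"++ppn {ppn} +p{total_workers}"


if __name__ == "__main__":
    # The case from this thread: 8 nodes, 16 cores each, one process per node.
    ppn, p = smp_launch_flags(nodes=8, cores_per_node=16)
    print(format_flags(ppn, p))  # ++ppn 15 +p120
```

Under the same assumption, a node with two GPUs and one process per GPU
would come out as smp_launch_flags(8, 16, processes_per_node=2), i.e.
++ppn 7 and +p112, since each process then needs its own free core.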

These types of issues come up often enough that I'm thinking of making
MPI-CUDA error out at build time and also raising a fatal error when an
smp build is run with a single thread per node.

As for why phase 1 is slow, it uses a communication idiom that defeats the
sleep-on-idle behavior needed to cope with oversubscribed cores. I've
just added a fix that helps this some, but it's still not great.

Jim

On Mon, 12 Jan 2015, Ryan Gordon wrote:

> I am having some trouble running NAMD 2.10 for Linux-x86_64-MPI-smp-CUDA. I am running on 128 processors, 128 nodes, and 8 physical nodes. The warning I am getting at the beginning is as follows:
>
> Charm++> Warning: the number of SMP threads (32) is greater than the number of physical cores (16), so threads will sleep while idling. Use +CmiSpinOnIdle or +CmiSleepOnIdle to control this directly.
>
> I am not sure how to address this issue, and it seems to take a long time for "startup phase 1" to run compared to the other startup phases. Has anyone else had similar problems?
>
> --
> Ryan Gordon
> Ph.D. Candidate
> Chemical and Biological Engineering
> Drexel University
> CBEGSA Vice President
>
