NAMD Wiki: NamdOnSLURM

Please note that what is below is untested and may be wrong.

I googled a bit on using "srun". In particular, this document (http://www.rcac.purdue.edu/userinfo/resources/common/run/slurm/examples/slurm_examples.cfm) helped me understand the command you mentioned in the email.

First, for MPI builds, the command "srun -N5 -n80 ./namd2 namd.in > log.out" means you use 5 nodes (specified by "-N") and launch 80 tasks in total (specified by "-n"), i.e. 80 MPI ranks. Since the cluster has 16 individual cpus (cores) on each node, the job scheduler will use all 80 cores across the 5 nodes to run your MPI job. So this command is correct when you are running a non-SMP MPI build.
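
As a concrete (untested) sketch, this non-SMP MPI case could be wrapped in a SLURM batch script like the one below; the job name and time limit are placeholders, and the 16-core node size is taken from the example above.

  #!/bin/bash
  #SBATCH --job-name=namd-mpi      # placeholder job name
  #SBATCH --nodes=5                # same nodes as "srun -N5"
  #SBATCH --ntasks=80              # 80 MPI ranks, one per core (16 cores x 5 nodes)
  #SBATCH --time=01:00:00          # placeholder wall-clock limit

  # Non-SMP MPI build: every core runs its own MPI rank.
  srun -N5 -n80 ./namd2 namd.in > log.out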

However, for Charm SMP MPI jobs (the hybrid mode mentioned in the document I read), the command needs to change to something like "srun -N5 -n5 ./namd2 ++ppn 15 namd.in > log.out". This command means the job uses 5 nodes and launches 1 task on each node. Within each task (i.e. on each node), the Charm program spawns 15 worker threads (specified by "++ppn 15", an argument consumed by the Charm runtime system); you can think of them as 15 processors, and together they form one Charm SMP node. The program will therefore report that it is running on 5*15=75 processors. The reason we specify 15 worker threads instead of 16 is that in SMP mode the 15 threads occupy 15 cores, and the remaining core on each physical cluster node is left free for the communication thread and for absorbing system noise.
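
Under the same assumptions (16-core nodes, placeholder directives), a batch-script sketch for this SMP MPI case would launch one task per node and let the Charm runtime spawn the worker threads:

  #!/bin/bash
  #SBATCH --job-name=namd-smp-mpi  # placeholder job name
  #SBATCH --nodes=5                # 5 physical nodes
  #SBATCH --ntasks-per-node=1      # one Charm SMP process per node
  #SBATCH --cpus-per-task=16       # give that process the whole 16-core node
  #SBATCH --time=01:00:00          # placeholder wall-clock limit

  # SMP MPI build: 15 worker threads per process, 1 core left free
  # for the communication thread.
  srun -N5 -n5 ./namd2 ++ppn 15 namd.in > log.out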

Secondly, for non-SMP net-* builds, the command "srun -N5 -c80 charmrun +p80 ./namd2 namd2.in" means you use 5 nodes and 80 cores ("-c" presumably requests the cores); the "+p80" is consumed by charmrun and indicates that 80 worker processes will be created. NAMD will report running on 80 processors, as indicated by "+p80".
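
For the net-* builds, charmrun may also need to be told which hosts it can use. The following is only a sketch of one common approach, not part of the recipe above; it assumes charmrun's standard "++nodelist" file format and password-less ssh between the allocated nodes, and it builds the list inside the batch job from SLURM's host list:

  # Build a Charm++ nodelist from the hosts SLURM allocated to this job.
  echo "group main" > nodelist.$SLURM_JOB_ID
  scontrol show hostnames "$SLURM_JOB_NODELIST" | sed 's/^/host /' >> nodelist.$SLURM_JOB_ID

  # Non-SMP net-* build: 80 worker processes spread over the listed hosts.
  ./charmrun ++nodelist nodelist.$SLURM_JOB_ID +p80 ./namd2 namd2.in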

To run SMP net-* builds, the command should be changed to "srun -N5 -c80 charmrun +p75 ++ppn15 ./namd2 namd2.in". Again, this job runs on 5 nodes and 80 cores. Since we use 15 worker cores per Charm SMP node (as indicated by "++ppn15"), we set "+p75" as the total number of worker processors (15*5=75), and NAMD will report running on 75 processors.
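
To keep the arithmetic straight, the SMP net-* command can be written with the numbers factored out into shell variables; this is only a sketch, with the srun and charmrun options taken verbatim from the command above:

  # SMP net-* build: one communication thread per Charm SMP node,
  # so total worker count (+p) = nodes * ++ppn.
  NODES=5
  CORES_PER_NODE=16
  PPN=15                                      # 15 workers + 1 comm thread = 16 cores
  P=$(( NODES * PPN ))                        # 5 * 15 = 75
  TOTAL_CORES=$(( NODES * CORES_PER_NODE ))   # 5 * 16 = 80

  srun -N$NODES -c$TOTAL_CORES charmrun +p$P ++ppn$PPN ./namd2 namd2.in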

As Jim Phillips mentioned in his email, when you scale NAMD up you may want to use fewer worker cores per Charm SMP node (i.e. reduce the value of the "++ppn" argument) to get better performance and scalability. For example, "srun -N64 -c1024 charmrun +p896 ++ppn7 ./namd2 namd2.in" runs NAMD on 64 nodes with 7 worker cores per Charm SMP node. Since each physical cluster node has 16 cores, there will be 2 Charm SMP nodes on each physical node, i.e. 14 of the 16 cores per physical node are used as workers, and 64*14=896 gives the value of the "+p" argument.
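
The bookkeeping for that scaled-up run, written out step by step (a sketch only; the numbers are copied from the command above):

  # 64 physical nodes, 16 cores each, 2 Charm SMP nodes per physical node:
  #   each Charm SMP node   = 7 workers (++ppn7) + 1 comm thread = 8 cores
  #   workers per phys node = 2 * 7  = 14
  #   total workers (+p)    = 64 * 14 = 896
  #   total cores requested = 64 * 16 = 1024  (hence srun -c1024)
  srun -N64 -c1024 charmrun +p896 ++ppn7 ./namd2 namd2.in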

I hope this will clarify how to run charm programs on your clusters.

-Chao Mei