Re: question to run namd2 with multiple CPUs

From: Yu, Tao (tao.yu.1_at_und.edu)
Date: Mon Mar 02 2020 - 15:08:04 CST

Josh,

Thank you so much!

After loading the mpi and openmpi modules, we tried using

/share/apps/namd/namd2.13/charmrun +p16 ++mpiexec ++remote-shell mpirun /share/apps/namd/namd2.13/namd2 pc.conf

but it still did not work.

Here is the error we got in the output file.

Info: *****************************
Info: Reading from binary file 300.restart.coor
Info:
Info: Entering startup at 8.05077 s, 222.184 MB of memory in use
Info: Startup phase 0 took 0.000228167 s, 222.184 MB of memory in use
[0] wc[0] status 9 wc[i].opcode 0
[0] Stack Traceback:
  [0:0] [0x145023d]
  [0:1] [0x143c4fd]
  [0:2] [0x1462fb2]
  [0:3] [0x1458f82]
  [0:4] [0x53a5e0]
  [0:5] [0xcfa97e]
  [0:6] [0xed3ca0]
  [0:7] [0xd77f12]
  [0:8] [0xd77581]
  [0:9] [0x127f077]
  [0:10] [0x1459735]
  [0:11] [0xea5fcf]
  [0:12] TclInvokeStringCommand+0x88 [0x14b5338]
  [0:13] [0x14b7f57]
  [0:14] [0x14b9372]
  [0:15] [0x14b9b96]
  [0:16] [0x151bd41]
  [0:17] [0x151befe]
  [0:18] [0xe99ad0]
  [0:19] [0x517421]
  [0:20] __libc_start_main+0xf5 [0x2aaaabbc8545]
  [0:21] _ZNSt8ios_base4InitD1Ev+0x61 [0x40fc69]
"pc.log" 186L, 6779C 186,3 Bot

I think the issue is that our local cluster does not have a correctly configured environment for mpiexec.
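For reference, a quick sanity check we could run on a compute node (these are standard environment-modules/MPI commands; the exact module and launcher names on our cluster may differ):

module list                # modules currently loaded
which mpirun mpiexec       # are the launchers on the PATH at all?
mpirun --version           # which MPI implementation they come from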

Any thoughts on this?

Meanwhile, our cluster manager is planning to compile the source code. My question is: after compiling from source, should I use mpirun or charmrun to run namd2?
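To make sure I am asking the right thing, my rough understanding from the 2.13 user guide is the following (paths and core counts are just placeholders from our current install); is that correct?

# MPI build (charm++ compiled on the MPI layer): launched through the MPI launcher
mpirun -np 16 /share/apps/namd/namd2.13/namd2 pc.conf

# non-MPI build (ibverbs, ibverbs-smp): launched through charmrun
/share/apps/namd/namd2.13/charmrun +p16 /share/apps/namd/namd2.13/namd2 pc.conf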

Best,

Tao

________________________________
From: Josh Vermaas <joshua.vermaas_at_gmail.com>
Sent: Friday, February 28, 2020 4:29 PM
To: namd-l_at_ks.uiuc.edu <namd-l_at_ks.uiuc.edu>; Yu, Tao <tao.yu.1_at_und.edu>
Subject: Re: namd-l: question to run namd2 with multiple CPUs

Hi Tao,

The default mpirun with no arguments will only start 1 MPI rank, so that's at least part of the problem. The other part is that the smp builds don't actually use MPI at all, so they need to be started with charmrun. See https://www.ks.uiuc.edu/Research/namd/2.13/ug/node98.html for more information.

I *think* what you will settle on is something like this:

/share/apps/namd/namd2.13/charmrun +p16 ++mpiexec ++remote-shell mpirun /share/apps/namd/namd2.13/namd2 pc.conf
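
If ++mpiexec keeps fighting with your cluster's MPI setup, I believe charmrun can also be pointed at an explicit node list instead (hostnames below are placeholders; the nodelist format is described on the same user guide page):

# nodelist file, e.g. ./nodelist:
group main
host node001
host node002

# then launch without going through mpiexec:
/share/apps/namd/namd2.13/charmrun +p16 ++nodelist ./nodelist /share/apps/namd/namd2.13/namd2 pc.conf

With the smp build you may also want to experiment with ++ppn to control how many worker threads each process gets.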

Josh

On Fri, Feb 28, 2020, 5:17 PM Yu, Tao <tao.yu.1_at_und.edu<mailto:tao.yu.1_at_und.edu>> wrote:
Hi,

Our university just installed the Linux-x86_64-ibverbs-smp build<https://www.ks.uiuc.edu/Development/Download/download.cgi?UserID=&AccessCode=&ArchiveID=1546> (InfiniBand plus shared memory, no MPI needed) on our local cluster. There is no problem running on a single core, but when I started testing with 16 cores, the output indicated it was still using only 1 CPU.

I tried submitting the job both with and without mpirun, but the result is the same: only 1 CPU was used.

Please give me some help here.

The script I was using is attached below:

#!/bin/bash
#####Number of nodes
#SBATCH --nodes=1
#SBATCH --partition=talon
#SBATCH --ntasks-per-node=16
#SBATCH --workdir=.
#####SBATCH -o slurm_run_%j_output.txt
#####SBATCH -e slurm_run_%j_error.txt
#SBATCH -s
#SBATCH --time=00:10:00

cd $SLURM_SUBMIT_DIR
srun -n $SLURM_NTASKS hostname | sort -u > $SLURM_JOB_ID.hosts

module load intel/mpi/64

 mpirun /share/apps/namd/namd2.13/namd2 pc.conf > pc.log
**************************************************************************

The output, which indicates only 1 CPU was being used:

ENERGY: 0 482.9967 1279.0637 1052.3094 78.2784 -71938.7727 7732.6846 0.0000 0.0000 11869.2504 -49444.1894 298.3828 -61313.4398 -49399.0775 298.3828 79.3838 81.2849 193787.4555 79.3838 81.2849

OPENING EXTENDED SYSTEM TRAJECTORY FILE
LDB: ============= START OF LOAD BALANCING ============== 2.06838
LDB: ============== END OF LOAD BALANCING =============== 2.06856
LDB: =============== DONE WITH MIGRATION ================ 2.06971
LDB: ============= START OF LOAD BALANCING ============== 8.05131
LDB: ============== END OF LOAD BALANCING =============== 8.05246
LDB: =============== DONE WITH MIGRATION ================ 8.05289
Info: Initial time: 1 CPUs 0.150798 s/step 0.872672 days/ns 236.934 MB memory
LDB: ============= START OF LOAD BALANCING ============== 9.55101
LDB: ============== END OF LOAD BALANCING =============== 9.55108
LDB: =============== DONE WITH MIGRATION ================ 9.5515
LDB: ============= START OF LOAD BALANCING ============== 15.3206
LDB: ============== END OF LOAD BALANCING =============== 15.3217
LDB: =============== DONE WITH MIGRATION ================ 15.3221
Info: Initial time: 1 CPUs 0.144713 s/step 0.837461 days/ns 237.516 MB memory
LDB: ============= START OF LOAD BALANCING ============== 22.4754
LDB: ============== END OF LOAD BALANCING =============== 22.4766
LDB: =============== DONE WITH MIGRATION ================ 22.477
Info: Initial time: 1 CPUs 0.143573 s/step 0.830862 days/ns 237.516 MB memory

This archive was generated by hypermail 2.1.6 : Fri Dec 31 2021 - 23:17:08 CST