Re: NAMD jobs in SLURM environment, not entering queueing system

From: Prathit Chatterjee (pc20apr_at_yahoo.co.in)
Date: Thu Jul 01 2021 - 06:45:44 CDT

Dear Experts,

Just for your information, and to help with further suggestions, I am sharing a few more details. Apart from compiling NAMD with CUDA, I tried to play around with the job script as follows.
I am pasting part of my submission script below:
module load compiler/gcc-7.5.0 cuda/11.2  mpi/openmpi-4.0.2-gcc-7
echo "SLURM_NODELIST $SLURM_NODELIST"
echo "NUMBER OF CORES $SLURM_NTASKS"
echo "CUDA_VISIBLE_DEVICES=$SLURM_NODELIST"
export PATH=/home2/Prathit/apps/NAMD_PACE_source/Linux-x86_64-g++:${PATH}
............/home2/Prathit/apps/NAMD_PACE_Source/Linux-x86_64-g++/namd2 +p${SLURM_NTASKS_PER_NODE} +idlepoll ${prod_step}_run.inp > ${outputname}.out
Nevertheless, the error remains...
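(For reference, a minimal sketch of what the exec line could look like with the SLURM-allocated GPU passed to NAMD explicitly; this assumes a CUDA-enabled namd2 build, since the +devices flag has no effect on a CPU-only binary:)

# sketch only: bind namd2 to the GPU(s) SLURM assigned to this job
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
/home2/Prathit/apps/NAMD_PACE_Source/Linux-x86_64-g++/namd2 \
    +p${SLURM_NTASKS_PER_NODE} +idlepoll \
    +devices ${CUDA_VISIBLE_DEVICES} \
    ${prod_step}_run.inp > ${outputname}.out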

The job is visible in my submitted jobs list as follows:

(base) [Prathit_at_master]~/APP/PACE-CG/APP-Gamma_1000/charmm-gui-2444606374/namd>sq

Thu Jul  1 20:11:40 2021
             JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)

            326924       gpu PCCG1000  Prathit  RUNNING    2:03:10 3-00:00:00      1 gpu1
            326891     g3090 2000-APP  Prathit  RUNNING    5:54:45 3-00:00:00      1 gpu6
            326890     g3090 1500-APP  Prathit  RUNNING    5:57:55 3-00:00:00      1 gpu6
--------
Also, the job is visible as running with the "top" command after logging into the GPU node:

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                   
 36566 exay      20   0   18.9g   2.3g 307624 S 120.9  1.5   6368:00 python                                                                                    
 49595 junsu     20   0   27.2g   3.4g 323256 R 100.7  2.1   2162:55 python                                                                                    
 49633 junsu     20   0   19.9g   3.5g 323120 R 100.3  2.2   2010:20 python                                                                                    
 65081 Prathit   20   0 3514556   1.7g   5600 R 100.3  1.1 127:02.83 namd2                                                                                     
 49453 junsu     20   0   27.2g   2.3g 323252 R 100.0  1.5   1908:17 python                                                                                    
 49502 junsu     20   0   30.9g   2.2g 323248 R 100.0  1.4   2008:01 python             

--------
Yet, the job is not visible in the process list reported by the "nvidia-smi" command:

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     36566      C   python                           9351MiB |
|    1   N/A  N/A     49453      C   ...nvs/conda_lgbm/bin/python     2413MiB |
|    2   N/A  N/A     49502      C   ...nvs/conda_lgbm/bin/python     2135MiB |
|    4   N/A  N/A     49595      C   ...nvs/conda_lgbm/bin/python     2939MiB |
|    5   N/A  N/A     49633      C   ...nvs/conda_lgbm/bin/python     2541MiB |
+-----------------------------------------------------------------------------+
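(One generic check, independent of SLURM, is whether the namd2 binary is linked against the CUDA runtime at all; if it is not, the process will never appear in nvidia-smi:)

# sketch only: a CUDA-enabled build should list libcudart (and usually libcufft)
ldd /home2/Prathit/apps/NAMD_PACE_Source/Linux-x86_64-g++/namd2 | grep -i cuda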

As a consequence, the job is also running very slowly.
Any further suggestions on how I can run a job with the compiled NAMD_PACE binary properly through the queueing system would be greatly appreciated.
Apologies for any inconvenience on my part.
Sincerely,
Prathit

On Monday, 28 June, 2021, 07:12:32 pm GMT+9, Natalia Ostrowska <n.ostrowska_at_cent.uw.edu.pl> wrote:
 
 Maybe SLURM wants NAMD to be located somewhere else, i.e., not in your home folder? Ask your IT department; they will probably want to install it themselves.

Natalia Ostrowska
University of Warsaw, Poland
Centre of New Technologies
Biomolecular Machines Laboratory

On Mon, 28 Jun 2021 at 11:28, René Hafner TUK <hamburge_at_physik.uni-kl.de> wrote:

  
I just understood that you have a special version there.
 
You probably need to (re-)compile your adapted NAMD PACE Source with CUDA support first.
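Roughly along these lines (a sketch of the usual NAMD build steps only; the charm++ version, charm arch and CUDA prefix are assumptions and may differ for the PACE-adapted source):

# inside the NAMD_PACE_Source tree
cd charm-6.10.2                      # whichever charm++ ships with this source
./build charm++ multicore-linux-x86_64 --with-production
cd ..
./config Linux-x86_64-g++ --charm-arch multicore-linux-x86_64 \
         --with-cuda --cuda-prefix /usr/local/cuda-11.2
cd Linux-x86_64-g++
make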
 
 On 6/28/2021 11:03 AM, René Hafner TUK wrote:
  
 
Hi   
 
 
    Did you actually use a GPU version of NAMD?
 
    You should see this in the logfile.
 
    If you rely on single-node GPU runs, the precompiled CUDA binaries should be sufficient.
 
    And do add `+p${SLURM_NTASKS_PER_NODE} +idlepoll` to the namd exec line below for faster execution.
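    A quick way to confirm this in the logfile (the exact wording of the CUDA lines may vary between NAMD versions):

    grep -i cuda step7.1_production.out
    # a GPU build reports the detected CUDA devices near the top of the log;
    # if nothing CUDA-related appears, namd2 is running on the CPU only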
 
 
Kind regards
 
René
 
 On 6/28/2021 10:54 AM, Prathit Chatterjee wrote:
  
  Dear Experts,
  This is regarding GPU job submission in a SLURM environment with NAMD compiled specifically for the PACE CG force field (system set up with CHARMM-GUI).
  Kindly see my submit script below:
    
#!/bin/csh
#
#SBATCH -J PCCG2000
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -p g3090 # Using a 3090 node
#SBATCH --gres=gpu:1    # Number of GPUs (per node)
#SBATCH -o output.log
#SBATCH -e output.err

# Generated by CHARMM-GUI (http://www.charmm-gui.org) v3.5
#
# The following shell script assumes your NAMD executable is namd2 and that
# the NAMD inputs are located in the current directory.
#
# Only one processor is used below. To parallelize NAMD, use this scheme:
#     charmrun namd2 +p4 input_file.inp > output_file.out
# where the "4" in "+p4" is replaced with the actual number of processors you
# intend to use.

module load compiler/gcc-7.5.0 cuda/11.2 mpi/openmpi-4.0.2-gcc-7

echo "SLURM_NODELIST $SLURM_NODELIST"
echo "NUMBER OF CORES $SLURM_NTASKS"

set equi_prefix = step6.%d_equilibration
set prod_prefix = step7.1_production
set prod_step   = step7

# Running equilibration steps
set cnt    = 1
set cntmax = 6

while ( ${cnt} <= ${cntmax} )
    set step = `printf ${equi_prefix} ${cnt}`
##    /home2/Prathit/apps/NAMD_PACE_Source/Linux-x86_64-g++/charmrun /home2/Prathit/apps/NAMD_PACE_Source/Linux-x86_64-g++/namd2 ${step}.inp > ${step}.out
    /home2/Prathit/apps/NAMD_PACE_Source/Linux-x86_64-g++/namd2 ${step}.inp > ${step}.out

    @ cnt += 1
end
  
  ================
  While the jobs are getting submitted, they do not appear to be entering the queueing system properly: the PIDs of the jobs are invisible to the "nvidia-smi" command, but show up with the "top" command inside the GPU node.
  Any suggestions for rectifying this discrepancy would be greatly appreciated.
  Thank you and Regards, Prathit
  
   
--
Dipl.-Phys. René Hafner
TU Kaiserslautern
Germany
  

This archive was generated by hypermail 2.1.6 : Fri Dec 31 2021 - 23:17:11 CST