Re: NAMD jobs in SLURM environment, not entering queueing system

From: Prathit Chatterjee (pc20apr_at_yahoo.co.in)
Date: Thu Jul 01 2021 - 06:45:44 CDT

Dear Experts,

Just for your information, and to help with further suggestions, I am sharing a few more details. Apart from compiling NAMD with CUDA, I tried to play around with the job script as follows.
I am pasting part of my submission script below:
module load compiler/gcc-7.5.0 cuda/11.2  mpi/openmpi-4.0.2-gcc-7
echo "SLURM_NODELIST $SLURM_NODELIST"
echo "NUMBER OF CORES $SLURM_NTASKS"
echo "CUDA_VISIBLE_DEVICES=$SLURM_NODELIST"
export PATH=/home2/Prathit/apps/NAMD_PACE_source/Linux-x86_64-g++:${PATH}
............/home2/Prathit/apps/NAMD_PACE_Source/Linux-x86_64-g++/namd2 +p${SLURM_NTASKS_PER_NODE} +idlepoll ${prod_step}_run.inp > ${outputname}.out
Nevertheless, the error remains...
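(For reference, a minimal sketch of what the exec line could look like with the SLURM-allocated GPU passed to NAMD explicitly; this assumes a CUDA-enabled namd2 build, since the +devices flag has no effect on a CPU-only binary:)

# sketch only: bind namd2 to the GPU(s) SLURM assigned to this job
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
/home2/Prathit/apps/NAMD_PACE_Source/Linux-x86_64-g++/namd2 \
    +p${SLURM_NTASKS_PER_NODE} +idlepoll \
    +devices ${CUDA_VISIBLE_DEVICES} \
    ${prod_step}_run.inp > ${outputname}.out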

The job is visible in my submitted jobs list as follows:

(base) [Prathit_at_master]~/APP/PACE-CG/APP-Gamma_1000/charmm-gui-2444606374/namd>sq

Thu Jul  1 20:11:40 2021
             JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)

            326924       gpu PCCG1000  Prathit  RUNNING    2:03:10 3-00:00:00      1 gpu1
            326891     g3090 2000-APP  Prathit  RUNNING    5:54:45 3-00:00:00      1 gpu6
            326890     g3090 1500-APP  Prathit  RUNNING    5:57:55 3-00:00:00      1 gpu6
--------
Also, the job is visible as running with the "top" command after logging into the GPU node:

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                   
 36566 exay      20   0   18.9g   2.3g 307624 S 120.9  1.5   6368:00 python                                                                                    
 49595 junsu     20   0   27.2g   3.4g 323256 R 100.7  2.1   2162:55 python                                                                                    
 49633 junsu     20   0   19.9g   3.5g 323120 R 100.3  2.2   2010:20 python                                                                                    
 65081 Prathit   20   0 3514556   1.7g   5600 R 100.3  1.1 127:02.83 namd2                                                                                     
 49453 junsu     20   0   27.2g   2.3g 323252 R 100.0  1.5   1908:17 python                                                                                    
 49502 junsu     20   0   30.9g   2.2g 323248 R 100.0  1.4   2008:01 python             

--------
Yet, the job is not visible in the process list reported by the "nvidia-smi" command:

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     36566      C   python                           9351MiB |
|    1   N/A  N/A     49453      C   ...nvs/conda_lgbm/bin/python     2413MiB |
|    2   N/A  N/A     49502      C   ...nvs/conda_lgbm/bin/python     2135MiB |
|    4   N/A  N/A     49595      C   ...nvs/conda_lgbm/bin/python     2939MiB |
|    5   N/A  N/A     49633      C   ...nvs/conda_lgbm/bin/python     2541MiB |
+-----------------------------------------------------------------------------+
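(One generic check, independent of SLURM, is whether the namd2 binary is linked against the CUDA runtime at all; if it is not, the process will never appear in nvidia-smi:)

# sketch only: a CUDA-enabled build should list libcudart (and usually libcufft)
ldd /home2/Prathit/apps/NAMD_PACE_Source/Linux-x86_64-g++/namd2 | grep -i cuda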

As a consequence, the job is also running very slowly.
Any further suggestions on how I can run a job with the compiled NAMD_PACE binary properly through the queueing system would be greatly appreciated.
Apologies for any inconvenience on my part.
Sincerely,
Prathit

On Monday, 28 June, 2021, 07:12:32 pm GMT+9, Natalia Ostrowska <n.ostrowska_at_cent.uw.edu.pl> wrote:
 
 Maybe SLURM wants NAMD to be located somewhere else, i.e., not in your home folder? Ask your IT department; they will probably want to install it themselves.

Natalia Ostrowska
University of Warsaw, Poland
Centre of New Technologies
Biomolecular Machines Laboratory

On Mon, 28 Jun 2021 at 11:28, René Hafner TUK <hamburge_at_physik.uni-kl.de> wrote:

  
I just understood that you have a special version there.
 
You probably need to (re-)compile your adapted NAMD PACE Source with CUDA support first.
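Roughly along these lines (a sketch of the usual NAMD build steps only; the charm++ version, charm arch and CUDA prefix are assumptions and may differ for the PACE-adapted source):

# inside the NAMD_PACE_Source tree
cd charm-6.10.2                      # whichever charm++ ships with this source
./build charm++ multicore-linux-x86_64 --with-production
cd ..
./config Linux-x86_64-g++ --charm-arch multicore-linux-x86_64 \
         --with-cuda --cuda-prefix /usr/local/cuda-11.2
cd Linux-x86_64-g++
make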
 
 On 6/28/2021 11:03 AM, René Hafner TUK wrote:
  
 
Hi   
 
 
    Did you actually use a GPU version of NAMD?
 
    You should see this in the logfile.
 
    If you rely on single-node GPU runs, the precompiled CUDA binaries should be sufficient.
 
    And do add `+p${SLURM_NTASKS_PER_NODE} +idlepoll` to the namd exec line below for faster execution.
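    A quick way to confirm this in the logfile (the exact wording of the CUDA lines may vary between NAMD versions):

    grep -i cuda step7.1_production.out
    # a GPU build reports the detected CUDA devices near the top of the log;
    # if nothing CUDA-related appears, namd2 is running on the CPU only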
 
 
Kind regards
 
René
 
 On 6/28/2021 10:54 AM, Prathit Chatterjee wrote:
  
  Dear Experts,
  This is regarding GPU job submission in a SLURM environment with NAMD compiled specifically for the PACE CG force field (system set up with CHARMM-GUI).
  Kindly see my submit script below:
    
#!/bin/csh
#
#SBATCH -J PCCG2000
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -p g3090 # Using a 3090 node
#SBATCH --gres=gpu:1    # Number of GPUs (per node)
#SBATCH -o output.log
#SBATCH -e output.err

# Generated by CHARMM-GUI (http://www.charmm-gui.org) v3.5
#
# The following shell script assumes your NAMD executable is namd2 and that
# the NAMD inputs are located in the current directory.
#
# Only one processor is used below. To parallelize NAMD, use this scheme:
#     charmrun namd2 +p4 input_file.inp > output_file.out
# where the "4" in "+p4" is replaced with the actual number of processors you
# intend to use.

module load compiler/gcc-7.5.0 cuda/11.2 mpi/openmpi-4.0.2-gcc-7

echo "SLURM_NODELIST $SLURM_NODELIST"
echo "NUMBER OF CORES $SLURM_NTASKS"

set equi_prefix = step6.%d_equilibration
set prod_prefix = step7.1_production
set prod_step   = step7

# Running equilibration steps
set cnt    = 1
set cntmax = 6

while ( ${cnt} <= ${cntmax} )
    set step = `printf ${equi_prefix} ${cnt}`
##    /home2/Prathit/apps/NAMD_PACE_Source/Linux-x86_64-g++/charmrun /home2/Prathit/apps/NAMD_PACE_Source/Linux-x86_64-g++/namd2 ${step}.inp > ${step}.out
    /home2/Prathit/apps/NAMD_PACE_Source/Linux-x86_64-g++/namd2 ${step}.inp > ${step}.out

    @ cnt += 1
end
  
  ================
  While the jobs are getting submitted, they do not appear to be entering the queueing system properly: the PIDs of the jobs are invisible to the "nvidia-smi" command, but show up with the "top" command inside the GPU node.
  Any suggestions for rectifying this discrepancy would be greatly appreciated.
  Thank you and Regards, Prathit
  
   
--
Dipl.-Phys. René Hafner
TU Kaiserslautern
Germany
  

This archive was generated by hypermail 2.1.6 : Fri Dec 31 2021 - 23:17:11 CST