Re: NAMD jobs in SLURM environment, not entering queueing system

From: Prathit Chatterjee (pc20apr_at_yahoo.co.in)
Date: Thu Jul 01 2021 - 10:41:31 CDT

Dear Dr. Vermaas and Dr. Hafner,

Thank you for the feedback.

I enquired with the CHARMM-GUI team: the NAMD PACE source currently cannot be compiled with CUDA. Consequently, the NAMD startup does not print a message like the one in your previous email; it reports the following instead:

Charm++: standalone mode (not using charmrun)
Converse/Charm++ Commit ID: v6.5.0-beta1-293-gd148fb7
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (40-way SMP).
Charm++> cpu topology info is gathered in 0.001 seconds.
Info: NAMD 2.9 for Linux-x86_64-multicore
Info:
Info: Please visit http://www.ks.uiuc.edu/Research/namd/
Info: for updates, documentation, and support information.
Info:
Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005)
Info: in all publications reporting results obtained with NAMD.
Info:
Info: Based on Charm++/Converse 60500 for multicore-linux64
Info: Built Tue May 25 19:00:30 KST 2021 by Prathit on master
Info: 1 NAMD  2.9  Linux-x86_64-multicore  1    gpu1  Prathit
Info: Running on 1 processors, 1 nodes, 1 physical nodes.
Info: CPU topology information available.
Info: Charm++/Converse parallel runtime startup completed at 0.00398183 s
Info: 34.4961 MB of memory in use based on /proc/self/stat
Info: Configuration file is step7_run.inp
Info: Working in the current directory /home2/Prathit/APP/PACE-CG/APP-Gamma_1000/charmm-gui-2444606374/namd
TCL: Suspending until startup complete.

Instead, I will have to try whether I can run the required simulations with multiple processes.
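
For reference, the CPU-only launch I intend to test looks roughly like the following (a sketch only; the core count of 8 is my assumption and has to match the cores requested from SLURM):

# Sketch: CPU-only multicore run of the PACE build (no CUDA support).
# Assumes the job script requests matching cores, e.g. "#SBATCH --cpus-per-task=8".
set NPROCS = 8
/home2/Prathit/apps/NAMD_PACE_Source/Linux-x86_64-g++/namd2 +p${NPROCS} +idlepoll step7_run.inp > step7_run.out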

Thanks a lot anyway. Kindly let me know if you have any more related information.

Sincere Regards,
Prathit

On Thursday, 1 July, 2021, 09:55:36 pm GMT+9, Vermaas, Josh <vermaasj_at_msu.edu> wrote:

Is the binary under  /home2/Prathit/apps/NAMD_PACE_Source/Linux-x86_64-g++/namd2 compiled with CUDA support enabled or not? On a GPU build of NAMD, you should get output like this at the very beginning of NAMD startup:

 

Charm++> cpu topology info is gathered in 0.001 seconds.

Info: Built with CUDA version 10010

Did not find +devices i,j,k,... argument, using all

Pe 0 physical rank 0 binding to CUDA device 0 on PRL-VERMAAS-WS1: 'NVIDIA Quadro RTX 8000'  Mem: 48567MB   Rev: 7.5  PCI: 0:81:0

Info: NAMD 2.14 for Linux-x86_64-multicore-CUDA

 

Note the “Info:” lines. The first says that the NAMD build was compiled with CUDA 10.1. The second “Info” line says that this is a multicore (one node) build with CUDA support. What do those lines say for you when NAMD starts?
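
A quick way to check an existing run is to grep its log for those lines (here your_output.out is just a placeholder for whatever log file your job wrote):

# Look for the CUDA markers in the NAMD startup output.
grep -E "Built with CUDA|NAMD 2" your_output.out
# A CUDA build prints "Info: Built with CUDA version ..." and a platform line
# ending in "-CUDA" (e.g. Linux-x86_64-multicore-CUDA); a CPU-only build shows
# only "Linux-x86_64-multicore".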

 

-Josh

 

From: <owner-namd-l_at_ks.uiuc.edu> on behalf of Prathit Chatterjee <pc20apr_at_REMOVE_yahoo.co.in>
Reply-To: "namd-l_at_ks.uiuc.edu" <namd-l_at_ks.uiuc.edu>, Prathit Chatterjee <pc20apr_at_yahoo.co.in>
Date: Thursday, July 1, 2021 at 7:54 AM
To: "namd-l_at_ks.uiuc.edu" <namd-l_at_ks.uiuc.edu>, René Hafner TUK <hamburge_at_physik.uni-kl.de>, Natalia Ostrowska <n.ostrowska_at_cent.uw.edu.pl>
Subject: Re: namd-l: NAMD jobs in SLURM environment, not entering queueing system

 

Dear Experts,

Just for your information, and so that you can give proper suggestions, I am sharing a few more details.

Apart from compiling NAMD with CUDA, I have tried to work around the problem as follows.

 

I am pasting part of my submission script below:

 

module load compiler/gcc-7.5.0 cuda/11.2  mpi/openmpi-4.0.2-gcc-7
echo "SLURM_NODELIST $SLURM_NODELIST"
echo "NUMBER OF CORES $SLURM_NTASKS"
echo "CUDA_VISIBLE_DEVICES=$SLURM_NODELIST"
export PATH=/home2/Prathit/apps/NAMD_PACE_source/Linux-x86_64-g++:${PATH}
...

...

....

/home2/Prathit/apps/NAMD_PACE_Source/Linux-x86_64-g++/namd2 +p${SLURM_NTASKS_PER_NODE} +idlepoll ${prod_step}_run.inp > ${outputname}.out

 

Nevertheless, the error remains...

The job is visible in my submitted jobs list as follows:

(base) [Prathit_at_master]~/APP/PACE-CG/APP-Gamma_1000/charmm-gui-2444606374/namd>sq

Thu Jul  1 20:11:40 2021
             JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)

            326924       gpu PCCG1000  Prathit  RUNNING    2:03:10 3-00:00:00      1 gpu1
            326891     g3090 2000-APP  Prathit  RUNNING    5:54:45 3-00:00:00      1 gpu6
            326890     g3090 1500-APP  Prathit  RUNNING    5:57:55 3-00:00:00      1 gpu6
--------
Also, after logging into the GPU node, the job shows as running under the "top" command:

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                   
 36566 exay      20   0   18.9g   2.3g 307624 S 120.9  1.5   6368:00 python                                                                                    
 49595 junsu     20   0   27.2g   3.4g 323256 R 100.7  2.1   2162:55 python                                                                                    
 49633 junsu     20   0   19.9g   3.5g 323120 R 100.3  2.2   2010:20 python                                                                                    
 65081 Prathit   20   0 3514556   1.7g   5600 R 100.3  1.1 127:02.83 namd2                                                                                     
 49453 junsu     20   0   27.2g   2.3g 323252 R 100.0  1.5   1908:17 python                                                                                    
 49502 junsu     20   0   30.9g   2.2g 323248 R 100.0  1.4   2008:01 python             

--------
Yet the job is not visible in the GPU process list reported by the nvidia-smi command:

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     36566      C   python                           9351MiB |
|    1   N/A  N/A     49453      C   ...nvs/conda_lgbm/bin/python     2413MiB |
|    2   N/A  N/A     49502      C   ...nvs/conda_lgbm/bin/python     2135MiB |
|    4   N/A  N/A     49595      C   ...nvs/conda_lgbm/bin/python     2939MiB |
|    5   N/A  N/A     49633      C   ...nvs/conda_lgbm/bin/python     2541MiB |
+-----------------------------------------------------------------------------+
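
For completeness, the same information can be queried in a more compact form; as the table above already shows, the namd2 PID (65081) is not among the GPU compute processes:

# List GPU compute processes only (compact alternative to the full nvidia-smi table).
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
# A CUDA-enabled namd2 would appear here with its PID; a CPU-only build never does.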

 

As a consequence, the job also runs very slowly.

 

Any further suggestions on how I can run a job with the compiled NAMD_PACE build properly through the queueing system would be greatly helpful.
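
For reference, if a CUDA-capable build of the PACE code becomes possible, the launch line I am aiming for would be roughly the following (a sketch only; the "+devices 0" index is my assumption and relies on SLURM exporting CUDA_VISIBLE_DEVICES for the granted GPU):

# Sketch: launch line for a hypothetical CUDA-enabled namd2 under SLURM.
# With "#SBATCH --gres=gpu:1", SLURM typically restricts CUDA_VISIBLE_DEVICES
# to the granted card, so device index 0 refers to that card inside the job.
/home2/Prathit/apps/NAMD_PACE_Source/Linux-x86_64-g++/namd2 \
    +p${SLURM_NTASKS_PER_NODE} +idlepoll +devices 0 \
    ${prod_step}_run.inp > ${outputname}.out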

Apologies for any inconvenience on my part,
Sincerely,
Prathit

On Monday, 28 June, 2021, 07:12:32 pm GMT+9, Natalia Ostrowska <n.ostrowska_at_cent.uw.edu.pl> wrote:

Maybe SLURM wants NAMD to be located somewhere else, i.e. not in your home folder? Ask your IT department; they will probably want to install it themselves.

Natalia Ostrowska
University of Warsaw, Poland
Centre of New Technologies
Biomolecular Machines Laboratory

On Mon, 28 Jun 2021 at 11:28, René Hafner TUK <hamburge_at_physik.uni-kl.de> wrote:

>
> I just understood that you have a special version there.
>
> You probably need to (re-)compile your adapted NAMD PACE Source with CUDA support first.
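>
> For what it's worth, for stock NAMD 2.x the CUDA build is usually configured roughly as below (a sketch only; whether the PACE-modified source accepts the same flags is exactly the open question, and the CUDA prefix path is just an example):
>
> # Sketch: typical CUDA build of stock NAMD 2.x from source (paths are examples).
> ./config Linux-x86_64-g++ --charm-arch multicore-linux64 --with-cuda --cuda-prefix /usr/local/cuda
> cd Linux-x86_64-g++
> make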
>
>  
> On 6/28/2021 11:03 AM, René Hafner TUK wrote:
>
>> Hi,
>>
>>     Did you actually use a GPU version of NAMD?
>>
>>     You should see this in the logfile.
>>
>>     If you rely on single-node GPU runs, the precompiled CUDA binaries should be sufficient.
>>
>>     And do add `+p${SLURM_NTASKS_PER_NODE} +idlepoll` to the namd exec line below for faster execution.
>>
>> Kind regards
>>
>> René
>>
>>  
>> On 6/28/2021 10:54 AM, Prathit Chatterjee wrote:
>>
>>> Dear Experts,
>>>
>>> This is regarding GPU job submission with NAMD, compiled specifically for the PACE CG force field with CHARMM-GUI, in a SLURM environment.
>>>
>>> Kindly see my submit script below:
>>>
>>> #!/bin/csh
>>> #
>>> #SBATCH -J PCCG2000
>>> #SBATCH -N 1
>>> #SBATCH -n 1
>>> #SBATCH -p g3090 # Using a 3090 node
>>> #SBATCH --gres=gpu:1    # Number of GPUs (per node)
>>> #SBATCH -o output.log
>>> #SBATCH -e output.err
>>>
>>> # Generated by CHARMM-GUI (http://www.charmm-gui.org) v3.5
>>> #
>>> # The following shell script assumes your NAMD executable is namd2 and that
>>> # the NAMD inputs are located in the current directory.
>>> #
>>> # Only one processor is used below. To parallelize NAMD, use this scheme:
>>> #     charmrun namd2 +p4 input_file.inp > output_file.out
>>> # where the "4" in "+p4" is replaced with the actual number of processors you
>>> # intend to use.
>>> module load compiler/gcc-7.5.0 cuda/11.2 mpi/openmpi-4.0.2-gcc-7
>>>
>>> echo "SLURM_NODELIST $SLURM_NODELIST"
>>> echo "NUMBER OF CORES $SLURM_NTASKS"
>>>
>>> set equi_prefix = step6.%d_equilibration
>>> set prod_prefix = step7.1_production
>>> set prod_step   = step7
>>>
>>> # Running equilibration steps
>>> set cnt    = 1
>>> set cntmax = 6
>>>
>>> while ( ${cnt} <= ${cntmax} )
>>>     set step = `printf ${equi_prefix} ${cnt}`
>>> ##    /home2/Prathit/apps/NAMD_PACE_Source/Linux-x86_64-g++/charmrun /home2/Prathit/apps/NAMD_PACE_Source/Linux-x86_64-g++/namd2 ${step}.inp > ${step}.out
>>>     /home2/Prathit/apps/NAMD_PACE_Source/Linux-x86_64-g++/namd2 ${step}.inp > ${step}.out
>>>
>>>     @ cnt += 1
>>> end
>>>
>>> ================
>>>
>>> While the jobs are getting submitted, they are not entering the queueing system: the PIDs of the jobs are invisible with the "nvidia-smi" command, but do show up with the "top" command inside the GPU node.
>>>
>>> Any suggestions on rectifying this discrepancy would be greatly helpful.
>>>
>>> Thank you and Regards,
>>> Prathit
>>>
>> --
>> Dipl.-Phys. René Hafner
>> TU Kaiserslautern
>> Germany
>
> --
> Dipl.-Phys. René Hafner
> TU Kaiserslautern
> Germany
>
