Re: NAMD jobs in SLURM environment, not entering queueing system

From: Prathit Chatterjee (pc20apr_at_yahoo.co.in)
Date: Thu Jul 01 2021 - 10:41:31 CDT

Dear Dr. Vermaas and Dr. Hafner,

Thank you for the feedback.

I enquired with the CHARMM-GUI team: the NAMD PACE source currently cannot be compiled with CUDA. Consequently, the NAMD startup does not print a message like the one in your previous email; it reports the following instead:

Charm++: standalone mode (not using charmrun)
Converse/Charm++ Commit ID: v6.5.0-beta1-293-gd148fb7
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (40-way SMP).
Charm++> cpu topology info is gathered in 0.001 seconds.
Info: NAMD 2.9 for Linux-x86_64-multicore
Info:
Info: Please visit http://www.ks.uiuc.edu/Research/namd/
Info: for updates, documentation, and support information.
Info:
Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005)
Info: in all publications reporting results obtained with NAMD.
Info:
Info: Based on Charm++/Converse 60500 for multicore-linux64
Info: Built Tue May 25 19:00:30 KST 2021 by Prathit on master
Info: 1 NAMD  2.9  Linux-x86_64-multicore  1    gpu1  Prathit
Info: Running on 1 processors, 1 nodes, 1 physical nodes.
Info: CPU topology information available.
Info: Charm++/Converse parallel runtime startup completed at 0.00398183 s
Info: 34.4961 MB of memory in use based on /proc/self/stat
Info: Configuration file is step7_run.inp
Info: Working in the current directory /home2/Prathit/APP/PACE-CG/APP-Gamma_1000/charmm-gui-2444606374/namd
TCL: Suspending until startup complete.

Instead, I will have to try whether I can run the required simulations with multiple processes.
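
For reference, the CPU-only launch I intend to test looks roughly like the following (a sketch only; the core count of 8 is my assumption and has to match the cores requested from SLURM):

# Sketch: CPU-only multicore run of the PACE build (no CUDA support).
# Assumes the job script requests matching cores, e.g. "#SBATCH --cpus-per-task=8".
set NPROCS = 8
/home2/Prathit/apps/NAMD_PACE_Source/Linux-x86_64-g++/namd2 +p${NPROCS} +idlepoll step7_run.inp > step7_run.out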

Thanks a lot anyway. Kindly let me know if you have any more related information.

Sincere Regards,
Prathit

On Thursday, 1 July, 2021, 09:55:36 pm GMT+9, Vermaas, Josh <vermaasj_at_msu.edu> wrote:

Is the binary under  /home2/Prathit/apps/NAMD_PACE_Source/Linux-x86_64-g++/namd2 compiled with CUDA support enabled or not? On a GPU build of NAMD, you should get output like this at the very beginning of NAMD startup:

 

Charm++> cpu topology info is gathered in 0.001 seconds.

Info: Built with CUDA version 10010

Did not find +devices i,j,k,... argument, using all

Pe 0 physical rank 0 binding to CUDA device 0 on PRL-VERMAAS-WS1: 'NVIDIA Quadro RTX 8000'  Mem: 48567MB   Rev: 7.5  PCI: 0:81:0

Info: NAMD 2.14 for Linux-x86_64-multicore-CUDA

 

Note the “Info:” lines. The first says that the NAMD build was compiled with CUDA 10.1. The second “Info” line says that this is a multicore (one node) build with CUDA support. What do those lines say for you when NAMD starts?
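
A quick way to check an existing run is to grep its log for those lines (here your_output.out is just a placeholder for whatever log file your job wrote):

# Look for the CUDA markers in the NAMD startup output.
grep -E "Built with CUDA|NAMD 2" your_output.out
# A CUDA build prints "Info: Built with CUDA version ..." and a platform line
# ending in "-CUDA" (e.g. Linux-x86_64-multicore-CUDA); a CPU-only build shows
# only "Linux-x86_64-multicore".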

 

-Josh

 

From: <owner-namd-l_at_ks.uiuc.edu> on behalf of Prathit Chatterjee <pc20apr_at_REMOVE_yahoo.co.in>
Reply-To: "namd-l_at_ks.uiuc.edu" <namd-l_at_ks.uiuc.edu>, Prathit Chatterjee <pc20apr_at_yahoo.co.in>
Date: Thursday, July 1, 2021 at 7:54 AM
To: "namd-l_at_ks.uiuc.edu" <namd-l_at_ks.uiuc.edu>, René Hafner TUK <hamburge_at_physik.uni-kl.de>, Natalia Ostrowska <n.ostrowska_at_cent.uw.edu.pl>
Subject: Re: namd-l: NAMD jobs in SLURM environment, not entering queueing system

 

Dear Experts,

Just for your information, and so that you can give proper suggestions, I am sharing a few more details.

Apart from compiling NAMD with CUDA, I have tried to work around the problem as follows.

 

I am pasting part of my submission script below:

 

module load compiler/gcc-7.5.0 cuda/11.2  mpi/openmpi-4.0.2-gcc-7
echo "SLURM_NODELIST $SLURM_NODELIST"
echo "NUMBER OF CORES $SLURM_NTASKS"
echo "CUDA_VISIBLE_DEVICES=$SLURM_NODELIST"
export PATH=/home2/Prathit/apps/NAMD_PACE_source/Linux-x86_64-g++:${PATH}
...

...

....

/home2/Prathit/apps/NAMD_PACE_Source/Linux-x86_64-g++/namd2 +p${SLURM_NTASKS_PER_NODE} +idlepoll ${prod_step}_run.inp > ${outputname}.out

 

Nevertheless, the error remains...

The job is visible in my submitted jobs list as follows:

(base) [Prathit_at_master]~/APP/PACE-CG/APP-Gamma_1000/charmm-gui-2444606374/namd>sq

Thu Jul  1 20:11:40 2021
             JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)

            326924       gpu PCCG1000  Prathit  RUNNING    2:03:10 3-00:00:00      1 gpu1
            326891     g3090 2000-APP  Prathit  RUNNING    5:54:45 3-00:00:00      1 gpu6
            326890     g3090 1500-APP  Prathit  RUNNING    5:57:55 3-00:00:00      1 gpu6
--------
Also, after logging into the GPU node, the job shows as running under the "top" command:

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                   
 36566 exay      20   0   18.9g   2.3g 307624 S 120.9  1.5   6368:00 python                                                                                    
 49595 junsu     20   0   27.2g   3.4g 323256 R 100.7  2.1   2162:55 python                                                                                    
 49633 junsu     20   0   19.9g   3.5g 323120 R 100.3  2.2   2010:20 python                                                                                    
 65081 Prathit   20   0 3514556   1.7g   5600 R 100.3  1.1 127:02.83 namd2                                                                                     
 49453 junsu     20   0   27.2g   2.3g 323252 R 100.0  1.5   1908:17 python                                                                                    
 49502 junsu     20   0   30.9g   2.2g 323248 R 100.0  1.4   2008:01 python             

--------
Yet the job is not visible in the GPU process list reported by the nvidia-smi command:

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     36566      C   python                           9351MiB |
|    1   N/A  N/A     49453      C   ...nvs/conda_lgbm/bin/python     2413MiB |
|    2   N/A  N/A     49502      C   ...nvs/conda_lgbm/bin/python     2135MiB |
|    4   N/A  N/A     49595      C   ...nvs/conda_lgbm/bin/python     2939MiB |
|    5   N/A  N/A     49633      C   ...nvs/conda_lgbm/bin/python     2541MiB |
+-----------------------------------------------------------------------------+
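
For completeness, the same information can be queried in a more compact form; as the table above already shows, the namd2 PID (65081) is not among the GPU compute processes:

# List GPU compute processes only (compact alternative to the full nvidia-smi table).
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
# A CUDA-enabled namd2 would appear here with its PID; a CPU-only build never does.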

 

As a consequence, the job also runs very slowly.

 

Any further suggestions on how I can run a job with the compiled NAMD_PACE build properly through the queueing system would be greatly helpful.
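
For reference, if a CUDA-capable build of the PACE code becomes possible, the launch line I am aiming for would be roughly the following (a sketch only; the "+devices 0" index is my assumption and relies on SLURM exporting CUDA_VISIBLE_DEVICES for the granted GPU):

# Sketch: launch line for a hypothetical CUDA-enabled namd2 under SLURM.
# With "#SBATCH --gres=gpu:1", SLURM typically restricts CUDA_VISIBLE_DEVICES
# to the granted card, so device index 0 refers to that card inside the job.
/home2/Prathit/apps/NAMD_PACE_Source/Linux-x86_64-g++/namd2 \
    +p${SLURM_NTASKS_PER_NODE} +idlepoll +devices 0 \
    ${prod_step}_run.inp > ${outputname}.out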

Apologies for any inconvenience on my part,
Sincerely,
Prathit

On Monday, 28 June, 2021, 07:12:32 pm GMT+9, Natalia Ostrowska <n.ostrowska_at_cent.uw.edu.pl> wrote:

Maybe SLURM wants NAMD to be located somewhere else, i.e. not in your home folder? Ask your IT department; they will probably want to install it themselves.

Natalia Ostrowska
University of Warsaw, Poland
Centre of New Technologies
Biomolecular Machines Laboratory

On Mon, 28 Jun 2021 at 11:28, René Hafner TUK <hamburge_at_physik.uni-kl.de> wrote:

>
> I just understood that you have a special version there.
>
> You probably need to (re-)compile your adapted NAMD PACE Source with CUDA support first.
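>
> For what it's worth, for stock NAMD 2.x the CUDA build is usually configured roughly as below (a sketch only; whether the PACE-modified source accepts the same flags is exactly the open question, and the CUDA prefix path is just an example):
>
> # Sketch: typical CUDA build of stock NAMD 2.x from source (paths are examples).
> ./config Linux-x86_64-g++ --charm-arch multicore-linux64 --with-cuda --cuda-prefix /usr/local/cuda
> cd Linux-x86_64-g++
> make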
>
>  
> On 6/28/2021 11:03 AM, René Hafner TUK wrote:
>
>> Hi,
>>
>>     Did you actually use a GPU version of NAMD?
>>
>>     You should see this in the logfile.
>>
>>     If you rely on single-node GPU runs, the precompiled CUDA binaries should be sufficient.
>>
>>     And do add `+p${SLURM_NTASKS_PER_NODE} +idlepoll` to the namd exec line below for faster execution.
>>
>> Kind regards
>>
>> René
>>
>>  
>> On 6/28/2021 10:54 AM, Prathit Chatterjee wrote:
>>
>>> Dear Experts,
>>>
>>> This is regarding GPU job submission with NAMD, compiled specifically for the PACE CG force field with CHARMM-GUI, in a SLURM environment.
>>>
>>> Kindly see my submit script below:
>>>
>>> #!/bin/csh
>>> #
>>> #SBATCH -J PCCG2000
>>> #SBATCH -N 1
>>> #SBATCH -n 1
>>> #SBATCH -p g3090 # Using a 3090 node
>>> #SBATCH --gres=gpu:1    # Number of GPUs (per node)
>>> #SBATCH -o output.log
>>> #SBATCH -e output.err
>>>
>>> # Generated by CHARMM-GUI (http://www.charmm-gui.org) v3.5
>>> #
>>> # The following shell script assumes your NAMD executable is namd2 and that
>>> # the NAMD inputs are located in the current directory.
>>> #
>>> # Only one processor is used below. To parallelize NAMD, use this scheme:
>>> #     charmrun namd2 +p4 input_file.inp > output_file.out
>>> # where the "4" in "+p4" is replaced with the actual number of processors you
>>> # intend to use.
>>> module load compiler/gcc-7.5.0 cuda/11.2 mpi/openmpi-4.0.2-gcc-7
>>>
>>> echo "SLURM_NODELIST $SLURM_NODELIST"
>>> echo "NUMBER OF CORES $SLURM_NTASKS"
>>>
>>> set equi_prefix = step6.%d_equilibration
>>> set prod_prefix = step7.1_production
>>> set prod_step   = step7
>>>
>>> # Running equilibration steps
>>> set cnt    = 1
>>> set cntmax = 6
>>>
>>> while ( ${cnt} <= ${cntmax} )
>>>     set step = `printf ${equi_prefix} ${cnt}`
>>> ##    /home2/Prathit/apps/NAMD_PACE_Source/Linux-x86_64-g++/charmrun /home2/Prathit/apps/NAMD_PACE_Source/Linux-x86_64-g++/namd2 ${step}.inp > ${step}.out
>>>     /home2/Prathit/apps/NAMD_PACE_Source/Linux-x86_64-g++/namd2 ${step}.inp > ${step}.out
>>>
>>>     @ cnt += 1
>>> end
>>>
>>> ================
>>>
>>> While the jobs are getting submitted, they are not entering the queueing system: the PIDs of the jobs are invisible with the "nvidia-smi" command, but do show up with the "top" command inside the GPU node.
>>>
>>> Any suggestions on rectifying this discrepancy would be greatly helpful.
>>>
>>> Thank you and Regards,
>>> Prathit
>>>
>> --
>> Dipl.-Phys. René Hafner
>> TU Kaiserslautern
>> Germany
>
> --
> Dipl.-Phys. René Hafner
> TU Kaiserslautern
> Germany
>
