Re: AW: REMD on HPC

From: HOUSTON Douglas (DouglasR.Houston_at_ed.ac.uk)
Date: Tue Mar 07 2017 - 02:33:29 CST


Thanks for the suggestions. I tried both, with 'set num_replicas 2' in fold_alanin.conf:

mpirun -np 32 ../../../namd2 +replicas 2 job0.conf +stdout output/%d/job0.%d.log

This resulted in the following error:

Charm++ fatal error:
Number of partitions does not evenly divide number of processes. Aborting

I also tried:

./../../charmrun ++verbose +p16 ++nodelist nodelist.txt ++mpiexec /exports/applications/apps/SL7/openmpi/1.6.5/bin/mpirun -np 2 namd2 +replicas 2 job0.conf +stdout output/%d/job0.%d.log

Which resulted in:

Charmrun> charmrun started...
Charmrun> mpiexec started
Charmrun> node programs all started
[node1b11.ecdf.ed.ac.uk:10204] [[6030,1],0] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file rml_oob_send.c at line 104
[node1b11.ecdf.ed.ac.uk:10204] [[6030,1],0] could not get route to [[INVALID],INVALID]
[node1b11.ecdf.ed.ac.uk:10204] [[6030,1],0] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file base/plm_base_proxy.c at line 81
[node1b11.ecdf.ed.ac.uk:10208] [[6030,1],1] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file rml_oob_send.c at line 104
[node1b11.ecdf.ed.ac.uk:10208] [[6030,1],1] could not get route to [[INVALID],INVALID]
[node1b11.ecdf.ed.ac.uk:10208] [[6030,1],1] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file base/plm_base_proxy.c at line 81
..
Charmrun> error attaching to node '127.0.0.1':
Timeout waiting for node-program to connect

Any suggestions would be appreciated.

Doug

________________________________
From: Vermaas, Joshua <Joshua.Vermaas_at_nrel.gov>
Sent: 06 March 2017 15:47
To: namd-l_at_ks.uiuc.edu; Norman Geist; HOUSTON Douglas
Subject: Re: AW: namd-l: REMD on HPC

Something else you can try to see if your performance is any better is a setup like this:

mpirun -np 128 /path/to/namd/namd2 +replicas 8 job0.conf +stdout %d/run0.%d.log

In my experience on the resources I can access, it automatically assigns each replica to a single node (or to adjacent nodes), provided your -np and +replicas arguments are compatible.
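As a rough illustration of the arithmetic behind the command above: with one replica per node, -np would be the node count times the cores per node, and +replicas would be the node count. The sketch below only prints the command line and does not run it. The function name replica_cmd and the variable NAMD_BIN are made up for illustration, and it assumes an MPI-launched namd2 as in the line above.

```shell
#!/usr/bin/env bash
# Sketch only: derive the rank count for "one replica per node" and print
# (not execute) an mpirun line in the same form as the command above.
# Assumptions: an MPI-launched namd2; NAMD_BIN is a placeholder path.

NAMD_BIN="/path/to/namd/namd2"

# replica_cmd NODES CORES_PER_NODE CONFIG
replica_cmd() {
    local nodes=$1 cores=$2 conf=$3
    local np=$(( nodes * cores ))   # total ranks across all nodes
    # One replica per node, so the replica count equals the node count.
    echo "mpirun -np ${np} ${NAMD_BIN} +replicas ${nodes} ${conf} +stdout %d/run0.%d.log"
}

# 8 nodes with 16 cores each gives -np 128 and +replicas 8.
replica_cmd 8 16 job0.conf
```

For 8 nodes of 16 cores this reproduces the -np 128 / +replicas 8 pairing shown above; whether each replica really lands on its own node still depends on how the scheduler and MPI place the ranks.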

-Josh

On 03/06/2017 08:31 AM, Norman Geist wrote:
Charmrun and NAMD will distribute the cores and nodes across the replicas, so the division must come out even. Your assumptions look correct, so I think it should work.
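Since the job queue makes trial and error expensive, the "must divide evenly" condition can be checked before submitting. This is a minimal sketch, assuming the condition is simply that the process count is a whole multiple of the replica count; ranks_per_replica is a hypothetical helper name, not part of NAMD or Charm++.

```shell
#!/usr/bin/env bash
# Sketch only: pre-flight check that NP processes split evenly over REPLICAS.
# ranks_per_replica is a hypothetical helper, not a NAMD or Charm++ tool.

# ranks_per_replica NP REPLICAS
# Prints the number of processes each replica would get; returns non-zero
# (with a message on stderr) when the split is not even.
ranks_per_replica() {
    local np=$1 replicas=$2
    if [ "$replicas" -le 0 ] || [ $(( np % replicas )) -ne 0 ]; then
        echo "error: ${np} processes cannot be split evenly across ${replicas} replicas" >&2
        return 1
    fi
    echo $(( np / replicas ))
}

ranks_per_replica 128 8   # 16 processes per replica
ranks_per_replica 8 8     # 1 process per replica
ranks_per_replica 30 8 || echo "uneven split: adjust -np or +replicas"
```

Note that -np 32 with +replicas 2 passes this check (16 per replica), so a divisibility error with those numbers would suggest the problem lies in how the processes are being launched rather than in the arithmetic itself.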

From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On Behalf Of HOUSTON Douglas
Sent: Monday, 6 March 2017 15:15
To: namd-l_at_ks.uiuc.edu
Subject: namd-l: REMD on HPC

Hi all,

I am trying to get REMD working on our local HPC machine. The setup consists of 16-core nodes connected via gigabit ethernet. I'd like to run the example REMD that comes with NAMD such that each replica is running on a single node, and is utilising all 16 cores. This is the example I mean:

NAMD_2.12_Linux-x86_64-netlrts/lib/replica/example/

However, I am having trouble working out what the command should look like. So far I have:

charmrun ++verbose +p16 ++nodelist nodelist.txt ++mpiexec mpirun -np 8 namd2 +replicas 8 job0.conf +stdout output/%d/job0.%d.log

but I don't really know if this will achieve what I want. Do I need to use charmrun at all? Will this result in all 16 cores in all 8 nodes being utilised, with one replica on each node? Unfortunately testing via trial and error is tricky due to the length of the job queue.

I have managed to get standard MD running across multiple nodes, so passwordless ssh etc. is not a problem.

Regards,

_________________________________________________________________
Dr. Douglas R. Houston
Senior Lecturer in Computational Biochemistry
Institute of Quantitative Biology, Biochemistry and Biotechnology
Room 3.23, Michael Swann Building
King's Buildings
University of Edinburgh
Edinburgh, EH9 3JR, UK
Tel. 0131 650 7358

The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
