From: Jim Phillips (jim_at_ks.uiuc.edu)
Date: Thu Nov 01 2018 - 11:33:45 CDT
I strongly recommend using ++mpiexec or ++mpiexec-no-n to interface
charmrun with the queueing system. From the NAMD release notes:

Writing batch job scripts to run charmrun in a queueing system can be
challenging. Since most clusters provide directions for using mpiexec
to launch MPI jobs, charmrun provides a ++mpiexec option to use mpiexec
to launch non-MPI binaries. If "mpiexec -n <procs> ..." is not
sufficient to launch jobs on your cluster you will need to write an
executable mympiexec script like the following from TACC:

  #!/bin/csh
  shift; shift; exec ibrun $*

The job is then launched (with full paths where needed) as:

  charmrun +p<procs> ++mpiexec ++remote-shell mympiexec namd2 <configfile>
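
For readers more used to bash than csh, here is a rough sketch of what the
mympiexec wrapper above does. It assumes, as the "shift; shift" implies,
that charmrun calls the wrapper as "mympiexec -n <procs> <program> ...".
The LAUNCHER and DRY_RUN variables are hypothetical additions for
illustration only and are not part of NAMD or Charm++.

```shell
#!/bin/bash
# bash sketch of the csh mympiexec wrapper quoted above.
# Assumption (inferred from "shift; shift"): charmrun invokes the wrapper as
#   mympiexec -n <procs> <program> [args...]
# LAUNCHER and DRY_RUN are hypothetical variables used only in this sketch.
mympiexec() {
    # Discard "-n" and "<procs>"; the site launcher takes the process
    # count from the queueing system instead.
    shift 2
    if [ -n "${DRY_RUN:-}" ]; then
        # Only show the command that would be executed.
        echo "${LAUNCHER:-ibrun}" "$@"
    else
        # Replace this shell with the launcher, as the csh "exec" does.
        exec "${LAUNCHER:-ibrun}" "$@"
    fi
}

# Dry run: prints the command instead of executing it.
DRY_RUN=1 mympiexec -n 80 namd2 bpti.namd
```

The dry run prints "ibrun namd2 bpti.namd", showing that the two leading
arguments are dropped and the rest is handed to the launcher unchanged.
On a cluster without ibrun, the launcher would be whatever command the
site documents for starting parallel jobs.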

Charm++ now provides the option ++mpiexec-no-n for the common case
where mpiexec does not accept "-n <procs>" and instead derives the
number of processes to launch directly from the queueing system:

  charmrun +p<procs> ++mpiexec-no-n ++remote-shell ibrun namd2 <configfile>

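
As a sketch of how the two launch forms might look with the paths from the
LSF job quoted below, the helper here only prints the charmrun command line
and runs nothing. It assumes the cluster's mpiexec (or equivalent launcher)
is integrated with LSF. The function name charmrun_cmd is hypothetical, and
LSB_DJOB_NUMPROC is assumed to be set by LSF to the number of allocated
slots inside a job.

```shell
#!/bin/bash
#BSUB -J NAMD2018
#BSUB -n 80

# Paths taken from the job script quoted below; adjust for your site.
NAMD_DIR=/shared/app/NAMD_Git-2018-09-21_Source
CONFIG=/home/hazards/NAMD/bpti.namd

# Print (do not run) a charmrun command line.
#   $1 = number of processes
#   $2 = "mpiexec" or "mpiexec-no-n"
#   $3 = optional launcher for ++remote-shell (e.g. a mympiexec wrapper)
# charmrun_cmd is a hypothetical helper for illustration only.
charmrun_cmd() {
    procs=$1; mode=$2; launcher=${3:-}
    cmd="$NAMD_DIR/charmrun +p$procs ++$mode"
    if [ -n "$launcher" ]; then
        cmd="$cmd ++remote-shell $launcher"
    fi
    echo "$cmd $NAMD_DIR/namd2 $CONFIG"
}

# Assumption: LSF sets LSB_DJOB_NUMPROC inside a job; fall back to 80 here.
charmrun_cmd "${LSB_DJOB_NUMPROC:-80}" mpiexec
```

Once the printed line looks right for the site, the same command can replace
the "++remote-shell ssh ++nodelist ..." invocation in the job script, so that
no hand-written nodelist is needed.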
Jim
On Thu, 1 Nov 2018, Hazard, E. Starr wrote:
> RHEL v6 LSF manager
>
>
> I compiled NAMD/charm
>
>
> ~/COMPILE3/NAMD_Git-2018-09-21_Source/charm-6.8.2
>
>
> here's my smart-build log
>
> cat ~/COMPILE3/NAMD_Git-2018-09-21_Source/charm-6.8.2/smart-build.log
> Fri Sep 21 12:40:12 EDT 2018
> Using the following build command:
> ./build charm++ mpi-linux-x86_64 -j4 -g -O0
>
> Fri Sep 21 12:47:14 EDT 2018
> Using the following build command:
> ./build charm++ mpi-linux-x86_64 smp -j4 -g -O0
>
> Fri Sep 21 12:48:28 EDT 2018
> Using the following build command:
> ./build charm++ mpi-linux-x86_64 -j4 -g -O0
>
> Fri Sep 21 12:50:01 EDT 2018
> Using the following build command:
> ./build charm++ netlrts-linux-x86_64 gcc gfortran -j4 -g -O0
>
> Wed Oct 3 17:00:28 EDT 2018
> Using the following build command:
> ./build LIBS netlrts-linux-x86_64 gcc -j4 --with-production
>
>
>
> my LSF file
>
> #!/bin/bash
> #BSUB -J NAMD2018
> #BSUB -o NAMD2018_OUT%J
> #BSUB -e NAMDERR.e%J
> #BSUB -n 80
> #BSUB -u hazards_at_musc.edu
> export PWD=/home/hazards/NAMD/:$PWD
> export PATH=/home/hazards/NAMD/toppar:$PATH
> /shared/app/NAMD_Git-2018-09-21_Source/charmrun +p80 ++verbose ++remote-shell ssh ++nodelist /home/hazards/NAMD/nodelist \
> /shared/app/NAMD_Git-2018-09-21_Source/namd2 +isomalloc_sync /home/hazards/NAMD/bpti.namd > \
> /home/hazards/NAMD/BPTI-namdcompilecharm_allnodes80.out
>
> The LSF file captures this
>
> cat NAMDERR.e9041
> Charmrun> charmrun started...
> Charmrun> using /home/hazards/NAMD/nodelist as nodesfile
> Charmrun> remote shell (10.200.1.3:0) started
> Charmrun> remote shell (10.200.1.5:7) started
> Charmrun> remote shell (10.200.1.6:14) started
> Charmrun> remote shell (10.200.1.7:21) started
> Charmrun> remote shell (10.200.1.8:28) started
> Charmrun> remote shell (10.200.1.9:35) started
> Charmrun> remote shell (10.200.1.10:42) started
> Charmrun> remote shell (10.200.1.12:49) started
> Charmrun> remote shell (10.200.1.13:56) started
> Charmrun> remote shell (10.200.1.15:62) started
> Charmrun> remote shell (10.200.1.16:68) started
> Charmrun> remote shell (10.200.1.17:74) started
> Charmrun> node programs all started
> Charmrun> error attaching to node '10.200.1.3':
> Timeout waiting for node-program to connect
>
>
>
> The NAMD output looks like this
> more BPTI-namdcompilecharm_allnodes80.out
> Charmrun remote shell(10.200.1.13.56)> remote responding...
> Charmrun remote shell(10.200.1.13.56)> starting node-program...
> Charmrun remote shell(10.200.1.13.56)> remote shell phase successful.
> Charmrun remote shell(10.200.1.17.74)> remote responding...
> Charmrun remote shell(10.200.1.16.68)> remote responding...
> Charmrun remote shell(10.200.1.6.14)> remote responding...
> Charmrun remote shell(10.200.1.17.74)> starting node-program...
> Charmrun remote shell(10.200.1.17.74)> remote shell phase successful.
> Charmrun remote shell(10.200.1.6.14)> starting node-program...
> Charmrun remote shell(10.200.1.6.14)> remote shell phase successful.
> Charmrun remote shell(10.200.1.7.21)> remote responding...
> ..
> Charmrun remote shell(10.200.1.15.62)> remote responding...
> Charmrun remote shell(10.200.1.5.7)> starting node-program...
> Charmrun remote shell(10.200.1.5.7)> remote shell phase successful.
> Charmrun remote shell(10.200.1.12.49)> remote responding...
> ...
> Charmrun remote shell(10.200.1.7.21)> starting node-program...
> Charmrun remote shell(10.200.1.9.35)> starting node-program...
> Charmrun remote shell(10.200.1.9.35)> remote shell phase successful.
> Charmrun remote shell(10.200.1.12.49)> starting node-program...
> Charmrun remote shell(10.200.1.12.49)> remote shell phase successful.
> Charmrun remote shell(10.200.1.3.0)> starting node-program...
> Charmrun remote shell(10.200.1.3.0)> remote shell phase successful.
> Charmrun> scalable start enabled.
> Charmrun> adding client 0: "10.200.1.3", IP:10.200.1.3
> Charmrun> adding client 1: "10.200.1.3", IP:10.200.1.3
> Charmrun> adding client 2: "10.200.1.3", IP:10.200.1.3
> Charmrun> adding client 3: "10.200.1.3", IP:10.200.1.3
> Charmrun> adding client 4: "10.200.1.3", IP:10.200.1.3
> Charmrun> adding client 5: "10.200.1.3", IP:10.200.1.3
> Charmrun> adding client 6: "10.200.1.3", IP:10.200.1.3
> Charmrun> adding client 7: "10.200.1.5", IP:10.200.1.5
> Charmrun> adding client 8: "10.200.1.5", IP:10.200.1.5
> Charmrun> adding client 9: "10.200.1.5", IP:10.200.1.5
> Charmrun> adding client 10: "10.200.1.5", IP:10.200.1.5
> Charmrun> adding client 11: "10.200.1.5", IP:10.200.1.5
> Charmrun> adding client 12: "10.200.1.5", IP:10.200.1.5
> Charmrun> adding client 13: "10.200.1.5", IP:10.200.1.5
> Charmrun> adding client 14: "10.200.1.6", IP:10.200.1.6
> Charmrun> adding client 15: "10.200.1.6", IP:10.200.1.6
> Charmrun> adding client 16: "10.200.1.6", IP:10.200.1.6
> Charmrun> adding client 17: "10.200.1.6", IP:10.200.1.6
> Charmrun> adding client 18: "10.200.1.6", IP:10.200.1.6
> Charmrun> adding client 19: "10.200.1.6", IP:10.200.1.6
> Charmrun> adding client 20: "10.200.1.6", IP:10.200.1.6
> Charmrun> adding client 21: "10.200.1.7", IP:10.200.1.7
> Charmrun> adding client 22: "10.200.1.7", IP:10.200.1.7
> Charmrun> adding client 23: "10.200.1.7", IP:10.200.1.7
> Charmrun> adding client 24: "10.200.1.7", IP:10.200.1.7
> Charmrun> adding client 25: "10.200.1.7", IP:10.200.1.7
> Charmrun> adding client 26: "10.200.1.7", IP:10.200.1.7
> Charmrun> adding client 27: "10.200.1.7", IP:10.200.1.7
> Charmrun> adding client 28: "10.200.1.8", IP:10.200.1.8
> Charmrun> adding client 29: "10.200.1.8", IP:10.200.1.8
> Charmrun> adding client 30: "10.200.1.8", IP:10.200.1.8
> Charmrun> adding client 31: "10.200.1.8", IP:10.200.1.8
> Charmrun> adding client 32: "10.200.1.8", IP:10.200.1.8
> Charmrun> adding client 33: "10.200.1.8", IP:10.200.1.8
> Charmrun> adding client 34: "10.200.1.8", IP:10.200.1.8
> Charmrun> adding client 35: "10.200.1.9", IP:10.200.1.9
> Charmrun> adding client 36: "10.200.1.9", IP:10.200.1.9
> Charmrun> adding client 37: "10.200.1.9", IP:10.200.1.9
> Charmrun> adding client 38: "10.200.1.9", IP:10.200.1.9
> Charmrun> adding client 39: "10.200.1.9", IP:10.200.1.9
> Charmrun> adding client 40: "10.200.1.9", IP:10.200.1.9
> Charmrun> adding client 41: "10.200.1.9", IP:10.200.1.9
> Charmrun> adding client 42: "10.200.1.10", IP:10.200.1.10
> Charmrun> adding client 43: "10.200.1.10", IP:10.200.1.10
> Charmrun> adding client 44: "10.200.1.10", IP:10.200.1.10
> Charmrun> adding client 45: "10.200.1.10", IP:10.200.1.10
> Charmrun> adding client 46: "10.200.1.10", IP:10.200.1.10
> Charmrun> adding client 47: "10.200.1.10", IP:10.200.1.10
> Charmrun> adding client 48: "10.200.1.10", IP:10.200.1.10
> Charmrun> adding client 49: "10.200.1.12", IP:10.200.1.12
> Charmrun> adding client 50: "10.200.1.12", IP:10.200.1.12
> Charmrun> adding client 51: "10.200.1.12", IP:10.200.1.12
> Charmrun> adding client 52: "10.200.1.12", IP:10.200.1.12
> Charmrun> adding client 53: "10.200.1.12", IP:10.200.1.12
> Charmrun> adding client 54: "10.200.1.12", IP:10.200.1.12
> Charmrun> adding client 55: "10.200.1.12", IP:10.200.1.12
> Charmrun> adding client 56: "10.200.1.13", IP:10.200.1.13
> Charmrun> adding client 57: "10.200.1.13", IP:10.200.1.13
> Charmrun> adding client 58: "10.200.1.13", IP:10.200.1.13
> Charmrun> adding client 59: "10.200.1.13", IP:10.200.1.13
> Charmrun> adding client 60: "10.200.1.13", IP:10.200.1.13
> Charmrun> adding client 61: "10.200.1.13", IP:10.200.1.13
> Charmrun> adding client 62: "10.200.1.15", IP:10.200.1.15
> Charmrun> adding client 63: "10.200.1.15", IP:10.200.1.15
> Charmrun> adding client 64: "10.200.1.15", IP:10.200.1.15
> Charmrun> adding client 65: "10.200.1.15", IP:10.200.1.15
> Charmrun> adding client 66: "10.200.1.15", IP:10.200.1.15
> Charmrun> adding client 67: "10.200.1.15", IP:10.200.1.15
> Charmrun> adding client 68: "10.200.1.16", IP:10.200.1.16
> Charmrun> adding client 69: "10.200.1.16", IP:10.200.1.16
> Charmrun> adding client 70: "10.200.1.16", IP:10.200.1.16
> Charmrun> adding client 71: "10.200.1.16", IP:10.200.1.16
> Charmrun> adding client 72: "10.200.1.16", IP:10.200.1.16
> Charmrun> adding client 73: "10.200.1.16", IP:10.200.1.16
> Charmrun> adding client 74: "10.200.1.17", IP:10.200.1.17
> Charmrun> adding client 75: "10.200.1.17", IP:10.200.1.17
> Charmrun> adding client 76: "10.200.1.17", IP:10.200.1.17
> Charmrun> adding client 77: "10.200.1.17", IP:10.200.1.17
> Charmrun> adding client 78: "10.200.1.17", IP:10.200.1.17
> Charmrun> adding client 79: "10.200.1.17", IP:10.200.1.17
> Charmrun> Charmrun = 10.200.1.13, port = 60873
> start_nodes_ssh
> Charmrun> Sending "0 10.200.1.13 60873 24622 0" to client 0.
> Charmrun> find the node program "/shared/app/NAMD_Git-2018-09-21_Source/namd2" at "/home/hazards/NAMD" for 0.
> Charmrun> Starting ssh 10.200.1.3 -l hazards -o KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o NoHostAuthenticationForLocalhost=yes /bin/bash -f
> Charmrun> Sending "7 10.200.1.13 60873 24622 0" to client 7.
> Charmrun> find the node program "/shared/app/NAMD_Git-2018-09-21_Source/namd2" at "/home/hazards/NAMD" for 7.
> Charmrun> Starting ssh 10.200.1.5 -l hazards -o KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o NoHostAuthenticationForLocalhost=yes /bin/bash -f
> Charmrun> Sending "14 10.200.1.13 60873 24622 0" to client 14.
> Charmrun> find the node program "/shared/app/NAMD_Git-2018-09-21_Source/namd2" at "/home/hazards/NAMD" for 14.
> Charmrun> Starting ssh 10.200.1.6 -l hazards -o KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o NoHostAuthenticationForLocalhost=yes /bin/bash -f
> Charmrun> Sending "21 10.200.1.13 60873 24622 0" to client 21.
> Charmrun> find the node program "/shared/app/NAMD_Git-2018-09-21_Source/namd2" at "/home/hazards/NAMD" for 21.
> Charmrun> Starting ssh 10.200.1.7 -l hazards -o KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o NoHostAuthenticationForLocalhost=yes /bin/bash -f
> Charmrun> Sending "28 10.200.1.13 60873 24622 0" to client 28.
> Charmrun> find the node program "/shared/app/NAMD_Git-2018-09-21_Source/namd2" at "/home/hazards/NAMD" for 28.
> Charmrun> Starting ssh 10.200.1.8 -l hazards -o KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o NoHostAuthenticationForLocalhost=yes /bin/bash -f
> Charmrun> Sending "35 10.200.1.13 60873 24622 0" to client 35.
> Charmrun> find the node program "/shared/app/NAMD_Git-2018-09-21_Source/namd2" at "/home/hazards/NAMD" for 35.
> Charmrun> Starting ssh 10.200.1.9 -l hazards -o KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o NoHostAuthenticationForLocalhost=yes /bin/bash -f
> Charmrun> Sending "42 10.200.1.13 60873 24622 0" to client 42.
> Charmrun> find the node program "/shared/app/NAMD_Git-2018-09-21_Source/namd2" at "/home/hazards/NAMD" for 42.
> Charmrun> Starting ssh 10.200.1.10 -l hazards -o KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o NoHostAuthenticationForLocalhost=yes /bin/bash -f
> Charmrun> Sending "49 10.200.1.13 60873 24622 0" to client 49.
> Charmrun> find the node program "/shared/app/NAMD_Git-2018-09-21_Source/namd2" at "/home/hazards/NAMD" for 49.
> Charmrun> Starting ssh 10.200.1.12 -l hazards -o KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o NoHostAuthenticationForLocalhost=yes /bin/bash -f
> Charmrun> Sending "56 10.200.1.13 60873 24622 0" to client 56.
> Charmrun> find the node program "/shared/app/NAMD_Git-2018-09-21_Source/namd2" at "/home/hazards/NAMD" for 56.
> Charmrun> Starting ssh 10.200.1.13 -l hazards -o KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o NoHostAuthenticationForLocalhost=yes /bin/bash -f
> Charmrun> Sending "62 10.200.1.13 60873 24622 0" to client 62.
> Charmrun> find the node program "/shared/app/NAMD_Git-2018-09-21_Source/namd2" at "/home/hazards/NAMD" for 62.
> Charmrun> Starting ssh 10.200.1.15 -l hazards -o KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o NoHostAuthenticationForLocalhost=yes /bin/bash -f
> Charmrun> Sending "68 10.200.1.13 60873 24622 0" to client 68.
> Charmrun> find the node program "/shared/app/NAMD_Git-2018-09-21_Source/namd2" at "/home/hazards/NAMD" for 68.
> Charmrun> Starting ssh 10.200.1.16 -l hazards -o KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o NoHostAuthenticationForLocalhost=yes /bin/bash -f
> Charmrun> Sending "74 10.200.1.13 60873 24622 0" to client 74.
> Charmrun> find the node program "/shared/app/NAMD_Git-2018-09-21_Source/namd2" at "/home/hazards/NAMD" for 74.
> Charmrun> Starting ssh 10.200.1.17 -l hazards -o KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o NoHostAuthenticationForLocalhost=yes /bin/bash -f
> Charmrun> Waiting for 0-th client to connect.
>
> here are my nodelist files. I have tried both
>
> cat nodelist
> group main ++shell ssh
> host 10.200.1.3
> host 10.200.1.5
> host 10.200.1.6
> host 10.200.1.7
> host 10.200.1.8
> host 10.200.1.9
> host 10.200.1.10
> host 10.200.1.12
> host 10.200.1.13
> host 10.200.1.15
> host 10.200.1.16
> host 10.200.1.17
> hpcc3:/home/hazards/NAMD: cat nodelist.Oct31
> group main ++shell ssh
> host compute000
> host compute002
> host compute003
> host compute004
> host compute005
> host compute006
> host compute007
> host compute009
> host compute010
> host compute012
> host compute013
> host compute013
> host compute014
>
>
>
> I have tried to understand the advice given here https://www.ks.uiuc.edu/Research/namd/mailing_list/namd-l.2013-2014/0538.html
> I can ping my hostname from any and all nodes.
>
> I need some help. Thanks in advance
>
> Starr