From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Thu Dec 05 2013 - 04:33:12 CST
And does namd write a simulation log?
Norman Geist.
From: Anna Gorska [mailto:gvalchca.agr_at_gmail.com]
Sent: Thursday, December 5, 2013 11:14
To: Norman Geist
Cc: Namd Mailing List
Subject: Re: namd-l: Running NAMD on SGE cluster - multiple modes and cores
This is what I got:
[proxy:0:0_at_node502] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:928): assert (!closed) failed
[proxy:0:0_at_node502] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0_at_node502] main (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
[mpiexec_at_node502] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:70): one of the processes terminated badly; aborting
[mpiexec_at_node502] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec_at_node502] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:191): launcher returned error waiting for completion
[mpiexec_at_node502] main (./ui/mpich/mpiexec.c:405): process manager error waiting for completion
Regards,
Anna Gorska
Is there anything in the namd log? Also, to catch the stderr output you can
append "2> errors" to your call.
Norman Geist.
From: Anna Gorska [mailto:gvalchca.agr_at_gmail.com]
Sent: Thursday, December 5, 2013 11:01
To: Norman Geist
Subject: Re: namd-l: Running NAMD on SGE cluster - multiple modes and cores
Ok,
it runs in parallel on both nodes for some time and finishes with:
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 9
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
Regards,
Anna Gorska
Doesn't your SGE provide a machinefile? I usually use something like
mpirun -np $NSLOTS -machinefile $TMPDIR/machines namd2 conf > out
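A fuller jobscript sketch along those lines; the parallel-environment name and the machines-file location are site-specific assumptions, so check them with your cluster admin:

```shell
#!/bin/sh
#$ -pe mpi 20        # hypothetical PE name; use whatever your site defines
#$ -cwd
# SGE exports $NSLOTS and, in many PE configurations, writes the list of
# allocated hosts to $TMPDIR/machines for the job:
mpirun -np "$NSLOTS" -machinefile "$TMPDIR/machines" namd2 conf > out 2> errors
```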
Norman Geist.
From: Anna Gorska [mailto:gvalchca.agr_at_gmail.com]
Sent: Wednesday, December 4, 2013 15:47
To: Norman Geist
Subject: Re: namd-l: Running NAMD on SGE cluster - multiple modes and cores
Hello,
running like this via qsub with SGE_ROOT unset:
charmrun ++verbose ++mpiexec ++remote-shell /usr/bin/mpirun +p20 namd2 conf > out
makes NAMD run on only one node (although the node file is proper).
If I run the following from qsub, it runs on one server and doesn't crash:
mpirun -np 20 namd2 conf > out
I am using the SGE parallel environment,
Regards,
Anna Gorska
This is another issue. The charmrun ibverbs stuff doesn't work with every
InfiniBand hardware; therefore one would usually build NAMD against MPI.
Did you try "unset SGE_ROOT" and starting with mpiexec? Also try leaving out
charmrun entirely and simply using mpirun.
Norman Geist.
From: Anna Gorska [mailto:gvalchca.agr_at_gmail.com]
Sent: Wednesday, December 4, 2013 12:52
To: Norman Geist
Subject: Re: namd-l: Running NAMD on SGE cluster - multiple modes and cores
Hello,
I identified the problem, although I don't know how to fix it.
You were right, it also works without MPI, as you suggested:
charmrun ++verbose ++nodelist namd-machines-test +p10 namd2 conf > out
but then it runs on the requested nodes and CPUs for about a minute and crashes
with:
Charmrun: error on requested socket-
Socket closed before recv.
It always happens, independently of the number of CPUs/memory/nodes,
Regards,
Anna Gorska
If you use the Sun Grid Engine, try "unset SGE_ROOT" within your jobscript
before invoking mpirun/charmrun. It's similar on our cluster and seems to
be related to the SGE support within OpenMPI.
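A minimal sketch of that workaround inside a jobscript; the SGE_ROOT path is illustrative, and the actual launch line is commented out since it needs a real allocation:

```shell
# Inside the jobscript, before launching (path is illustrative):
SGE_ROOT=/opt/sge          # normally exported by SGE itself
unset SGE_ROOT             # hide SGE from OpenMPI's tight integration
test -z "${SGE_ROOT+x}" && echo "SGE_ROOT cleared"
# mpirun -np "$NSLOTS" namd2 conf > out
```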
Norman Geist.
From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On behalf of Anna Gorska
Sent: Tuesday, December 3, 2013 16:24
To: Norman Geist
Cc: Namd Mailing List
Subject: Re: namd-l: Running NAMD on SGE cluster - multiple modes and cores
Hello,
thank you for the quick response, although it still doesn't work. Without MPI
it does not even produce an error message. But you were right:
I set the MPI environment variable to pass the file with the hosts to MPI,
but it still runs on only one node.
My new command looks like this:
charmrun ++mpiexec ++remote-shell mpirun +p124 namd2 conf > out
Regards,
Anna Gorska
Of course, if you use mpiexec, it is expected that the queuing system
provides the list of machines to mpirun directly, so the nodelist is not
used. Try
charmrun ++nodelist namd-machines +p104 namd2 in > out
and read the instructions for using mpiexec again, as I don't know them by
heart right now.
Norman Geist.
From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On behalf of Anna Gorska
Sent: Tuesday, December 3, 2013 12:26
To: namd-l_at_ks.uiuc.edu
Subject: namd-l: Running NAMD on SGE cluster - multiple modes and cores
Hello,
I am trying to run NAMD (the NAMD_2.9_Linux-x86_64 build) on multiple nodes
of an SGE cluster.
I am able to generate a proper file listing the nodes assigned to me by the
queuing system,
but NAMD always runs on only one node, taking the specified number of cores;
it behaves as if ++nodelist were not there at all.
This is the command I use:
charmrun ++mpiexec ++remote-shell mpirun ++nodelist namd-machines +p104 namd2 in > out
and the namd-machines file looks as follows:
group main ++pathfix /step2_2_tmp
host node508
host node508
host node508
host node508
host node508
host node508
host node508
host node508
host node508
host node508
host node508
host node503
host node503
host node503
host node503
host node503
host node503
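A nodelist like the one above can also be generated from SGE's $PE_HOSTFILE, whose lines in the standard SGE format look like "hostname nslots queue processor". A sketch, using an example hostfile in place of the real $PE_HOSTFILE:

```shell
# Build a charmrun nodelist from SGE's $PE_HOSTFILE.
# Each $PE_HOSTFILE line looks like: "node508 11 all.q@node508 UNDEFINED"
# (here we fake one; in a jobscript, read "$PE_HOSTFILE" instead).
printf 'node508 11 all.q@node508 UNDEFINED\nnode503 6 all.q@node503 UNDEFINED\n' > pe_hostfile_example
echo "group main" > namd-machines
# Emit one "host <name>" line per allocated slot on each node:
awk '{ for (i = 0; i < $2; i++) print "host " $1 }' pe_hostfile_example >> namd-machines
head -3 namd-machines
```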
Sincerely,
Anna Gorska
____________
PHD student
Algorithms in Bioinformatics
University of Tuebingen
Germany
_____
This email is free of viruses and malware because avast! Antivirus
protection (http://www.avast.com/) is active.
This archive was generated by hypermail 2.1.6 : Tue Dec 31 2013 - 23:24:04 CST