Re: Running NAMD on SGE cluster - multiple modes and cores

From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Thu Dec 05 2013 - 04:33:12 CST

And does namd write a simulation log?

 

Norman Geist.

 

From: Anna Gorska [mailto:gvalchca.agr_at_gmail.com]
Sent: Thursday, 5 December 2013 11:14
To: Norman Geist
Cc: Namd Mailing List
Subject: Re: namd-l: Running NAMD on SGE cluster - multiple modes and cores

 

 

This is what I got:

 

[proxy:0:0@node502] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:928): assert (!closed) failed
[proxy:0:0@node502] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0@node502] main (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
[mpiexec@node502] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:70): one of the processes terminated badly; aborting
[mpiexec@node502] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@node502] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:191): launcher returned error waiting for completion
[mpiexec@node502] main (./ui/mpich/mpiexec.c:405): process manager error waiting for completion

 

Regards,

Anna Gorska

 

 

Is there anything in the NAMD log? Also, to catch the stderr output you can
append "2> errors" to your call.
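
A minimal sketch of that redirection, for illustration only. The shell function fake_namd is a hypothetical stand-in for the real mpirun/namd2 call, and the scratch directory is an assumption made so the sketch can run anywhere; the file names "out" and "errors" follow the thread.

```shell
# Hypothetical stand-in for "mpirun ... namd2 conf": it writes one line to
# stdout, one line to stderr, and exits with a non-zero status.
fake_namd() {
    echo "Info: simulation output"
    echo "FATAL ERROR: example" >&2
    return 9
}

workdir=$(mktemp -d)   # scratch directory so the sketch leaves no stray files

# "> out" captures stdout, "2> errors" captures stderr separately.
# "|| status=$?" records the exit status without aborting the script.
status=0
fake_namd > "$workdir/out" 2> "$workdir/errors" || status=$?

echo "exit status: $status"
```

In a real job script the same two redirections would simply be appended to the actual launch line, so that the NAMD log and the launcher's error messages end up in separate files.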

 

Norman Geist.

 

From: Anna Gorska [mailto:gvalchca.agr_at_gmail.com]
Sent: Thursday, 5 December 2013 11:01
To: Norman Geist
Subject: Re: namd-l: Running NAMD on SGE cluster - multiple modes and cores

 

 

Ok,

 

it runs on both nodes in parallel for some time and then ends with:

 

= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES

= EXIT CODE: 9

= CLEANING UP REMAINING PROCESSES

= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES

 

Regards,

Anna Gorska

 

 

Doesn't your SGE provide a machinefile? I usually use something like:

mpirun -np $NSLOTS -machinefile $TMPDIR/machines namd2 conf > out
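
Before launching, it can help to confirm that the machinefile really lists more than one host. The following is only a sketch: summarize_machinefile is a hypothetical helper name, and it assumes the machinefile contains one host name per line, repeated once per slot. The demo file is made up, using node names from this thread.

```shell
# Print "host slots" for every host in a machinefile.
# Assumption: the machinefile has one host name per line, repeated per slot.
summarize_machinefile() {
    sort "$1" | uniq -c | awk '{ print $2, $1 }'
}

# Demo with a hypothetical machinefile (node names taken from the thread).
demo_machines=$(mktemp)
printf '%s\n' node508 node508 node508 node503 node503 > "$demo_machines"
summarize_machinefile "$demo_machines"
```

For the demo file this prints "node503 2" and "node508 3". In a job script one would point the helper at the scheduler's machinefile instead; a single output line there would mean the job was only given one node.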

 

Norman Geist.

 

From: Anna Gorska [mailto:gvalchca.agr_at_gmail.com]
Sent: Wednesday, 4 December 2013 15:47
To: Norman Geist
Subject: Re: namd-l: Running NAMD on SGE cluster - multiple modes and cores

 

Hello,

running it like this via qsub, with SGE_ROOT unset:

charmrun ++verbose ++mpiexec ++remote-shell /usr/bin/mpirun +p20 namd2 conf > out

makes NAMD run on only one node (although the node file is correct).

If I run the following from qsub, it runs on one server and doesn't crash:

mpirun -np 20 namd2 conf > out

I am using the SGE parallel environment.

 

Regards,

Anna Gorska

 

 

 

This is a different issue. The charmrun ibverbs build doesn't work with all
InfiniBand hardware, so one would usually build NAMD against MPI. Did you try
"unset SGE_ROOT" and starting with mpiexec? Also try leaving out charmrun
altogether and simply using mpirun.

 

Norman Geist.

 

From: Anna Gorska [mailto:gvalchca.agr_at_gmail.com]
Sent: Wednesday, 4 December 2013 12:52
To: Norman Geist
Subject: Re: namd-l: Running NAMD on SGE cluster - multiple modes and cores

 

Hello,

 

I identified the problem, although I don't know how to fix it.

You were right, it also works without MPI, as you suggested:

 

charmrun ++verbose ++nodelist namd-machines-test +p10 namd2 conf > out

 

but then it runs on the requested nodes and CPUs for about a minute and
crashes with:

 

Charmrun: error on requested socket-

Socket closed before recv.

 

This happens every time, regardless of the number of CPUs, memory, or nodes.

 

Regards,

Anna Gorska

 

 

 

 

If you use the Sun Grid Engine, try "unset SGE_ROOT" within your jobscript,
before calling mpirun/charmrun. We see something similar on our cluster, and
it seems to be related to the SGE support within Open MPI.
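
One way to apply this without affecting the rest of the job script is to unset the variable only for the launcher. This is a variation on the advice above, not something taken from the thread: run_without_sge_root is a hypothetical helper name, and the /opt/sge path is a made-up value used only for the demo.

```shell
# Run a command with SGE_ROOT removed from its environment only. The
# subshell keeps the variable intact for the rest of the job script.
run_without_sge_root() {
    ( unset SGE_ROOT; "$@" )
}

# Demo: the child process no longer sees SGE_ROOT, the caller still does.
export SGE_ROOT=/opt/sge   # hypothetical path, for the demo only
run_without_sge_root sh -c 'echo "child sees: ${SGE_ROOT:-<unset>}"'
echo "caller sees: $SGE_ROOT"
```

In a job script the helper would wrap the real launch line, with the mpirun or charmrun command and its arguments passed as the helper's arguments. A plain "unset SGE_ROOT" before the launch, as suggested above, is simpler if nothing later in the script needs the variable.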

 

Norman Geist.

 

From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On Behalf Of Anna Gorska
Sent: Tuesday, 3 December 2013 16:24
To: Norman Geist
Cc: Namd Mailing List
Subject: Re: namd-l: Running NAMD on SGE cluster - multiple modes and cores

 

 

Hello,

thank you for the quick response, although it still doesn't work. Without MPI
it does not even produce an error message, but you were right, so I set an
MPI environment variable that passes the hosts file to MPI.

But it still runs on only one node.

My new command looks like this:

charmrun ++mpiexec ++remote-shell mpirun +p124 namd2 conf > out

 

Regards,

Anna Gorska

 

 

Of course. If you use mpiexec, the queuing system is expected to provide the
list of machines to mpirun directly, so the nodelist is not used. Try:

 

charmrun ++nodelist namd-machines +p104 namd2 in > out

 

And read the instructions for using mpiexec again, as I don't know them by
heart.

 

Norman Geist.

 

From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On Behalf Of Anna Gorska
Sent: Tuesday, 3 December 2013 12:26
To: namd-l_at_ks.uiuc.edu
Subject: namd-l: Running NAMD on SGE cluster - multiple modes and cores

 

Hello,

 

I am trying to run NAMD (the NAMD_2.9_Linux-x86_64 build) on multiple nodes
of an SGE cluster.

I am able to generate a proper file listing the nodes assigned to me by the
queuing system, but NAMD always runs on only one node, using the specified
number of cores. It behaves as if ++nodelist were not there at all.

 

This is the command I use:

 

charmrun ++mpiexec ++remote-shell mpirun ++nodelist namd-machines +p104 namd2 in > out

 

 

and the namd-machines file looks as follows:

 

group main ++pathfix /step2_2_tmp
 host node508
 host node508
 host node508
 host node508
 host node508
 host node508
 host node508
 host node508
 host node508
 host node508
 host node508
 host node503
 host node503
 host node503
 host node503
 host node503
 host node503
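
A nodelist in this layout can be generated from the scheduler's host file. The sketch below is an illustration, not the script used in this thread: make_nodelist is a hypothetical helper name, it assumes the input has the "hostname slots queue processor-range" layout of SGE's $PE_HOSTFILE, and it omits the ++pathfix option shown above. The demo host file is made up.

```shell
# Write a charmrun nodelist ("group main" plus one "host" line per slot)
# from an SGE-style hostfile.
# Assumption: each input line is "hostname slots queue processor-range",
# which is the layout SGE uses for $PE_HOSTFILE.
make_nodelist() {
    echo "group main"
    awk '{ for (i = 0; i < $2; i++) print " host " $1 }' "$1"
}

# Demo with a hypothetical two-node hostfile.
demo_hostfile=$(mktemp)
printf '%s\n' 'node508 2 all.q@node508 UNDEFINED' \
              'node503 1 all.q@node503 UNDEFINED' > "$demo_hostfile"
make_nodelist "$demo_hostfile"
```

Inside a job script the helper would read "$PE_HOSTFILE" and its output would be redirected into the file later passed to ++nodelist.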

 

 

Sincerely,

Anna Gorska

____________

 

PHD student

Algorithms in Bioinformatics

University of Tuebingen

Germany

 


This archive was generated by hypermail 2.1.6 : Tue Dec 31 2013 - 23:24:04 CST