RE: Running NAMD on SGE cluster - multiple modes and cores

From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Thu Dec 05 2013 - 04:57:18 CST

To verify that your NAMD build is working at all, maybe write a machinefile
manually and run NAMD from the command line. A machinefile simply contains
node names, like:

 

C01

C02

C03

C04

 

Then start like:

 

mpirun -machinefile machinefile -np 20 namd2 conf > out 2> error

 

Did you notice my typo?

 

“-np $TMPDIR/machines”

 

Should of course be

 

-machinefile $TMPDIR/machines
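
So the full corrected call in the job script would be something like the following (a sketch; it assumes a tight-integration parallel environment that actually exports $NSLOTS and writes $TMPDIR/machines):

mpirun -machinefile $TMPDIR/machines -np $NSLOTS namd2 conf > out 2> error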

 

Kind regards,

 

Norman Geist.

 

From: Anna Gorska [mailto:gvalchca.agr_at_gmail.com]
Sent: Thursday, December 5, 2013 11:44
To: Norman Geist
Subject: Re: namd-l: Running NAMD on SGE cluster - multiple modes and cores

 

Everything is bad; MPI just runs it twice. In the NAMD output the same step
appears twice:

 

OPENING EXTENDED SYSTEM TRAJECTORY FILE

TCL: Setting parameter constraintScaling to 0.5

TCL: Running for 200000 steps

PRESSURE: 0 597.333 103.702 184.113 135.273 640.616 -180.201 172.061 -244.037 329.661

GPRESSURE: 0 638.681 100.807 234.545 104.568 709.205 -175.087 129.792 -190.157 410.687

ETITLE: TS BOND ANGLE DIHED IMPRP ELECT VDW BOUNDARY MISC KINETIC TOTAL TEMP POTENTIAL TOTAL3 TEMPAVG PRESSURE GPRESSURE VOLUME PRESSAVG GPRESSAVG

ENERGY: 0 724.7455 2809.4883 4471.3331 2.6115 -255531.9064 20541.9404 343.2310 0.0000 34858.3561 -191780.2005 267.0207 -226638.5566 -191752.7939 267.0207 522.5365 586.1911 691151.1680 522.5365 586.1911

 

OPENING EXTENDED SYSTEM TRAJECTORY FILE

TCL: Setting parameter constraintScaling to 0.5

TCL: Running for 200000 steps

PRESSURE: 0 695.049 118.602 177.091 134.388 795.617 -215.887 171.065 -247.805 508.959

GPRESSURE: 0 736.37 116.273 227.721 104.245 864.461 -211.449 129.008 -194.59 589.271

ETITLE: TS BOND ANGLE DIHED IMPRP ELECT VDW BOUNDARY MISC KINETIC TOTAL TEMP POTENTIAL TOTAL3 TEMPAVG PRESSURE GPRESSURE VOLUME PRESSAVG GPRESSAVG

ENERGY: 0 724.7455 2809.4883 4471.3331 2.6115 -255531.9064 20541.9404 171.6155 0.0000 34858.3561 -191951.8160 267.0207 -226810.1721 -191924.3864 267.0207 666.5418 730.0339 691151.1680 666.5418 730.0339

 

 

Maybe I should specify something for NAMD?

Anna Gorska

 

 

And does NAMD write a simulation log?

 

Norman Geist.

 

From: Anna Gorska [mailto:gvalchca.agr_at_gmail.com]
Sent: Thursday, December 5, 2013 11:14
To: Norman Geist
Cc: Namd Mailing List
Subject: Re: namd-l: Running NAMD on SGE cluster - multiple modes and cores

 

 

This is what I got:

 

[proxy:0:0_at_node502] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:928): assert (!closed) failed
[proxy:0:0_at_node502] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0_at_node502] main (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
[mpiexec_at_node502] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:70): one of the processes terminated badly; aborting
[mpiexec_at_node502] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec_at_node502] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:191): launcher returned error waiting for completion
[mpiexec_at_node502] main (./ui/mpich/mpiexec.c:405): process manager error waiting for completion

 

Regards,

Anna Gorska

 

 

Anything in the NAMD log? Also, to catch the stderr output, you can append
"2> errors" to your call.
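
For example (a sketch, appended to whichever launch command you are currently using):

mpirun -np 20 namd2 conf > out 2> errors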

 

Norman Geist.

 

From: Anna Gorska [mailto:gvalchca.agr_at_gmail.com]
Sent: Thursday, December 5, 2013 11:01
To: Norman Geist
Subject: Re: namd-l: Running NAMD on SGE cluster - multiple modes and cores

 

 

Ok,

 

it runs on both nodes in parallel for some time and finishes with:

 

= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES

= EXIT CODE: 9

= CLEANING UP REMAINING PROCESSES

= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES

 

Regards,

Anna Gorska

 

 

Doesn't your SGE provide a machinefile? I usually use something like:

 

mpirun -np $NSLOTS -np $TMPDIR/machines namd2 conf > out

 

Norman Geist.

 

From: Anna Gorska [mailto:gvalchca.agr_at_gmail.com]
Sent: Wednesday, December 4, 2013 15:47
To: Norman Geist
Subject: Re: namd-l: Running NAMD on SGE cluster - multiple modes and cores

 

Hello,

running like this, with SGE_ROOT unset, via qsub:

charmrun ++verbose ++mpiexec ++remote-shell /usr/bin/mpirun +p20 namd2 conf > out

makes NAMD run on only one node (although the node file is correct).

 

If I run the following from qsub, it runs on one server and doesn't crash:

mpirun -np 20 namd2 conf > out

I am using the SGE parallel environment.

 

Regards,

Anna Gorska

 

 

 

This is another issue. The charmrun ibverbs stuff doesn't work with every
InfiniBand hardware; therefore one would usually build NAMD against MPI.
Did you try "unset SGE_ROOT" and starting with mpiexec? Also try leaving
out charmrun entirely and simply using mpirun.
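
A plain MPI start could look like this (a sketch; it assumes namd2 was built against your MPI and that mpirun gets its host list from the queuing system or a machinefile):

unset SGE_ROOT
mpirun -np 20 namd2 conf > out 2> error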

 

Norman Geist.

 

From: Anna Gorska [mailto:gvalchca.agr_at_gmail.com]
Sent: Wednesday, December 4, 2013 12:52
To: Norman Geist
Subject: Re: namd-l: Running NAMD on SGE cluster - multiple modes and cores

 

Hello,

 

I identified the problem, although I don't know how to fix it.

You were right; it also works without MPI, as you suggested:

 

charmrun ++verbose ++nodelist namd-machines-test +p10 namd2 conf > out

 

but then it runs on the requested nodes and CPUs for about a minute and
crashes with:

 

Charmrun: error on requested socket--

Socket closed before recv.

 

It always repeats, independently of the number of CPUs/memory/nodes.

 

Regards,

Anna Gorska

 

 

 

 

If you use the Sun Grid Engine, try "unset SGE_ROOT" within your job script
before invoking mpirun/charmrun. It's similar on our cluster and seems to be
related to the SGE support within OpenMPI.
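
In a job script that could look roughly like this (a sketch; the parallel environment name "mpi" and the slot count are placeholders for whatever your site defines):

#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -pe mpi 104
unset SGE_ROOT
charmrun ++nodelist namd-machines +p104 namd2 in > out 2> error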

 

Norman Geist.

 

From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On behalf of Anna Gorska
Sent: Tuesday, December 3, 2013 16:24
To: Norman Geist
Cc: Namd Mailing List
Subject: Re: namd-l: Running NAMD on SGE cluster - multiple modes and cores

 

 

Hello,

thank you for the quick response, although it still doesn't work. Without MPI
it does not even produce an error message, but you were right;

I set the MPI environment variable to pass the hosts file to MPI.

 

But it still runs on only one node.

My new command looks like this:

charmrun ++mpiexec ++remote-shell mpirun +p124 namd2 conf > out

 

Regards,

Anna Gorska

 

 

Of course, if you use mpiexec, it is expected that the queuing system
provides the list of machines to mpirun directly, so the nodelist is not
used. Try:

 

charmrun ++nodelist namd-machines +p104 namd2 in > out

 

And read the instructions for using mpiexec again, as I don't know them by
heart right now.

 

Norman Geist.

 

From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On behalf of Anna Gorska
Sent: Tuesday, December 3, 2013 12:26
To: namd-l_at_ks.uiuc.edu
Subject: namd-l: Running NAMD on SGE cluster - multiple modes and cores

 

Hello,

 

I am trying to run NAMD (the NAMD_2.9_Linux-x86_64 version) on multiple
nodes of an SGE cluster.

 

I am able to generate an adequate file with the list of nodes assigned to me
by the queuing system,

but NAMD always runs on only one node, taking the specified number of cores;

it behaves as if the ++nodelist option were not there at all.

 

This is the command I use:

 

charmrun ++mpiexec ++remote-shell mpirun ++nodelist namd-machines +p104 namd2 in > out

 

 

and the namd-machines file looks as follows:

 

group main ++pathfix /step2_2_tmp

 host node508

 host node508

 host node508

 host node508

 host node508

 host node508

 host node508

 host node508

 host node508

 host node508

 host node508

 host node503

 host node503

 host node503

 host node503

 host node503

 host node503

 

 

Sincerely,

Anna Gorska

____________

 

PhD student

Algorithms in Bioinformatics

University of Tuebingen

Germany

 


This archive was generated by hypermail 2.1.6 : Tue Dec 31 2013 - 23:24:03 CST