Re: Running NAMD on SGE cluster - multiple modes and cores

From: Anna Gorska (gvalchca.agr_at_gmail.com)
Date: Thu Dec 05 2013 - 05:09:27 CST

again, it repeats the run as many times as -n specifies.

This is my command; I am using MPI with the Hydra process manager:

mpirun -f namd-machines -n 4 namd2 step2_1.conf > out 2> error
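
For reference, one way the namd-machines file could be generated inside the SGE job script, plus a quick check of whether the namd2 binary is actually an MPI build (a sketch only; the $PE_HOSTFILE field layout is assumed, not taken from my actual script):

# build a Hydra-style machinefile (host:slots) from the SGE allocation
awk '{print $1 ":" $2}' $PE_HOSTFILE > namd-machines

# if this prints nothing, namd2 is probably not linked against MPI,
# and mpirun will just start independent copies of the same simulation
ldd $(which namd2) | grep -i mpi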

Anna Gorska

> Maybe, to verify that your namd build is working at all, write a machinefile manually and run namd from the command line. A machinefile simply contains node names, like:
>
> C01
> C02
> C03
> C04
>
> Then start like:
>
> mpirun -machinefile machinefile -np 20 namd2 conf > out 2> error
>
> Did you notice my typo
>
> “-np $TMPDIR/machines”
>
> Should of course be
>
> -machinefile $TMPDIR/machines
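>
> In a job script this would look roughly like the following (a sketch only; the parallel environment name and slot count are site-specific examples, not taken from your setup):
>
> #!/bin/bash
> #$ -N namd_run
> #$ -pe mpi 20
> #$ -cwd
>
> unset SGE_ROOT
> mpirun -machinefile $TMPDIR/machines -np $NSLOTS namd2 step2_1.conf > out 2> error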
>
> Best regards,
>
> Norman Geist.
>
> From: Anna Gorska [mailto:gvalchca.agr_at_gmail.com]
> Sent: Thursday, December 5, 2013 11:44
> To: Norman Geist
> Subject: Re: namd-l: Running NAMD on SGE cluster - multiple modes and cores
>
> Everything is bad; MPI just runs it twice. In the NAMD output the same step appears twice:
>
> OPENING EXTENDED SYSTEM TRAJECTORY FILE
> TCL: Setting parameter constraintScaling to 0.5
> TCL: Running for 200000 steps
> PRESSURE: 0 597.333 103.702 184.113 135.273 640.616 -180.201 172.061 -244.037 329.661
> GPRESSURE: 0 638.681 100.807 234.545 104.568 709.205 -175.087 129.792 -190.157 410.687
> ETITLE: TS BOND ANGLE DIHED IMPRP ELECT VDW BOUNDARY MISC KINETIC TOTAL TEMP POTENTIAL TOTAL3 TEMPAVG PRESSURE GPRESSURE VOLUME PRESSAVG GPRESSAVG
>
> ENERGY: 0 724.7455 2809.4883 4471.3331 2.6115 -255531.9064 20541.9404 343.2310 0.0000 34858.3561 -191780.2005 267.0207 -226638.5566 -191752.7939 267.0207 522.5365 586.1911 691151.1680 522.5365 586.1911
>
> OPENING EXTENDED SYSTEM TRAJECTORY FILE
> TCL: Setting parameter constraintScaling to 0.5
> TCL: Running for 200000 steps
> PRESSURE: 0 695.049 118.602 177.091 134.388 795.617 -215.887 171.065 -247.805 508.959
> GPRESSURE: 0 736.37 116.273 227.721 104.245 864.461 -211.449 129.008 -194.59 589.271
> ETITLE: TS BOND ANGLE DIHED IMPRP ELECT VDW BOUNDARY MISC KINETIC TOTAL TEMP POTENTIAL TOTAL3 TEMPAVG PRESSURE GPRESSURE VOLUME PRESSAVG GPRESSAVG
>
> ENERGY: 0 724.7455 2809.4883 4471.3331 2.6115 -255531.9064 20541.9404 171.6155 0.0000 34858.3561 -191951.8160 267.0207 -226810.1721 -191924.3864 267.0207 666.5418 730.0339 691151.1680 666.5418 730.0339
>
>
> Maybe I should specify something for NAMD?
> Anna Gorska
>
>
> And does namd write a simulation log?
>
> Norman Geist.
>
> From: Anna Gorska [mailto:gvalchca.agr_at_gmail.com]
> Sent: Thursday, December 5, 2013 11:14
> To: Norman Geist
> Cc: Namd Mailing List
> Subject: Re: namd-l: Running NAMD on SGE cluster - multiple modes and cores
>
>
> This is what I got:
>
> [proxy:0:0_at_node502] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:928): assert (!closed) failed
> [proxy:0:0_at_node502] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
> [proxy:0:0_at_node502] main (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
> [mpiexec_at_node502] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:70): one of the processes terminated badly; aborting
> [mpiexec_at_node502] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
> [mpiexec_at_node502] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:191): launcher returned error waiting for completion
> [mpiexec_at_node502] main (./ui/mpich/mpiexec.c:405): process manager error waiting for completion
>
> Regards,
> Anna Gorska
>
>
> Anything in the namd log? Also, to catch the stderr output, you can append "2> errors" to your call.
>
> Norman Geist.
>
> From: Anna Gorska [mailto:gvalchca.agr_at_gmail.com]
> Sent: Thursday, December 5, 2013 11:01
> To: Norman Geist
> Subject: Re: namd-l: Running NAMD on SGE cluster - multiple modes and cores
>
>
> Ok,
>
> it runs on both nodes in parallel for some time and finishes with:
>
> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> = EXIT CODE: 9
> = CLEANING UP REMAINING PROCESSES
> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>
> Regards,
> Anna Gorska
>
>
> Doesn't your SGE provide a machinefile? I usually use something like
>
> mpirun -np $NSLOTS -np $TMPDIR/machines namd2 conf > out
>
> Norman Geist.
>
> From: Anna Gorska [mailto:gvalchca.agr_at_gmail.com]
> Sent: Wednesday, December 4, 2013 15:47
> To: Norman Geist
> Subject: Re: namd-l: Running NAMD on SGE cluster - multiple modes and cores
>
> Hello,
> running like this via qsub, with SGE_ROOT unset,
> charmrun ++verbose ++mpiexec ++remote-shell /usr/bin/mpirun +p20 namd2 conf > out
> makes namd run on only one node (although the node file is correct).
>
> If I run this from qsub, it runs on one server and doesn't crash:
> mpirun -np 20 namd2 conf > out
> I am using the SGE parallel environment.
>
> Regards,
> Anna Gorska
>
>
>
> This is another issue. The charmrun ibverbs stuff doesn't work with every InfiniBand hardware; therefore one would usually build namd against MPI. Did you try "unset SGE_ROOT" and starting with mpiexec? Also try leaving out charmrun entirely and simply using mpirun.
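>
> For reference, an MPI build of NAMD 2.9 goes roughly like this (a sketch only; the compiler choice, the bundled Charm++ version and any FFTW/Tcl settings may differ on your system):
>
> # inside the unpacked NAMD_2.9_Source directory
> tar xf charm-6.4.0.tar && cd charm-6.4.0
> ./build charm++ mpi-linux-x86_64 --with-production
> cd ..
> ./config Linux-x86_64-g++ --charm-arch mpi-linux-x86_64
> cd Linux-x86_64-g++ && make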
>
> Norman Geist.
>
> From: Anna Gorska [mailto:gvalchca.agr_at_gmail.com]
> Sent: Wednesday, December 4, 2013 12:52
> To: Norman Geist
> Subject: Re: namd-l: Running NAMD on SGE cluster - multiple modes and cores
>
> Hello,
>
> I identified the problem, although I don't know how to fix it.
> You were right, it also works without MPI, as you suggested:
>
> charmrun ++verbose ++nodelist namd-machines-test +p10 namd2 conf > out
>
> but then it runs on the requested nodes and CPUs for about a minute and crashes with:
>
> Charmrun: error on requested socket--
> Socket closed before recv.
>
> It always happens, independently of the number of CPUs/memory/nodes.
>
> Regards,
> Anna Gorska
>
>
>
>
> If you use the Sun Grid Engine, try "unset SGE_ROOT" within your job script, before invoking mpirun/charmrun. It's similar on our cluster and seems to be related to the SGE support within OpenMPI.
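>
> That is, somewhere in the job script, before the launch line, something like:
>
> unset SGE_ROOT
> mpirun -np 20 namd2 conf > out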
>
> Norman Geist.
>
> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On behalf of Anna Gorska
> Sent: Tuesday, December 3, 2013 16:24
> To: Norman Geist
> Cc: Namd Mailing List
> Subject: Re: namd-l: Running NAMD on SGE cluster - multiple modes and cores
>
>
> Hello,
> thank you for the quick response, although it still doesn't work. Without MPI it does not even produce an error message, but you were right:
> I set the MPI environment variable to pass the hosts file to MPI.
>
> But it still runs on only one node.
> My new command looks like this:
> charmrun ++mpiexec ++remote-shell mpirun +p124 namd2 conf > out
>
> Regards,
> Anna Gorska
>
>
> Of course, if you use mpiexec, it is expected that the queuing system provides the list of machines to mpirun directly, so the nodelist is not used. Try
>
> charmrun ++nodelist namd-machines +p104 namd2 in > out
>
> And read the instructions for using mpiexec again, as I don't know them by heart right now.
>
> Norman Geist.
>
> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On behalf of Anna Gorska
> Sent: Tuesday, December 3, 2013 12:26
> To: namd-l_at_ks.uiuc.edu
> Subject: namd-l: Running NAMD on SGE cluster - multiple modes and cores
>
> Hello,
>
> I am trying to run NAMD (the NAMD_2.9_Linux-x86_64 build) on multiple nodes of an SGE cluster.
>
> I am able to generate an adequate file with the list of nodes assigned to me by the queuing system,
> but NAMD always runs on only one node, taking the specified number of cores -
> it behaves as if the ++nodelist option were not there at all.
>
> This is the command I use:
>
> charmrun ++mpiexec ++remote-shell mpirun ++nodelist namd-machines +p104 namd2 in > out
>
>
> and the namd-machines file looks as follows:
>
> group main ++pathfix /step2_2_tmp
> host node508
> host node508
> host node508
> host node508
> host node508
> host node508
> host node508
> host node508
> host node508
> host node508
> host node508
> host node503
> host node503
> host node503
> host node503
> host node503
> host node503
>
>
> Sincerely,
> Anna Gorska
> ____________
>
> PhD student
> Algorithms in Bioinformatics
> University of Tuebingen
> Germany
>

This archive was generated by hypermail 2.1.6 : Tue Dec 31 2013 - 23:24:03 CST