Re: Running NAMD on SGE cluster - multiple modes and cores

From: Anna Gorska (gvalchca.agr_at_gmail.com)
Date: Thu Dec 05 2013 - 04:14:05 CST

This is what I got:

[proxy:0:0_at_node502] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:928): assert (!closed) failed
[proxy:0:0_at_node502] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0_at_node502] main (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
[mpiexec_at_node502] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:70): one of the processes terminated badly; aborting
[mpiexec_at_node502] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec_at_node502] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:191): launcher returned error waiting for completion
[mpiexec_at_node502] main (./ui/mpich/mpiexec.c:405): process manager error waiting for completion

Regards,
Anna Gorska

> Is there anything in the NAMD log? Also, to catch the stderr output you can append "2> errors" to your call.
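>
> For example, appending it to the charmrun call used further down in this thread:
>
> charmrun ++verbose ++mpiexec ++remote-shell /usr/bin/mpirun +p20 namd2 conf > out 2> errors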
>
> Norman Geist.
>
> From: Anna Gorska [mailto:gvalchca.agr_at_gmail.com]
> Sent: Thursday, December 5, 2013 11:01
> To: Norman Geist
> Subject: Re: namd-l: Running NAMD on SGE cluster - multiple modes and cores
>
>
> Ok,
>
> it runs in parallel on both nodes for some time and finishes with:
>
> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> = EXIT CODE: 9
> = CLEANING UP REMAINING PROCESSES
> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>
> Regards,
> Anna Gorska
>
>
> Doesn't your SGE provide a machinefile? I usually use something like:
>
> mpirun -np $NSLOTS -machinefile $TMPDIR/machines namd2 conf > out
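>
> For context, a minimal SGE jobscript sketch along those lines (the parallel environment name "mpi" and the slot count are placeholders; it assumes the PE writes a machinefile to $TMPDIR/machines, and the machinefile flag spelling differs between MPI implementations, e.g. -f for MPICH's Hydra):
>
> #!/bin/bash
> #$ -S /bin/bash
> #$ -cwd
> #$ -pe mpi 20
> unset SGE_ROOT   # see the note about the SGE support in OpenMPI further down in this thread
> mpirun -np $NSLOTS -machinefile $TMPDIR/machines namd2 conf > out 2> errors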
>
> Norman Geist.
>
> From: Anna Gorska [mailto:gvalchca.agr_at_gmail.com]
> Sent: Wednesday, December 4, 2013 15:47
> To: Norman Geist
> Subject: Re: namd-l: Running NAMD on SGE cluster - multiple modes and cores
>
> Hello,
> running it like this via qsub, with SGE_ROOT unset,
> charmrun ++verbose ++mpiexec ++remote-shell /usr/bin/mpirun +p20 namd2 conf > out
> makes NAMD run on only one node (although the node file is correct).
>
> If I run this from qsub, it runs on one server and doesn't crash:
> mpirun -np 20 namd2 conf > out
> I am using the SGE parallel environment.
>
> Regards,
> Anna Gorska
>
>
>
> This is another issue. The charmrun ibverbs stuff doesn't work with every InfiniBand hardware; therefore one would usually build NAMD against MPI. Did you try "unset SGE_ROOT" and starting with mpiexec? Also try leaving out charmrun entirely and simply using mpirun.
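>
> For reference, a sketch of the usual MPI build (based on the generic NAMD 2.9 source release; the bundled Charm++ version and the compiler target may differ on your system):
>
> tar xzf NAMD_2.9_Source.tar.gz
> cd NAMD_2.9_Source
> tar xf charm-6.4.0.tar
> cd charm-6.4.0
> ./build charm++ mpi-linux-x86_64 --with-production
> cd ..
> ./config Linux-x86_64-g++ --charm-arch mpi-linux-x86_64
> cd Linux-x86_64-g++
> make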
>
> Norman Geist.
>
> From: Anna Gorska [mailto:gvalchca.agr_at_gmail.com]
> Sent: Wednesday, December 4, 2013 12:52
> To: Norman Geist
> Subject: Re: namd-l: Running NAMD on SGE cluster - multiple modes and cores
>
> Hello,
>
> I identified the problem, although don’t know how to fix it.
> You were right, it also works without MPI, as you suggested:
>
> charmrun ++verbose ++nodelist namd-machines-test +p10 namd2 conf > out
>
> but then it runs on the requested nodes and CPUs for about a minute and crashes with:
>
> Charmrun: error on requested socket--
> Socket closed before recv.
>
> It happens every time, independently of the number of CPUs/memory/nodes.
>
> Regards,
> Anna Gorska
>
>
>
>
> If you use the Sun Grid Engine, try "unset SGE_ROOT" within your jobscript, before invoking mpirun/charmrun. It's the same on our cluster and seems to be related to the SGE support within OpenMPI.
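>
> Placed in the jobscript, that would look roughly like this (using the call from your mail):
>
> unset SGE_ROOT
> charmrun ++mpiexec ++remote-shell mpirun +p124 namd2 conf > out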
>
> Norman Geist.
>
> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On behalf of Anna Gorska
> Sent: Tuesday, December 3, 2013 16:24
> To: Norman Geist
> Cc: Namd Mailing List
> Subject: Re: namd-l: Running NAMD on SGE cluster - multiple modes and cores
>
>
> Hello,
> thank you for the quick response, although it still doesn't work. Without MPI it does not even produce an error message, but you were right:
> I set the MPI environment variable to pass the hosts file to MPI.
>
> But it still runs on only one node.
> My new command looks like this:
> charmrun ++mpiexec ++remote-shell mpirun +p124 namd2 conf > out
>
> Regards,
> Anna Gorska
>
>
> Of course, if you use mpiexec, it is expected that the queuing system provides the list of machines to mpirun directly, so the nodelist is not used. Try
>
> charmrun ++nodelist namd-machines +p104 namd2 in > out
>
> And read the instructions for using mpiexec again, as I don't know them by heart right now.
>
> Norman Geist.
>
> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On behalf of Anna Gorska
> Sent: Tuesday, December 3, 2013 12:26
> To: namd-l_at_ks.uiuc.edu
> Subject: namd-l: Running NAMD on SGE cluster - multiple modes and cores
>
> Hello,
>
> I am trying to run NAMD (the NAMD_2.9_Linux-x86_64 build) on multiple nodes of an SGE cluster.
>
> I am able to generate an adequate file with the list of nodes assigned to me by the queuing system,
> but NAMD always runs on only one node, taking the specified number of cores;
> it behaves as if ++nodelist were not there at all.
>
> This is the command I use:
>
> charmrun ++mpiexec ++remote-shell mpirun ++nodelist namd-machines +p104 namd2 in > out
>
>
> and the namd-machines file looks as follows (a sketch of how it can be generated from SGE's $PE_HOSTFILE follows the listing):
>
> group main ++pathfix /step2_2_tmp
> host node508
> host node508
> host node508
> host node508
> host node508
> host node508
> host node508
> host node508
> host node508
> host node508
> host node508
> host node503
> host node503
> host node503
> host node503
> host node503
> host node503
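>
> For reference, a sketch of how such a file can be generated inside the jobscript from SGE's $PE_HOSTFILE (one "host" line per granted slot; the ++pathfix line is specific to my setup):
>
> echo "group main ++pathfix /step2_2_tmp" > namd-machines
> awk '{for (i = 0; i < $2; i++) print "host", $1}' $PE_HOSTFILE >> namd-machines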
>
>
> Sincerely,
> Anna Gorska
> ____________
>
> PHD student
> Algorithms in Bioinformatics
> University of Tuebingen
> Germany

This archive was generated by hypermail 2.1.6 : Wed Dec 31 2014 - 23:21:58 CST