Re: Running NAMD on SGE cluster - multiple nodes and cores

From: Anna Gorska (gvalchca.agr_at_gmail.com)
Date: Fri Dec 06 2013 - 07:23:29 CST

Hello Norman,

I managed to run NAMD on two nodes of the SGE cluster by establishing my own ssh connection, which lets me allocate more resources. I submit 2 jobs into the queue; every job writes the name of the node and the number of CPUs it was given to a common file, and every job starts an sshd process listening on a port.
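(For illustration, a minimal sketch of one such per-job script; the parallel environment name, port number, sshd config path and sleep time are hypothetical and depend on the local setup. SGE provides $NSLOTS, and namd-machines-test sits on a shared filesystem:)

#!/bin/bash
#$ -cwd
#$ -pe smp 48
# Record this node once per granted slot, in the charmrun nodelist format.
for i in $(seq 1 $NSLOTS); do
    echo "host $(hostname)" >> namd-machines-test
done
# Start a user-level sshd on a high port for charmrun to connect through
# (PAM disabled and MaxSessions raised in this config file).
/usr/sbin/sshd -f $HOME/.ssh/sshd_config -p 2222
# Keep the job alive so the allocation is not released.
sleep 43200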

Next I run it:

charmrun ++remote-shell my_ssh_connect.sh ++verbose ++nodelist namd-machines-test +pNodes namd2 in > out 2> error

where my_ssh_connect.sh uses the previously established port. The trick is also to disable PAM and increase the MaxSessions parameter in the sshd_config file.
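(A minimal sketch of what such a my_ssh_connect.sh wrapper could look like; the port number is a hypothetical example and must match the user-level sshd started by the jobs:)

#!/bin/bash
# Wrapper passed to charmrun via ++remote-shell: charmrun calls it as
# "my_ssh_connect.sh <host> <command...>", so forward everything to ssh
# on the custom port.
exec ssh -p 2222 -o StrictHostKeyChecking=no "$@"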

But it turned out not to scale well: using this method on 93 CPUs I compute 1033 ps in 12 hours, while a normal single-node run gives 1064 ps. Any suggestions on how to speed it up?

Regards,
Anna Gorska

> And does namd write a simulation log?
>
> Norman Geist.
>
> From: Anna Gorska [mailto:gvalchca.agr_at_gmail.com]
> Sent: Thursday, December 5, 2013 11:14
> To: Norman Geist
> Cc: Namd Mailing List
> Subject: Re: namd-l: Running NAMD on SGE cluster - multiple nodes and cores
>
>
> This is what I got:
>
> [proxy:0:0_at_node502] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:928): assert (!closed) failed
> [proxy:0:0_at_node502] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
> [proxy:0:0_at_node502] main (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
> [mpiexec_at_node502] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:70): one of the processes terminated badly; aborting
> [mpiexec_at_node502] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
> [mpiexec_at_node502] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:191): launcher returned error waiting for completion
> [mpiexec_at_node502] main (./ui/mpich/mpiexec.c:405): process manager error waiting for completion
>
> Regards,
> Anna Gorska
>
>
> Is there something in the NAMD log? Also, to catch the stderr output you can append "2> errors" to your call.
>
> Norman Geist.
>
> From: Anna Gorska [mailto:gvalchca.agr_at_gmail.com]
> Sent: Thursday, December 5, 2013 11:01
> To: Norman Geist
> Subject: Re: namd-l: Running NAMD on SGE cluster - multiple nodes and cores
>
>
> Ok,
>
> it runs on both nodes in parallel for some time and finishes with:
>
> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> = EXIT CODE: 9
> = CLEANING UP REMAINING PROCESSES
> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>
> Regards,
> Anna Gorska
>
>
> Doesn't your SGE provide a machinefile? I usually use something like
>
> mpirun -np $NSLOTS -machinefile $TMPDIR/machines namd2 conf > out
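>
> (A minimal sketch of a full SGE submit script along these lines; the parallel environment name "mpi" is an assumption, and it presumes the PE is set up to write a machines file under $TMPDIR:)
>
> #!/bin/bash
> #$ -cwd
> #$ -pe mpi 104
> # $NSLOTS and $TMPDIR/machines come from the SGE parallel environment.
> mpirun -np $NSLOTS -machinefile $TMPDIR/machines namd2 conf > out 2> errors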
>
> Norman Geist.
>
> From: Anna Gorska [mailto:gvalchca.agr_at_gmail.com]
> Sent: Wednesday, December 4, 2013 15:47
> To: Norman Geist
> Subject: Re: namd-l: Running NAMD on SGE cluster - multiple nodes and cores
>
> Hello,
> running like this, with SGE_ROOT unset, via qsub
> charmrun ++verbose ++mpiexec ++remote-shell /usr/bin/mpirun +p20 namd2 conf > out
> makes NAMD run on only one node (although the node file is correct).
>
> If I run this from qsub, it runs on one server and doesn't crash:
> mpirun -np 20 namd2 conf > out
> I am using the SGE parallel environment.
>
> Regards,
> Anna Gorska
>
>
>
> This is another issue. The charmrun ibverbs stuff doesn't work with every InfiniBand hardware. Therefore one would usually build NAMD against MPI. Did you try "unset SGE_ROOT" and starting with mpiexec? Also try leaving out charmrun entirely and simply using mpirun.
>
> Norman Geist.
>
> From: Anna Gorska [mailto:gvalchca.agr_at_gmail.com]
> Sent: Wednesday, December 4, 2013 12:52
> To: Norman Geist
> Subject: Re: namd-l: Running NAMD on SGE cluster - multiple nodes and cores
>
> Hello,
>
> I identified the problem, although I don't know how to fix it.
> You were right, it also works without MPI, as you suggested:
>
> charmrun ++verbose ++nodelist namd-machines-test +p10 namd2 conf > out
>
> but then it runs on the requested nodes and CPUs for about a minute and crashes with:
>
> Charmrun: error on requested socket--
> Socket closed before recv.
>
> It always repeats, independently of the number of CPUs/memory/nodes.
>
> Regards,
> Anna Gorska
>
>
>
>
> If you use the Sun Grid Engine, try "unset SGE_ROOT" within your job script, before calling mpirun/charmrun. It's similar on our cluster and seems to be related to the SGE support within OpenMPI.
>
> Norman Geist.
>
> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On behalf of Anna Gorska
> Sent: Tuesday, December 3, 2013 16:24
> To: Norman Geist
> Cc: Namd Mailing List
> Subject: Re: namd-l: Running NAMD on SGE cluster - multiple nodes and cores
>
>
> Hello,
> thank you for the quick response, although it still doesn't work. Without MPI it does not even produce an error message, but you were right:
> I set the MPI environment variable to pass the file with the hosts to MPI.
>
> But it still runs on only one node.
> My new command looks like this:
> charmrun ++mpiexec ++remote-shell mpirun +p124 namd2 conf > out
>
> Regards,
> Anna Gorska
>
>
> Of course, if you use mpiexec, it is expected that the queuing system provides the list of machines to mpirun directly, so the nodelist is not used. Try
>
> charmrun ++nodelist namd-machines +p104 namd2 in > out
>
> And read the instructions for using mpiexec again, as I don't know them by heart right now.
>
> Norman Geist.
>
> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On behalf of Anna Gorska
> Sent: Tuesday, December 3, 2013 12:26
> To: namd-l_at_ks.uiuc.edu
> Subject: namd-l: Running NAMD on SGE cluster - multiple nodes and cores
>
> Hello,
>
> I am trying to run NAMD (the NAMD_2.9_Linux-x86_64 version) on multiple nodes of an SGE cluster.
>
> I am able to generate an adequate file with the list of nodes assigned to me by the queuing system,
> but NAMD always runs on only one node, taking the specified number of cores -
> it behaves as if the ++nodelist were not there at all.
>
> This is the command I use:
>
> charmrun ++mpiexec ++remote-shell mpirun ++nodelist namd-machines +p104 namd2 in > out
>
>
> and the namd-machines file looks as follows:
>
> group main ++pathfix /step2_2_tmp
> host node508
> host node508
> host node508
> host node508
> host node508
> host node508
> host node508
> host node508
> host node508
> host node508
> host node508
> host node503
> host node503
> host node503
> host node503
> host node503
> host node503
>
>
> Sincerely,
> Anna Gorska
> ____________
>
> PHD student
> Algorithms in Bioinformatics
> University of Tuebingen
> Germany
>

This archive was generated by hypermail 2.1.6 : Tue Dec 31 2013 - 23:24:04 CST