Using Your Clustermatic Cluster
This exercise should be done while logged in as a normal user,
not as root.  You can create a normal user account with the command
"adduser username" and then set the password with
"passwd username".
Part 1: Run NAMD
NAMD is a parallel molecular dynamics application developed in our
group.  It is the main application run on our clusters.
  -  Copy the files NAMD_2.6b1_Linux-i686-Clustermatic4.tar.gz (NAMD binary)
       and apoa1.tar.gz (sample NAMD simulation)
       from the workshop CD and untar them in your home directory with:
tar xzf apoa1.tar.gz
tar xzf NAMD_2.6b1_Linux-i686-Clustermatic4.tar.gz
 
 Yes, the file says Clustermatic 4 but you're running 5.  It's OK.
Clustermatic 5 is actually backwards compatible for a change.
-  cd NAMD_2.6b1_Linux-i686-Clustermatic4 
-  Start NAMD on all four machines with:
./charmrun +p4 ./namd2 ~/apoa1/apoa1.namd
 
 If you have problems, or want to see what's going on in the
launch process, add ++verbose to the charmrun command line.  The
charmrun program interacts with the bproc system to find which nodes
are up, including the master node.  If not enough nodes are available
it will start re-using nodes (useful for SMP nodes).  Running charmrun
without arguments will list its other options, such as ++skipmaster.
Running with ++skipmaster will only work if the NAMD input files are
available on the slaves.  NAMD does all of its I/O from the master
process, so we run it on the master node and access our main NFS
servers.
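For example, the same four-processor run with launch tracing turned on:
./charmrun ++verbose +p4 ./namd2 ~/apoa1/apoa1.namd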
-  When NAMD reaches the line that says "TIMING 20 ..." kill it with
       Control-C and jot down the wallclock seconds/step value. 
-  Run NAMD again on two processors (change +p4 above to +p2) for
       20 steps and compare the performance between the two.  Do four
       processors run twice as fast as two?  How close to twice? 
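One way to turn the two timings into a speedup is to divide the
two-processor seconds/step by the four-processor seconds/step.  The
numbers below are made up purely for illustration:
# hypothetical timings: 0.40 s/step with +p2, 0.22 s/step with +p4
echo "scale=2; 0.40 / 0.22" | bc   # prints 1.81; an ideal speedup would be 2.00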
 Part 2: Compile and Run Tachyon
Tachyon is a parallel ray tracer developed by John Stone for his
master's thesis.  It is an example of a typical MPI application.
  -  Copy the file tachyon-0.97.tar.gz (Tachyon source and examples)
       from the workshop CD and untar it in your home directory with:
tar xzf tachyon-0.97.tar.gz
 
 
-  cd tachyon/unix 
-  Use a text editor to open the file Make-arch 
-  Search for the config options for "linux-lam" 
-  Copy this set of options to a new entry. 
-  Change (in the new entry) linux-lam to linux-mpich 
-  Change "CC = hcc" to "CC = gcc" 
-  Change -I$(LAMHOME)/h to -I/usr/mpich-p4/include 
-  Change -L$(LAMHOME)/lib to -L/usr/mpich-p4/lib 
-  Change -lmpi to -lmpich 
-  Save, quit the editor and run "make linux-mpich"
       to build tachyon.  If this doesn't work you probably missed
       one of the edits above, or applied them in the wrong place.
       The tachyon binary will end up in compile/linux-mpich/.  A
       sketch of the finished Make-arch entry follows. 
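The sketch below shows roughly how the new entry should read once the
substitutions are in place; the parts written as <...> stand for
whatever other flags the existing linux-lam entry already carries,
copied over unchanged:
linux-mpich:
	$(MAKE) all \
	"ARCH = linux-mpich" \
	"CC = gcc" \
	"CFLAGS = <existing compiler flags> -I/usr/mpich-p4/include" \
	"LIBS = <existing link flags> -L/usr/mpich-p4/lib -lmpich"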
-  cd (back to your home directory) 
-  Run Tachyon on the three slave machines with:
/usr/mpich-p4/bin/mpirun -d -p 3 \
  tachyon/compile/linux-mpich/tachyon +V tachyon/scenes/balls.dat
 
 The Clustermatic mpirun is broken and does not allow the master
node to be used for MPI jobs.  This is fine for their 1000-processor
clusters where they want minimal load on the master, but bad for us.
Tachyon reads input on every node, so the NFS mounting of /home on
the slaves is necessary.
-  Look at the timing output, which is broken into different
       stages of the calculation.  Run on two and one processors
       (change -p 3) and calculate speedups for the different
       stages as well as the total time. 
Part 3: Run Under Grid Engine
Sun Grid Engine (SGE) is a free, open source, general purpose,
cross-platform queueing system.  In the genealogy of queueing systems,
it is a descendant of the free DQS package, which was commercialized
by a German company that was recently bought by Sun.
  -  Run "qstat -f" to see the queue that was automatically
       created.  There should be only one queue, for the master node.
       The states column at far right is used for error flags. 
-  Run "qconf -sq all.q" to see the queue setup for the
       cluster.  Note that there are many options to restrict
       user access, memory usage, runtime, etc. that are turned off
       by default.  The only unique thing is the qname and hostlist.
This is a newer version of Grid Engine than on the Rocks CD, so
there will be a few differences if you compare them. 
-  Use a text editor to create the file tachyon.job containing:
#$ -cwd
#$ -j y
#$ -S /bin/bash
/usr/mpich-p4/bin/mpirun -d -p `bpstat -t allup` \
  tachyon/compile/linux-mpich/tachyon +V tachyon/scenes/balls.dat
 
 Notice the similarity to the command for running Tachyon
manually.  Since SGE doesn't know about bproc or the slave nodes,
we use bpstat to find out how many slave nodes are up.
The options preceded by #$ are parsed by SGE as if they were
specified on the command line.  -cwd causes the job to execute in
the current working directory.  -j y merges standard error and output
into a single file.  -S /bin/bash says to use the bash shell for this
script.
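For reference, the same options could be given to qsub on the command
line instead of being embedded in the script:
qsub -cwd -j y -S /bin/bash tachyon.job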
-  Submit the job to run on the full cluster with the command
       "qsub tachyon.job".  Note that there is only one queue for
       the job to go to, all.q. 
-  Use "qstat -f" to check on the job until it is scheduled,
       then look for output files named tachyon.job.oX and
       tachyon.job.poX, where X is the job number output by qsub.  View
       these files to see the output. 
-  Submit several jobs so that a backlog develops.  You can use the
       same tachyon.job file for all of them; just use the up arrow and
       hit return to submit jobs quickly, or use a loop like the one
       shown below.  Use qstat to monitor how the jobs are executed
       (the default scheduling policy is to take the earliest-submitted
       job that can be run, and the scheduler runs at regular
       intervals). 
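A quick shell loop does the same thing; the count of five here is
arbitrary:
for i in 1 2 3 4 5; do qsub tachyon.job; done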
-  Use a text editor to create the file namd.job containing:
#$ -cwd
#$ -j y
#$ -S /bin/bash
dir=$HOME/NAMD_2.6b1_Linux-i686-Clustermatic4
$dir/charmrun +p$((`bpstat -t allup` + 1)) $dir/namd2 ~/apoa1/apoa1.namd
 
 Since NAMD uses the head node, we use some shell magic
to add one to the number of available slave nodes returned by bpstat.
If these were dual-processor nodes, we would need to multiply by
two as well.
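For example, on hypothetical dual-processor nodes the charmrun line in
namd.job would become:
$dir/charmrun +p$(( (`bpstat -t allup` + 1) * 2 )) $dir/namd2 ~/apoa1/apoa1.namd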
-  Submit the job with the command "qsub namd.job". 
-  Use qstat to monitor the job until it starts running, then use
       "tail -f namd.job.oX" (X is the job number) to watch the
       job output. 
-  When you get tired of this, Control-C out of tail and use
       "qdel X" (X is the job number) to kill the job.  Use qstat
       to monitor the job until it is killed. 
-  We are going to add a queue, so become root with "su root"
       
-  Dump the all.q configuration with
       "qconf -sq all.q > /tmp/q". 
-  Open /tmp/q with an editor and change qname to express,
       subordinate_list to all.q, and h_rt to 60. 
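After the edit, the three changed lines in /tmp/q should read roughly
as follows (leave every other line as qconf dumped it; spacing and
field order can differ between SGE versions):
qname                 express
subordinate_list      all.q
h_rt                  60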
-  Load the file as a new queue with "qconf -Aq /tmp/q".
       This creates a queue named express that will suspend jobs in
       the main queue for up to 60 seconds.  In real life you would use
       a longer time, like 1800 seconds (30 minutes). 
-  exit from the root shell to become a normal user. 
-  Submit a long-running NAMD job with "qsub -q all.q namd.job";
       use qstat to see that it starts in the default queue all.q.
       Now that we have multiple queues available it is important
            to be specific about which queue we want.  Otherwise, if
            the all.q queue were busy this job would run in the express
            queue and be killed after 60 seconds. 
-  Submit another NAMD job with "qsub -q express namd.job"; use
       qstat to see that it starts in the express queue, and that the
       old job has an S in the states column since it is suspended.
       
-  View the live output of the old job with "tail -f
       namd.job.oX" to see that it is stopped.  If you wait for a
       minute it should restart when the job in the express queue is
       killed for exceeding its time limit.  Having an express queue
       is very useful for short test and setup runs.  Only bproc-based
       systems like Clustermatic can do this smoothly for parallel runs
       (remember that SGE knows nothing about bproc).  You must,
       however, be sure that the cluster has enough memory for both
       jobs. 
Part 4: There Is No Part 4
 Compiling a program and running it under a queueing system is likely
    all you will ever do on your cluster.  We've done a typical
    application (Tachyon) and a not-so-typical one (NAMD).  At this
    point you might want to bpsh to a compute node to see what that
    environment is like, or go see how the Rocks folks are doing.  If
    you're really ambitious, download your own code and see if it
    compiles and runs. 
See Also
 Clustermatic web site (http://www.clustermatic.org/) 
 Grid Engine web site (http://gridengine.sunsource.net/) 
 NAMD web site (http://www.ks.uiuc.edu/Research/namd/) 
 Tachyon web site (http://jedi.ks.uiuc.edu/~johns/raytracer/)