Using Your Clustermatic Cluster

This exercise should be done while logged in as a normal user, not as root. You can create a normal user account with the command "adduser username" and then set the password with "passwd username".

Part 1: Run NAMD

NAMD is a parallel molecular dynamics application developed in our group. It is the main application run on our clusters.
  1. Copy the files NAMD_2.6b1_Linux-i686-Clustermatic4.tar.gz (NAMD binary) and apoa1.tar.gz (sample NAMD simulation) from the workshop CD and untar them in your home directory with:
    tar xzf apoa1.tar.gz
    tar xzf NAMD_2.6b1_Linux-i686-Clustermatic4.tar.gz
    
    Yes, the file says Clustermatic 4 but you're running 5. It's OK. Clustermatic 5 is actually backwards compatible for a change.
  2. cd NAMD_2.6b1_Linux-i686-Clustermatic4
  3. Start NAMD on all four machines with:
    ./charmrun +p4 ./namd2 ~/apoa1/apoa1.namd
    
    If you have problems, or want to see what's going on in the launch process, add ++verbose to the charmrun command line. The charmrun program interacts with the bproc system to find which nodes are up, including the master node. If not enough nodes are available it will start re-using nodes (useful for SMP nodes). Running charmrun without arguments will list its other options, such as ++skipmaster. Running with ++skipmaster will only work if the NAMD input files are available on the slaves. NAMD does all of its I/O from the master process, so we run it on the master node and access our main NFS servers.
  4. When NAMD reaches the line that says "TIMING 20 ..." kill it with Control-C and jot down the wallclock s/step number.
  5. Run NAMD again on two processors (change +p4 above to +p2) for 20 steps and compare the performance between the two. Do four processors run twice as fast as two? How close to twice?
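
One way to make the comparison in step 5 is to let awk do the division. This is only a sketch: the function name speedup and the sample timings are made up for illustration and are not part of NAMD or Clustermatic. Substitute the s/step numbers you jotted down.

```shell
#!/bin/sh
# Hypothetical helper: compare two wallclock s/step readings.
# Usage: speedup SLOW_S_PER_STEP FAST_S_PER_STEP PROCESSOR_RATIO
#   SLOW_S_PER_STEP  s/step from the run on fewer processors (e.g. +p2)
#   FAST_S_PER_STEP  s/step from the run on more processors (e.g. +p4)
#   PROCESSOR_RATIO  how many times more processors the fast run used
speedup() {
    awk -v slow="$1" -v fast="$2" -v ratio="$3" 'BEGIN {
        s = slow / fast                      # measured speedup
        printf "speedup %.2f, efficiency %.0f%%\n", s, 100 * s / ratio
    }'
}

# Made-up numbers: 2.0 s/step on two processors, 1.25 s/step on four.
speedup 2.0 1.25 2
```

With these sample values the script reports a speedup of 1.60 and an efficiency of 80%, meaning four processors delivered 80% of the ideal doubling.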

Part 2: Compile and Run Tachyon

Tachyon is a parallel ray tracer developed by John Stone for his master's thesis. It is an example of a typical MPI application.
  1. Copy the file tachyon-0.97.tar.gz (Tachyon source and examples) from the workshop CD and untar it in your home directory with:
    tar xzf tachyon-0.97.tar.gz
    
  2. cd tachyon/unix
  3. Use a text editor to open the file Make-arch
  4. Search for the config options for "linux-lam"
  5. Copy this set of options to a new entry.
  6. Change (in the new entry) linux-lam to linux-mpich
  7. Change "CC = hcc" to "CC = gcc"
  8. Change -I$(LAMHOME)/h to -I/usr/mpich-p4/include
  9. Change -L$(LAMHOME)/lib to -L/usr/mpich-p4/lib
  10. Change -lmpi to -lmpich
  11. Save, quit the editor and run "make linux-mpich" to build tachyon. If this doesn't work you probably missed one of the edits above, or applied them in the wrong place. The tachyon binary will end up in compile/linux-mpich/.
  12. cd (back to your home directory)
  13. Run Tachyon on the three slave machines with:
    /usr/mpich-p4/bin/mpirun -d -p 3 \
      tachyon/compile/linux-mpich/tachyon +V tachyon/scenes/balls.dat
    
    The Clustermatic mpirun is broken and does not allow the master node to be used for MPI jobs. This is fine for their 1000-processor clusters where they want minimal load on the master, but bad for us. Tachyon reads input on every node, so the NFS mounting of /home on the slaves is necessary.
  14. Look at the timing output, which is broken into different stages of the calculation. Run on two and one processors (change -p 3) and calculate speedups for the different stages as well as the total time.
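
If you would rather script the Make-arch edits from steps 4 through 10 than make them by hand, something like the following might do it. This is a sketch under assumptions: it guesses that the linux-lam entry starts with a line beginning "linux-lam:" and ends at the next blank line, and the function name make_mpich_entry is made up. Check the generated entry by eye before building.

```shell
#!/bin/sh
# Hypothetical helper: print a linux-mpich copy of the linux-lam entry.
# Usage: make_mpich_entry PATH_TO_MAKE_ARCH
# Assumes the entry runs from the "linux-lam:" line to the next blank line.
make_mpich_entry() {
    # First sed extracts the linux-lam entry; second sed applies the
    # same substitutions as steps 6 through 10.
    sed -n '/^linux-lam:/,/^$/p' "$1" |
    sed -e 's/linux-lam/linux-mpich/g' \
        -e 's/CC = hcc/CC = gcc/' \
        -e 's|-I\$(LAMHOME)/h|-I/usr/mpich-p4/include|' \
        -e 's|-L\$(LAMHOME)/lib|-L/usr/mpich-p4/lib|' \
        -e 's/-lmpi/-lmpich/'
}
```

From tachyon/unix, write the new entry to a scratch file first, for example "make_mpich_entry Make-arch > /tmp/mpich-entry", look it over, then append it with "cat /tmp/mpich-entry >> Make-arch". Going through a scratch file avoids reading and appending to Make-arch in the same pipeline.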

Part 3: Run Under Grid Engine

Sun Grid Engine (SGE) is a free, open source, general purpose, cross platform queueing system. In the genealogy of queueing systems, it is a descendant of the free DQS package, which was commercialized by a German company that was recently bought by Sun.
  1. Run "qstat -f" to see the queue that was automatically created. There should be only one queue, for the master node. The states column at far right is used for error flags.
  2. Run "qconf -sq all.q" to see the queue setup for the cluster. Note that there are many options to restrict user access, memory usage, runtime, etc. that are turned off by default. The only settings unique to this queue are the qname and hostlist. This is a newer version of Grid Engine than on the Rocks CD, so there will be a few differences if you compare them.
  3. Use a text editor to create the file tachyon.job containing:
    #$ -cwd
    #$ -j y
    #$ -S /bin/bash
    /usr/mpich-p4/bin/mpirun -d -p `bpstat -t allup` \
      tachyon/compile/linux-mpich/tachyon +V tachyon/scenes/balls.dat
    
    Notice the similarity to the command for running Tachyon manually. Since SGE doesn't know about bproc or the slave nodes, we use bpstat to find out how many slave nodes are up. The options preceded by #$ are parsed by SGE as if they were specified on the command line. -cwd causes the job to execute in the current working directory. -j y merges standard error and output into a single file. -S /bin/bash says to use the bash shell for this script.
  4. Submit the job to run on the full cluster with the command "qsub tachyon.job". Note that there is only one queue for the job to go to, all.q.
  5. Use "qstat -f" to check on the job until it is scheduled, then look for output files named tachyon.job.oX and tachyon.job.poX, where X is the job number output by qsub. View these files to see the output.
  6. Submit several jobs so that a backlog develops. You can use the same tachyon.job file for all of them (just use the up arrow and hit return to submit jobs quickly). Use qstat to monitor how the jobs are executed. The default scheduling policy is to take the earliest-submitted job that can be run, and the scheduler runs at regular intervals.
  7. Use a text editor to create the file namd.job containing:
    #$ -cwd
    #$ -j y
    #$ -S /bin/bash
    
    dir=$HOME/NAMD_2.6b1_Linux-i686-Clustermatic4
    $dir/charmrun +p$((`bpstat -t allup` + 1)) $dir/namd2 ~/apoa1/apoa1.namd
    
    Since NAMD uses the head node, we use some shell magic to add one to the number of available slave nodes returned by bpstat. If these were dual-processor nodes, we would need to multiply by two as well.
  8. Submit the job with the command "qsub namd.job".
  9. Use qstat to monitor the job until it starts running, then use "tail -f namd.job.oX" (X is the job number) to watch the job output.
  10. When you get tired of this, Control-C out of tail and use "qdel X" (X is the job number) to kill the job. Use qstat to monitor the job until it is killed.
  11. We are going to add a queue, so become root with "su root"
  12. Dump the all.q configuration with "qconf -sq all.q > /tmp/q".
  13. Open /tmp/q with an editor and change qname to express, subordinate_list to all.q, and h_rt to 60.
  14. Load the file as a new queue with "qconf -Aq /tmp/q". This creates a queue named express that will suspend jobs in the main queue for up to 60 seconds. In real life you would use a longer time, like 1800 seconds (30 minutes).
  15. exit from the root shell to become a normal user.
  16. Submit a long-running NAMD job with "qsub -q all.q namd.job"; use qstat to see that it starts in the default queue all.q. Now that we have multiple queues available it is important to be specific about which queue we want. Otherwise, if the all.q queue were busy, this job would run in the express queue and be killed after 60 seconds.
  17. Submit another NAMD job with "qsub -q express namd.job"; use qstat to see that it starts in the express queue, and that the old job has an S in the states column since it is suspended.
  18. View the live output of the old job with "tail -f namd.job.oX" to see that it is stopped. If you wait for a minute it should restart when the job in the express queue is killed for exceeding its time limit. Having an express queue is very useful for short test and setup runs. Only bproc-based systems like Clustermatic can do this smoothly for parallel runs (remember that SGE knows nothing about bproc). You must, however, be sure that the cluster has enough memory for both jobs.
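
Two pieces of this part lend themselves to small shell helpers: the processor count arithmetic from step 7 and the queue file edits from step 13. The sketch below is one way to write them. Both function names are made up, and charm_procs assumes the master has the same number of processors as each slave, which the exercise does not state.

```shell
#!/bin/sh
# Sketch only: these helper names are hypothetical, not part of SGE or bproc.

# charm_procs SLAVES_UP [CPUS_PER_NODE]
# Processor count for charmrun: slaves plus the master, times CPUs per node.
# Assumption: the master has as many CPUs as each slave. Default is 1 CPU.
charm_procs() {
    echo $(( ($1 + 1) * ${2:-1} ))
}

# express_queue_config
# Filter: reads "qconf -sq all.q" output on stdin and writes the express
# queue configuration on stdout, changing only qname, subordinate_list
# and h_rt. All other lines pass through untouched.
express_queue_config() {
    sed -e 's/^qname[[:space:]].*/qname                 express/' \
        -e 's/^subordinate_list[[:space:]].*/subordinate_list      all.q/' \
        -e 's/^h_rt[[:space:]].*/h_rt                  60/'
}
```

In namd.job the count could then be written as +p$(charm_procs `bpstat -t allup`), or with a second argument of 2 for dual-processor nodes. As root, "qconf -sq all.q | express_queue_config > /tmp/q" would replace steps 12 and 13, leaving "qconf -Aq /tmp/q" to load the result.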

Part 4: There Is No Part 4

Compiling a program and running it under a queueing system is likely all you will ever do on your cluster. We've done a typical application (Tachyon) and a not-so-typical one (NAMD). At this point you might want to bpsh to a compute node to see what that environment is like, or go see how the Rocks folks are doing. If you're really ambitious, download your own code and see if it compiles and runs.

See Also

Clustermatic web site (http://www.clustermatic.org/)

Grid Engine web site (http://gridengine.sunsource.net/)

NAMD web site (http://www.ks.uiuc.edu/Research/namd/)

Tachyon web site (http://jedi.ks.uiuc.edu/~johns/raytracer/)