Cluster Workshop - Build Your Own Clustermatic Cluster

Installation Instructions for Fedora Core 4 and Clustermatic 5

You should have the following parts:
  one master node (a machine with two network cards)
  several slave nodes
  a network switch for the private network, plus network cables
  a power strip
  a monitor, keyboard, and mouse
  Fedora Core 4 installation CDs
  Clustermatic 5 CDs
  the TCB Cluster (workshop) CD

Part 1: Install Fedora Core 4 on the Master Node

If you've installed Fedora Core before, the following may be quite tedious. If you've never installed Fedora Core before, the following may be quite mysterious. It's a necessary evil in either case.
  1. Plug the monitor into the power strip and turn it on.
  2. Find the machine with two network cards; this is the master node.
  3. Plug the master node into the power strip and connect the monitor, keyboard, and mouse.
  4. Power on the master node, open the CD-ROM drive, insert Fedora Core disk 1, and press the reset button. The machine should boot from the CD-ROM.
    If you wait too long and your machine starts booting off of the hard drive, just press the reset button to make it boot from the CD-ROM. If your machine still insists on booting from the hard drive, you may need to modify its BIOS settings.
  5. When the Fedora Core screen comes up, hit enter.
    If you don't have a mouse, type linux text at this point (instead of just hitting enter) to do a text-based install. The process is very similar to the graphical install.
  6. Skip testing the CD media.
    This takes far too long and has no real benefit for fresh installs.
  7. Click Next at the "Welcome to Fedora Core Linux!" message.
  8. Select English as your installation language.
  9. Select a US model keyboard.
  10. If your mouse was not automatically detected, select a generic 3-button PS/2 mouse.
  11. If the installer detects that another version of Linux is installed, click Install Fedora Core and hit Next.
  12. Select a Workstation install.
    We typically use Custom, but Workstation is good enough for now.
  13. Select Autopartition.
    We don't store files on our cluster machines, even on the master nodes, so it doesn't matter how the disk is set up. We use dedicated fileservers for storage.
  14. Select "Remove all partitions on this system".
    Again, we don't keep data on cluster machines.
  15. Yes, you really want to delete all existing data.
    Of course, at home you might not want to do this.
  16. Click Next at the GRUB boot loader screen.
  17. At the network configuration screen, set both cards to "Activate on boot." Select device eth0 and click Edit. In the dialog that appears, uncheck the "Configure using DHCP" checkbox, then enter 10.0.4.1 in the IP Address field and 255.255.255.0 in the Netmask field. Click OK after you have made these changes to return to the network configuration screen.
    This will be the interface to the private network.
  18. "Eth1" is for the outside network, and you should input the IP address given to you by your instructor. The netmask should be 255.255.255.0. Select "OK" when done with the interface.
  19. Enter these settings:
    Gateway:        130.126.120.1
    Primary DNS:    130.126.120.32
    Secondary DNS:  130.126.120.33
    Tertiary DNS:   130.126.116.194
    
    and select "OK" to continue.
    Note: these values are specific to our network. If you want to set up your own cluster later on, you'll have to get these addresses from your local sysadmin (which might be you!). A sketch of the configuration files these settings produce appears at the end of this part.
  20. Disable the firewall and SELinux.
    In most cases the cluster's private network will not be connected to the outside world, so it should be safe to disable the firewall if you trust your network; if not, you'll need to enable it.
  21. Click Proceed when the installer warns you about not having a firewall.
  22. The hardware clock should be set to GMT. Pick your time zone.
  23. Pick a root password that you will remember. Write it down.
  24. You don't need to customize the software selection or pick individual packages.
    However, you may want to do this for a production system. This is by far the easiest time to add packages to your cluster. On the other hand, the default install has 2 GB of software, so you could save some time in the next step if you pared the list down.
  25. Start the installation. It will take between 15 and 25 minutes to install Fedora Core 4 and will prompt you as necessary for additional disks.
  26. Make a boot floppy.
    Having a Linux boot floppy can be invaluable. A floppy made now will be unable to load kernel modules once you install Clustermatic, but it will still allow you to boot your machine and fix any misconfigurations. You probably won't need it today, though.
  27. At this point your Fedora Core 4 box is installed. Reboot the system when prompted to.
  28. After rebooting, the Welcome to Fedora Core 4 screen will pop up. Click Next.
  29. You will need to Agree to the License Agreement before continuing.
  30. Verify that the computer is set to the correct time; if not, correct it.
  31. At the Display Configuration screen you can either just click Next, or adjust the monitor configuration to "Generic LCD 1024x768" and then set the default screen resolution to 1024x768.
    You would normally only use the console of a production cluster during initial configuration or adding nodes, and you don't need a GUI for either of those, so there is little reason to configure X-Windows. Having multiple terminals available will be useful for this exercise, so we'll go ahead and configure X-Windows anyway.
  32. Create a username and password for yourself.
  33. Click Next at the sound card configuration screen.
  34. Click Next at the Additional CDs screen.
  35. Click Next at the Finish Setup screen.
  36. Congratulations, you've installed Fedora Core 4.
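
For reference, the network settings from steps 17-19 end up in the usual Fedora configuration files. The sketch below assumes the standard network-scripts layout; 130.126.120.200 is only a placeholder for whatever public address your instructor gave you.

  # /etc/sysconfig/network-scripts/ifcfg-eth0   (private network)
  DEVICE=eth0
  BOOTPROTO=static
  IPADDR=10.0.4.1
  NETMASK=255.255.255.0
  ONBOOT=yes

  # /etc/sysconfig/network-scripts/ifcfg-eth1   (outside network; example address)
  DEVICE=eth1
  BOOTPROTO=static
  IPADDR=130.126.120.200
  NETMASK=255.255.255.0
  ONBOOT=yes

  # /etc/sysconfig/network                      (default gateway)
  GATEWAY=130.126.120.1

  # /etc/resolv.conf                            (DNS servers)
  nameserver 130.126.120.32
  nameserver 130.126.120.33
  nameserver 130.126.116.194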

Part 2: Install Clustermatic 5 on the Master Node

The following will be new to everyone. You'll need to know how to use a Unix text editor. The examples below use the mouse-driven editor "gedit" rather than the more common "vi".
  1. Go to Desktop, System Settings, Security Level. Enable the Firewall, and set eth0 to be a trusted device, or the slaves won't be able to download the kernel.
  2. Open a terminal (right-click on the desktop, Open Terminal).
  3. In practice, we ensure that one network interface is persistently named "eth0" and the other "eth1" by using two completely different cards (one Intel, one 3Com); you may wish to do the same with your own clusters. Today, however, you have two identical cards (both Intel), and the Linux kernel keeps the naming of network cards consistent (barring hardware changes), so you only need to figure out which card is which interface once. Connect the private and public networks to the two cards; if that assignment turns out to be wrong, all you have to do is swap the two cables.
  4. Insert the Clustermatic 5 CD and wait for it to appear on your desktop.
    If you're not running X-Windows you need to mount it with mount /media/cdrom
  5. Install Clustermatic with rpm -ivh --force /media/cdrom/RPMS/i686/kernel*
    Substitute the directory for your architecture if you are not using a 32-bit Intel or AMD processor. The --force option lets you install a kernel version older than the one currently running.
  6. Make an initrd image for this new kernel with /sbin/mkinitrd /boot/initrd-2.6.9-cm46 2.6.9-cm46
    If you installed a different kernel in the previous step, adjust accordingly.
  7. Add this line to the libraries section:
    libraries /lib/libtermcap* /lib/libdl* /usr/lib/libz* /lib/libgcc_s*
    
    This ensures that libraries needed by NAMD are available on the slave nodes.
  8. gedit /boot/grub/grub.conf and add:
    title Clustermatic
    	kernel /vmlinuz-2.6.9-cm46 root=/dev/VolGroup00/LogVol00
    	initrd /initrd-2.6.9-cm46
    
    and edit the default line (a zero-based index into the list of kernel entries that follows) to point to the new entry you made.
    Make sure that the root device and the kernel/initrd paths follow the same conventions as the existing Fedora entry. If you installed a different kernel in the previous step, adjust the kernel and initrd names appropriately. A complete example grub.conf appears at the end of this part.
  9. Copy the compat-libstdc++-33 package (required for NAMD) from the workshop CD or website and install it with rpm -i compat-libstdc++*
  10. Unmount the Clustermatic CD with eject and remove it.
  11. Reboot the computer into the new kernel that you just installed.
  12. Login again and open up a terminal window.
  13. Insert the Clustermatic 5 CD again.
  14. Install the remaining packages with rpm -ivh /media/cdrom/RPMS/i586/beo*.rpm /media/cdrom/RPMS/i586/m*.rpm /media/cdrom/RPMS/i586/bp*.rpm
  15. gedit /etc/clustermatic/config and edit the nodes and iprange lines so they match the number of slave nodes in your cluster.
  16. If you want to share home directories with the slave nodes, then gedit /etc/clustermatic/config.boot to add
    bootmodule nfs
    modprobe nfs
    
    also gedit /etc/clustermatic/fstab to add
    MASTER:/home	/home	nfs defaults	0 0
    
    and gedit /etc/exports to add (with a tab between /home and *)
    /home    *(rw,sync)
    and finally run /sbin/chkconfig nfs on and /sbin/service nfs start (a quick way to verify the export is sketched at the end of this part)
  17. You may remove the CD.
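
For step 8 above, here is a sketch of what a complete /boot/grub/grub.conf might look like after the edit. The stock Fedora entry shown here (kernel version 2.6.11-1.1369_FC4) is only an illustration; keep whatever entry and options the installer actually wrote. With the default autopartition layout /boot is its own partition, so the paths are given relative to it.

  # 0 selects the first title below (the new Clustermatic entry)
  default=0
  timeout=5
  splashimage=(hd0,0)/grub/splash.xpm.gz

  title Clustermatic
          root (hd0,0)
          kernel /vmlinuz-2.6.9-cm46 root=/dev/VolGroup00/LogVol00
          initrd /initrd-2.6.9-cm46

  # the original Fedora entry; your kernel version and options may differ
  title Fedora Core (2.6.11-1.1369_FC4)
          root (hd0,0)
          kernel /vmlinuz-2.6.11-1.1369_FC4 ro root=/dev/VolGroup00/LogVol00 rhgb quiet
          initrd /initrd-2.6.11-1.1369_FC4.img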
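
If you set up the NFS export in step 16, you can sanity-check it on the master before booting the slaves. These are standard NFS utilities on Fedora Core:

  /usr/sbin/exportfs -v              # /home should be listed, exported to * with rw,sync
  /usr/sbin/showmount -e localhost   # the export list should contain /home
  /sbin/service nfs status           # the nfsd and mountd daemons should be running

Once the slaves are up (Part 3), bpsh 0 ls /home should show the same home directories as on the master.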

Part 3: Attach and Boot the Slave Nodes

  1. Plug the private network switch into an outlet on the power strip.
  2. Connect one of the master node's network cards to the switch, and the other to the "outside world" (we should have several larger switches connected to the outside network; ask if there is any confusion).
  3. Log in as root and open a terminal.
  4. Check the outside connection by pinging www.ks.uiuc.edu, either by name (ping -c 1 www.ks.uiuc.edu) or, bypassing DNS, by address (ping -c 1 130.126.120.32). If this fails ("unknown host www.ks.uiuc.edu" or no replies), swap the master node's network cables, wait a few seconds, and try again; a way to double-check which card received which address is shown after this list. If you still have trouble, feel free to ask for assistance.
  5. Create the level 2 boot image with beoboot -2 -n
    This builds the second stage boot image, which the slaves will download from the master over the network. You only need to run this command when you change the boot options in /etc/clustermatic/config or /etc/clustermatic/config.boot.
  6. Start up Clustermatic services with /sbin/service clustermatic start
  7. Open a second terminal and run /usr/lib/beoboot/bin/nodeadd -a eth0 there. The nodeadd program will run until you kill it with Ctrl-C. Leave it running!
    This process is only needed when adding new nodes to the cluster. The nodeadd program captures the hardware ethernet address of any machine trying to boot on the private network (eth0), adds it to the node list in /etc/clustermatic/config, and makes the beoboot daemon read the new list (-a). When a new node is detected, nodeadd will print the hardware address followed by a message about sending SIGHUP to beoserv.
  8. For each slave node plug in its power cable and network cable.
  9. Power on each slave node and insert a Clustermatic 5 CD.
  10. Switch to the second terminal and kill nodeadd with Ctrl-C
  11. Run tail /etc/clustermatic/config to see the new (uncommented) node addresses.
  12. Check the status of the cluster with bpstat
    Make sure that the number of nodes reported up matches the number of slaves you have. If you had not modified the nodes and iprange lines in /etc/clustermatic/config to match the size of your cluster, you would see the extra nodes harmlessly listed as down. Some useful bpstat invocations are shown after this list.
  13. Examine the log file from node 0 with less /var/log/clustermatic/node.0
    Each node has its own log file in /var/log/clustermatic. These log files only contain output from the final stages of slave startup, after the second stage kernel has contacted the master node.
  14. View the kernel messages from node 0 with bpsh 0 dmesg | less
    The bpsh command allows any binary installed on the master node to execute on one or more slave nodes (see the options in the appendix and the examples after this list). Interpreted scripts or programs requiring files found only on the master node cannot be run via bpsh.
  15. Reboot all the slaves with bpctl --slave all --reboot
    Any "Node is down" messages simply mean that fewer nodes were up than the number given in /etc/clustermatic/config when you issued the command.
  16. Log out.
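
For step 4, a quick way to confirm which card ended up as which interface, using the standard ifconfig tool:

  /sbin/ifconfig eth0        # should show inet addr 10.0.4.1 (the private network)
  /sbin/ifconfig eth1        # should show the public address your instructor gave you
  ping -c 1 130.126.120.1    # pinging the gateway tests the outside connection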
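
For step 12, a few bpstat invocations you may find handy (all of these options are documented in the appendix):

  bpstat              # compact listing of all slave nodes (the default)
  bpstat -U           # continuously update the status display; quit with Ctrl-C
  bpstat allup        # list only the nodes whose status is up
  bpstat 0-3          # show only nodes 0 through 3
  bpstat -t           # print the total number of nodes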
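
For step 14, some bpsh examples, again using only options documented in the appendix; they assume at least one slave is up:

  bpsh 0 uname -r     # report the kernel version running on node 0
  bpsh -a uptime      # run uptime on every node that is up
  bpsh -a -p df -h    # disk usage on every up node; -p prefixes each output line
                      # with the number of the node it came from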
There is more information about using Clustermatic at the end of this guide.

Part 4: Installing Sun Grid Engine

  1. Log into the system and open a terminal.
  2. Begin the installation by adding a user to run SGE. adduser sgeadmin
  3. Change to the home directory of the sgeadmin user you just created. cd /home/sgeadmin
  4. Insert the TCB Cluster CD into the Master Node.
  5. Unpack the common and platform-specific packages from the CD into the home directory of the sgeadmin user.
    tar xzf /media/cdrom/sge-6.0u6-bin-lx24-x86.tar.gz
    tar xzf /media/cdrom/sge-6.0u6-common.tar.gz
    
  6. Set your SGE_ROOT environment variable to the sgeadmin's home directory. export SGE_ROOT=/home/sgeadmin
  7. Run the setfileperm.sh script provided to fix file permissions. util/setfileperm.sh $SGE_ROOT
  8. Run gedit /etc/services and add the lines
    sge_qmaster 536/tcp
    sge_execd 537/tcp
    
    in the appropriate place. Save the file.
  9. Run the QMaster installer.
    # cd $SGE_ROOT
    # ./install_qmaster
    
  10. Hit Return at the introduction screen.
  11. When choosing the Grid Engine admin user account, hit y to specify a non-root user, and then enter sgeadmin as the user account. Hit return to proceed.
  12. Verify that the Grid Engine root directory is set to /home/sgeadmin.
  13. Since we set the ports needed by sge_qmaster and sge_execd in a previous step, we should be able to hit Return through the next two prompts.
  14. Hit return to set the name of your cell to "default".
  15. Use the default install options for the spool directory configuration.
  16. We already ran the file permission script, so we can answer y and skip this step.
  17. Since we are only going to have one execution host (the whole cluster counts as one), we can answer y when asked whether all hosts are in the same domain.
  18. The install script will then create some directories.
  19. Use the default options for the Spooling/DB questions.
  20. When prompted for group id range, use the default range of 20000-20100 unless you have a reason to do otherwise.
  21. Use the default options for the spool directory.
  22. The next step asks you to input an email address for the user who should receive problem reports. Typically this will be the person responsible for maintaining the cluster, but for now enter root@localhost
  23. Verify that your configuration options are correct.
  24. Answer y so that the qmaster will start up when the computer boots.
  25. The next step asks you to enter the names of your execution hosts. Say no to using a file, and when prompted for a host, enter localhost.
  26. The next thing that the configuration program will ask you to do is to select a scheduler profile. Normal will work for most situations, so that's what we'll use now.
  27. Our queue master is now installed. Run . /home/sgeadmin/default/common/settings.sh to set up some environment variables. Note that you should add this line to your login shell's startup file so you always have access to the grid engine utilities (see the example after this list).
  28. Each execution host must also have the execution daemon installed on it; in our case the whole cluster acts as a single execution host, the machine we've been setting up. Begin by running qconf -sh. If localhost is not listed as an administrative host, you will need to add it by running qconf -ah hostname.
  29. Run cd $SGE_ROOT and then ./install_execd to start the execution host installation script.
  30. Like the qmaster installation we can use all of the default options.
  31. After install_execd finishes running, use . /home/sgeadmin/default/common/settings.sh to set our environment variables accordingly.
  32. Congratulations, you now have a queuing system set up for your cluster (a quick test job is sketched below). Now to do some real work.
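
To make step 27's environment settings permanent, you can source settings.sh from your login shell's startup file; for example, assuming the bash shell:

  echo '. /home/sgeadmin/default/common/settings.sh' >> ~/.bashrc
  . ~/.bashrc      # pick up the settings in the current shell as well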
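
To check that the queue really works, you can submit a trivial job with qsub and watch it with qstat, both standard Grid Engine commands. The one-line script below is just an illustration; submit it as your ordinary user account rather than root, and if submission is refused you may need to add the master as a submit host with qconf -as hostname.

  echo 'echo "Hello from `hostname`"' > hello.sh   # a tiny test job script
  qsub -cwd hello.sh    # -cwd places the job's output files in the current directory
  qstat                 # shows the job while it is pending or running
  cat hello.sh.o*       # the job's standard output appears here when it finishes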

Appendix: Usage Options for Common Bproc Utilities

bpstat: monitor status of slave nodes
Usage: bpstat [options] [nodes ...]
  -h,--help     Display this message and exit.
  -v,--version  Display version information and exit.

 The nodes argument is a comma delimited list of the following:
   Single node numbers - "4"    means node number 4
   Node ranges         - "5-8"  means node numbers 5,6,7,8
   Node classes        - "allX" means all slave nodes with status X
                         "all"  means all slave nodes
 More than one nodes argument can be given.
 Valid node states are:
        down    boot    error   unavailable     up

 Node list display flags:
  -c,--compact      Print compacted listing of nodes. (default)
  -l,--long         Print long listing of nodes.
  -a,--address      Print node addresses.
  -s,--status       Print node status.
  -n,--number       Print node numbers.
  -t,--total        Print total number of nodes.

 Node list sorting flags:
  -R,--sort-reverse Reverse sort order.
  -N,--sort-number  Sort by node number.
  -S,--sort-status  Sort by node status.
  -O,--keep-order   Don't sort node list.

 Misc options:
  -U,--update   Continuously update status
  -L,--lock     "locked" mode for running on an unattended terminal
  -A hostname   Print the node number that corresponds to a
                host name or IP address.
  -p            Display process state.
  -P            Eat "ps" output and augment. (doesn't work well.)
bpctl: alter state of slave nodes
Usage: bpctl [options]
  -h,--help              Print this message and exit
  -v,--version           Print version information and exit
  -M,--master            Send a command to the master node
  -S num,--slave num     Send a command to slave node num

  -s state,--state state Set the state of the node to state
  -r dir,--chroot dir    Cause slave daemon to chroot to dir
  -R,--reboot            Reboot the slave node
  -H,--halt              Halt the slave node
  -P,--pwroff            Power off the slave node
  --cache-purge-fail     Purge library cache fail list
  --cache-purge          Purge library cache
  --reconnect master[:port[,local[:port]]]
                         Reconnect to front end.

  -m mode,--mode mode    Set the permission bits of a node
  -u user,--user user    Set the user ID of a node
  -g group,--group group Set the group ID of a node

  -f                     Fast - do not wait for acknowledgement from
                         remote nodes when possible.

The valid node states are:
        down    boot    error   unavailable     up
bpsh: run programs on slave nodes
Usage: bpsh [options] nodenumber command
       bpsh -a [options] command
       bpsh -A [options] command
       -h     Display this message and exit
       -v     Display version information and exit
  Node selection options:
       -a     Run the command on all nodes which are up.
       -A     Run the command on all nodes which are not down.
  IO forwarding options:
       -n     Redirect stdin from /dev/null
       -N     No IO forwarding
       -L     Line buffer output from remote nodes.
       -p     Prefix each line of output with the node number
              it is from. (implies -L)
       -s     Show the output from each node sequentially.
       -d     Print a divider between the output from each
              node. (implies -s)
       -b ##  Set IO buffer size to ## bytes.  This affects the
              maximum line length for line buffered IO. (default=4096)
       -I file
       --stdin file
              Redirect standard in from file on the remote node.
       -O file
       --stdout file
              Redirect standard out to file on the remote node.
       -E file
       --stderr file
              Redirect standard error to file on the remote node.
bpcp: copy files to slave nodes
Usage: bpcp [-p] f1 f2
       bpcp [-r] [-p] f1 ... fn directory

       -h     Display this message and exit.
       -v     Display version information and exit.
       -p     Preserve file timestamps.
       -r     Copy recursively.

  Paths on slave nodes are prefixed by nodenumber:, e.g., 0:/tmp/
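
  For example (node 0 is arbitrary; any node that is up will do, and /tmp/ is just a convenient target):

  bpcp /etc/hosts 0:/tmp/                # copy a file from the master to /tmp on node 0
  bpsh 0 cat /tmp/hosts                  # read it back on node 0
  bpcp -r -p /etc/clustermatic 0:/tmp/   # recursive copy, preserving timestamps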

See Also

Clustermatic web site (www.clustermatic.org) and Clustermatic 5 README

BProc: Beowulf Distributed Process Space web site (bproc.sourceforge.net)