------------------------------------
System Load Monitoring Program Notes
------------------------------------
(version 1.1)


The daemons:
------------

sysmond -- system load monitoring (the client)

  Wakes up at a fixed interval (the sleep time set with -t, 60 seconds
  by default), determines the date and the most recent system load
  average (using uptime), and sends this information in a UDP packet
  to the server.
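
  As an illustration of the "using uptime" step, the following is a
  minimal sketch of pulling the three load averages out of one line of
  uptime output.  The function name parse_uptime_loads is hypothetical
  and is not taken from the sysmond source, and the sketch assumes the
  common "load average: a, b, c" layout; the exact uptime format varies
  between systems.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not from the sysmond source): extract the 1, 5,
 * and 15 minute load averages from a line of uptime output.
 * Returns 0 on success and -1 if the line does not look as expected.
 */
int parse_uptime_loads(const char *line, double loads[3])
{
    /* Matches both "load average:" and "load averages:". */
    const char *p = strstr(line, "load average");
    if (p == NULL)
        return -1;

    /* The numbers follow the colon after the label. */
    p = strchr(p, ':');
    if (p == NULL)
        return -1;

    /* Values may be separated by commas, spaces, or both. */
    if (sscanf(p + 1, " %lf%*[, ]%lf%*[, ]%lf",
               &loads[0], &loads[1], &loads[2]) != 3)
        return -1;

    return 0;
}
```

  A caller would read the line from a pipe connected to uptime and pass
  it to this function; a return of -1 would mean the report for that
  cycle should be skipped.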

Usage for sysmond:

sysmond  [ -h | [-m <machine_name>] [-p <port_number>]
        [-t <sleep_time>] [-d] [-n <number>] ]

Options:
        -h      help (print this information)

        -m      specify machine name of system load collector
                (defaults to "localhost")

        -p      specify port number of system load collector
                (defaults to "6271")

        -t      specify sleep time (in seconds) between system load reporting
                (defaults to "60")

(the other options are for debugging purposes)

        -d      print diagnostic message on each reporting
                (turned off by default)

        -n      number of times to report load averages
                (otherwise, loops forever)
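
As described in the notes on improvements later in this document, the
sleep time given with -t also determines which of the three uptime
values is reported.  A minimal sketch of that rule follows; the enum
and function names are hypothetical and are not taken from the sysmond
source.

```c
/* Which of the three load averages reported by uptime to send. */
enum load_window { LOAD_1MIN = 0, LOAD_5MIN = 1, LOAD_15MIN = 2 };

/*
 * Hypothetical helper: pick the load average window that best matches
 * the reporting interval (in seconds), using the thresholds stated in
 * these notes (150 and 450 seconds).
 */
enum load_window choose_load_window(int sleep_time)
{
    if (sleep_time > 450)      /* longer than 7 1/2 minutes */
        return LOAD_15MIN;
    if (sleep_time > 150)      /* longer than 2 1/2 minutes */
        return LOAD_5MIN;
    return LOAD_1MIN;          /* 150 seconds or less */
}
```

With the default sleep time of 60 seconds, this selects the 1 minute
value.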


syscold -- system load collector (the server)

  Receives UDP packet system load messages from clients and stores the
  client name, time, and load average information into a log file.
  Whenever the date changes, the log file is renamed, and a child is
  spawned to compute the daily load average for each machine listed in
  the file.  These daily load averages are to be stored in a database.

Usage for syscold:

syscold  [ -h | [-p <port_number>] [-l <log_file>]
        [-o <old_log_file>] [-a] [-u <update_script>] [-d] [-n <number>] ]

Options:
        -h      help (print this information)

        -p      specify port number for receiving system load messages
                (defaults to "6271")

        -l      specify log file for storing system load messages
                (defaults to "log")

        -o      specify name for yesterday's log file
                (defaults to "log.old")

        -a      compute current load averages from log file then exit
                (not the default)

        -u      specify script used to update system load database
                (defaults to "./add-sysload")

(the other options are for debugging purposes)

        -d      print diagnostic message on each system load message received
                (turned off by default)

        -n      number of system load messages to receive before quitting
                (otherwise, loops forever)

(Note that the -u option shows up only when syscold is made "active."
See below for details.)


Control scripts:
----------------

The control scripts for the daemons are named syscol, sysmon, and 
scyld_sysmon, although these names can be redefined.  Placing these
scripts in the /etc/init.d directory and setting the appropriate links in
the run level directories lets the daemons be started and stopped
automatically during machine bootup and shutdown.  They have the virtue
of calling the daemons with a consistent set of command line parameters,
while attempting to prevent a daemon from being started more than once.

The syscol and sysmon scripts accept the following options:

  "start"  ---  Start the daemon using the defined parameters.

  "stop"  ---  Kill the daemon.

The syscol script also accepts the following options:

  "status"  ---  Tells whether or not syscol is active and computes
      the load averages from the current log file contents.

  "dbupdate"  ---  Forces an update to the database (as long as syscold
      has been made "active") with the current log file contents using
      today's date.  However, this also removes the log file in the
      process.

The command line options used to invoke the daemons are built into the
control scripts using definitions from 'Make.options'.


Compiling the daemons and creating the control scripts:
-------------------------------------------------------

The daemons should be compatible with all UNIX systems.  They have been
tested and demonstrated to work on SunOS, IRIX, Digital UNIX (Alpha), 
Linux, and HP-UX.  There is a version of sysmond, called scyld_sysmond,
built automatically for Linux that supports Scyld Beowulf clusters by
parsing the output from 'beostat -l'.

The options for make:

  "make [all]"  ---  (The "all" is optional.)  Builds sysmond and syscold; 
      syscold will output daily load averages to standard output.  Also
      creates control scripts that can be used for system startup.

  "make active"  ---  Builds sysmond and syscold; syscold will execute the
      script to add daily load averages to the system loads database.  Also
      creates control scripts that can be used for system startup.

  "make install"  ---  Installs the daemons and the control scripts.

The object and executable files will be stored in a subdirectory with
the name for that particular architecture.  Type "make help" for the
supported architecture names.

Default configuration options can be modified in defopts.h for sysmond
and syscold.  Definitions used in the control scripts are defined in
'Make.options'.  The options file used can be overridden from the
command line; for example, typing 'make "LOCALOPTS=.local"' and 
'make "LOCALOPTS=.local" install' will use the options defined in the
file 'Make.options.local'.  These options include command line parameters
used when starting the daemons as well as installation locations.


Improvements and bug fixes from the previous version:
-----------------------------------------------------

- One log file is now used for all clients, rather than keeping a
  different log file per client.

- The syscold daemon will flush its log file when given a SIGTERM signal,
  which is the default signal sent by the "kill" command.  Whenever
  syscold is started, it will append to its log file if the file already 
  exists.  This means that syscold now behaves correctly when stopped and
  restarted, so it can be controlled from an /etc/init.d script.

- There is now a version of sysmond, called scyld_sysmond, that
  supports Scyld Beowulf clusters by parsing the output of 'beostat -l'.
  The load average is computed using all "up" nodes.

- Depending on the sleep time given to sysmond (or scyld_sysmond), the
  5 minute or 15 minute uptime (or 'beostat -l') value might be used.
  If the sleep time is greater than 450 seconds (7 1/2 minutes), the
  15 minute value is used.  If the sleep time is less than or equal to
  450 seconds and greater than 150 seconds (2 1/2 minutes), the 5 minute
  value is used.  Otherwise, the 1 minute value is used.

- The script for updating the system load database can now be given as
  a parameter.

- Getting the system load average (in sysmond) requires forking a child
  to execute "uptime" and redirecting its output through a pipe back to
  the parent.  The stdout stream is now flushed before this fork, just
  in case it hasn't happened yet.  Similarly, stdout is also flushed
  before syscold forks to process load averages from the log file.

- The sendto implementation on Linux for UDP packets fails with errno
  == ECONNREFUSED if no one is there to receive the message, which should
  be treated as non-fatal.  This situation is now handled appropriately.
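
The ECONNREFUSED handling in the last item can be sketched as follows.
This is only an illustration under stated assumptions: the function
names send_error_is_fatal and send_report are hypothetical, and the
actual sysmond code may classify other errors differently.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: decide whether an errno value from sendto()
 * should stop the client.  ECONNREFUSED only means that no collector
 * is listening right now, so it is treated as non-fatal.
 */
int send_error_is_fatal(int err)
{
    return err != ECONNREFUSED;
}

/*
 * Hypothetical wrapper around sendto().  Returns 0 if the report was
 * sent or was dropped for a non-fatal reason, and -1 on a fatal error.
 */
int send_report(int sock, const void *msg, size_t len,
                const struct sockaddr *to, socklen_t tolen)
{
    if (sendto(sock, msg, len, 0, to, tolen) < 0) {
        if (send_error_is_fatal(errno))
            return -1;
        /* Nobody is listening; skip this report and try again later. */
    }
    return 0;
}
```

With this approach the client keeps running while the collector is
down and resumes reporting once the collector is back.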


Known deficiencies:
-------------------

- The syscold daemon will average (using the trapezoid rule) only over
  points present in the log file.  In other words, if a client goes down,
  a load of "0" will not be recorded and there will simply be no data
  logged during the downtime; syscold does not attempt any intelligent
  processing of the raw data.  So the assumption for this averaging is
  that machines are always up or are down only briefly.

- Although the script name for updating the system load database can be
  redefined, the calling syntax for the script is still hard-coded into
  syscol.c:  "scriptname machine avgload maxload date" (with "date" given
  in the form "2001-02-28" and with "maxload" set to the peak reported
  usage for that day).  

- The 'syscol dbupdate' option is really just a kludge ('syscold -n0 ...')
  that not only removes the log file after processing, but also does its
  processing using the current date.  It would be much better if it
  left the log file alone and allowed a date to be specified.  For
  example, if the host for syscold is shut down and isn't restarted
  until the next day, this would then be a way of supplying an average
  for the previous day.

- The code isn't terribly robust yet and probably doesn't do all of the
  error checking that it should, although it is better than the initial
  release.
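
To make the first deficiency above concrete, here is a minimal sketch
of a trapezoid-rule daily average.  It is not the syscold code: the
function name trapezoid_average is hypothetical, and the sample times
are assumed to be in seconds and sorted in increasing order.

```c
#include <stddef.h>

/*
 * Hypothetical helper: time-weighted average of n load samples using
 * the trapezoid rule.  t[i] is the time of sample i (assumed to be in
 * seconds, increasing) and load[i] is the load reported at that time.
 */
double trapezoid_average(const double *t, const double *load, size_t n)
{
    double area = 0.0;
    double span;
    size_t i;

    if (n == 0)
        return 0.0;
    if (n == 1)
        return load[0];

    /* Sum the area of the trapezoid between each pair of samples. */
    for (i = 1; i < n; i++)
        area += 0.5 * (load[i - 1] + load[i]) * (t[i] - t[i - 1]);

    /* Divide by the time actually covered by the samples. */
    span = t[n - 1] - t[0];
    return (span > 0.0) ? area / span : load[0];
}
```

Because only logged points enter the sum, a gap in the data is bridged
by a straight line between the samples on either side of it.  An
average computed this way therefore treats a machine that was down
during the gap as if it had been loaded throughout, rather than
counting the downtime as zero load.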


Known inefficiencies and suggested improvements:
------------------------------------------------

- Instead of using UDP packets for each message, we might consider the
  network bandwidth to be the valuable resource, and instead make the
  clients log their data locally, then connect to the server once every 
  24 hours using TCP to send the daily load average.

- Even better, since the averaging operation is extremely simple and takes
  very little memory, why not retain everything in memory instead of using
  any local disk space for a temporary log file?  Make the client respond
  gracefully to a kill signal to save its state to a local file.  On
  startup, the client will look for this file to continue where it left
  off.  With this scenario, we can be more intelligent about when to look
  at the date.  This solution would save on *all* resources used.  Each
  client could even call the database update script with its own average,
  which means that there is no server daemon needed.  (Can we trust the
  clients to do this?  Would there be any race conditions if several
  clients try to call the script at the same time?)  The client could even
  analyze its logfile to properly indicate zero values for periods
  when the machine has been down.  Of course, the analysis by syscold
  could also take client downtime into account.  Note, however, that this
  approach no longer provides a centralized way of checking current daily
  averages with a 'syscol status' command.


Contact information:
--------------------

The author, David Hardy, can be reached by email at dhardy@ks.uiuc.edu.
