Using ``dsh`` for System Monitoring
===================================

*dsh* is a component of the clusterit_ toolkit that runs a shell command on
many machines in parallel. This is useful for quickly checking the status of
a large number of machines.

Installation
------------

Installing *dsh* and its associated tools is straightforward. After
downloading and extracting the archive contents, follow the usual
*./configure*, *make*, *make install* routine on Linux. On a clean install of
OS X 10.6.5, the following procedure worked::

  ./configure --x-includes=/usr/X11/include --x-libraries=/usr/X11/lib
  make
  sudo make install


Configuration
-------------

*dsh* reads its list of machines from a plain text file whose path is given by
the *CLUSTER* environment variable. The file ``data/cluster.txt``, reproduced
below, contains a list of machines that was current in late 2010. Note that
this list should be generated dynamically from a service database.

.. include:: ../data/cluster.txt
   :literal:
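
As a minimal sketch of the setup described above, the cluster file can be a
plain list of host names, one per line, with *CLUSTER* exported before calling
*dsh*. The path ``$HOME/cluster.txt`` is a hypothetical location chosen for
illustration, and the three host names are taken from the example output
further down::

  # Hypothetical location; any readable path works.
  export CLUSTER="$HOME/cluster.txt"

  # Assumed file layout: one host name per line.
  printf '%s\n' \
    host-unm-1.dataone.org \
    cn-dev.dataone.org \
    monitor.dataone.org > "$CLUSTER"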


When invoked, *dsh* by default executes the specified command on every machine
listed in the file named by *CLUSTER*, which requires authenticating with each
of those machines. Setting up public key authentication on each machine avoids
repeated password prompts.
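
One way to set this up is sketched below, assuming OpenSSH with
``ssh-copy-id`` available on the local machine (it is not installed by default
on every platform) and a cluster file with one host name per line. The helper
names ``cluster_hosts`` and ``install_keys``, and the skipping of blank lines
and ``#`` comment lines, are assumptions made for illustration::

  # Print the host names in the cluster file, one per line.
  # Blank lines and lines starting with '#' are skipped (an assumption
  # about how the file might be annotated).
  cluster_hosts() {
    grep -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' "${1:-$CLUSTER}"
  }

  # Install the local public key on every host. Each host still asks
  # for a password this one time. Assumes a key pair already exists
  # (create one with ssh-keygen otherwise).
  install_keys() {
    for host in $(cluster_hosts "$@"); do
      ssh-copy-id "$host"
    done
  }

Calling ``install_keys`` with no argument reads the file named by *CLUSTER*;
passing a path uses that file instead.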


Examples
--------

Report uptime on every machine, showing any connection errors::

  $ dsh -e uptime
  host-unm-1.dataone.org      :  15:25:47 up 189 days,  5:34,  0 users,  load average: 1.78, 1.56, 1.52
  host-orc-1.dataone.org      :  17:32:42 up 91 days,  4:54,  0 users,  load average: 0.54, 0.36, 0.24
  host-ucsb-1.dataone.org     :  14:14:22 up 114 days, 23:33,  0 users,  load average: 0.80, 0.44, 0.37
  controller-unm-1.dataone.org:  15:26:02 up 3 days,  5:26,  0 users,  load average: 0.04, 0.03, 0.00
  host-unm-2.dataone.org      :  15:25:43 up  3:58,  0 users,  load average: 0.00, 0.00, 0.00
  cn-ucsb-1.dataone.org       :  22:25:08 up 5 days, 21:24,  1 user,  load average: 0.00, 0.00, 0.00
  cn-dev.dataone.org          :  14:25:45 up 13 days, 34 min,  0 users,  load average: 0.00, 0.00, 0.00
  cn-unm-1.dataone.org        :  22:25:46 up 9 days, 22:40,  1 user,  load average: 0.02, 0.07, 0.08
  cn-dev-2.dataone.org        :  16:25:43 up 24 min,  3 users,  load average: 0.00, 0.00, 0.00
  cn-orc-1.dataone.org        :  22:26:13 up 7 days,  5:33,  1 user,  load average: 0.01, 0.03, 0.00
  dev-dryad-mn.dataone.org    :  22:32:31 up 189 days,  5:34,  1 user,  load average: 0.46, 0.11, 0.03
  dev-fedora-mn.dataone.org   : ssh: connect to host dev-fedora-mn.dataone.org port 22: Operation timed out
  daacmn-dev.dataone.org      :  22:32:35 up 91 days,  4:49,  0 users,  load average: 0.00, 0.04, 0.06
  monitor.dataone.org         :  22:25:47 up 189 days,  5:34,  0 users,  load average: 0.20, 0.09, 0.09
  mule1.dataone.org           :  22:25:46 up 189 days,  4:06,  0 users,  load average: 0.00, 0.00, 0.00
  public-web.dataone.org      : ssh: connect to host public-web.dataone.org port 22: Operation timed out
  redmine.dataone.org         :  22:26:03 up 3 days,  5:22,  0 users,  load average: 0.00, 0.00, 0.00
  epad.dataone.org            :  22:25:46 up 3 days,  4:32,  0 users,  load average: 0.00, 0.00, 0.00
  trac.dataone.org            :  14:25:43 up 202 days, 19:19,  1 user,  load average: 0.20, 0.10, 0.07
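
With many machines, the failures are often the most interesting lines. The
following sketch reduces output of the form shown above to the hosts that
could not be reached. The function name ``unreachable_hosts`` is hypothetical,
and redirecting stderr with ``2>&1`` is a precaution in case *dsh* reports
errors on that stream::

  # Read dsh output on stdin and print the name of each host whose line
  # contains an ssh connection error. The host name is the text before
  # the first colon, with padding removed.
  unreachable_hosts() {
    awk -F: '/ssh: connect to host/ { gsub(/[ \t]+/, "", $1); print $1 }'
  }

  # Usage: dsh -e uptime 2>&1 | unreachable_hosts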


.. _clusterit: http://www.garbled.net/clusterit.html