Thursday, January 8, 2009

System Downtime Monitoring Using Universal Agent

ITM provides system uptime monitoring out of the box. You just select the uptime attribute and use it in a situation formula. It is that simple. Okay, but what if you want to monitor system downtime? It may sound a little difficult, but with the Universal Agent (UA) you can gauge this value with a simple MDL and script combination. This article explains how to do it. It also shows how to use some of the UA's time functions.

How does it work?
To calculate downtime, we need a simple script that outputs the current date and time along with the date and time of its previous run. We use the UA to capture these values. So, how do we get the previous value? The UA provides this feature out of the box, and it is documented! Just use the environment variable $PREV_VALUE in your script. Unfortunately, this value does not persist across UA restarts, so your script should also store the time of its last run in a file somewhere. You can then use the UA functions to convert the script output to ITM timestamps. After this, the current time and the time of the previous run appear as attributes in the portal. You can write a simple situation that uses the Time Delta function to calculate the difference between the two times and raise an alert.
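To check the arithmetic outside the portal, the gap between the two timestamps can be computed by hand. The following is only a sketch: it assumes GNU date (the -d option is not available in every date implementation), and seconds_between is a hypothetical helper name, not part of ITM or the UA.

```shell
#!/bin/sh
# Sketch only: compute the gap in seconds between two timestamps written
# in the same "%m/%d/%Y %H:%M:%S" format that the monitoring script prints.
# Assumes GNU date; seconds_between is a hypothetical helper name.
seconds_between() {
    prev_epoch=$(date -d "$1" +%s) || return 1
    curr_epoch=$(date -d "$2" +%s) || return 1
    echo $((curr_epoch - prev_epoch))
}

# Example with made-up values: previous run first, current run second.
seconds_between "01/08/2009 10:00:00" "01/08/2009 10:05:00"   # prints 300
```

A gap much larger than the sampling interval suggests that the script, and probably the system, was not running in between.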

MDL
A simple MDL listing is given below. It is an example only. Perform your own testing to ensure it works.

//APPL V02_SYSTEM_DOWNTIME
//NAME DOWNTIME K 300 AddTimeStamp Interval=60
//SOURCE SCRIPT /opt/gbs/bin/downtime.sh 
//ATTRIBUTES 
Hostname (GetEnvValue = HOSTNAME)
CurrentDate D 10
CurrentTime D 10
PrevDate D 10
PrevTime D 10
CurrentDateTime (CurrentDate + CurrentTime)
PrevDateTime (PrevDate + PrevTime)
CurrentTimeStamp (TivoliTimeStamp = CurrentDateTime)
PrevTimeStamp (TivoliTimestamp = PrevDateTime)
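
To see how one line of script output lines up with the four attributes above, here is a small sketch. It assumes the UA splits the line on whitespace, which is what the four attribute definitions imply; the sample line is made-up data.

```shell
#!/bin/sh
# Sketch only: split one line of script output into the four fields
# that the attribute list expects. The sample line is made-up data.
line="01/08/2009 10:05:00 01/08/2009 10:00:00"

# Left unquoted on purpose so the shell splits the line on whitespace.
set -- $line
current_date=$1
current_time=$2
prev_date=$3
prev_time=$4

echo "CurrentDate=$current_date CurrentTime=$current_time"
echo "PrevDate=$prev_date PrevTime=$prev_time"
```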

Script

Here is a sample shell script that retrieves the current and previous timestamp values.

#!/bin/sh

# Latest Timestamp
current_value=`date "+%m/%d/%Y %H:%M:%S"`

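# If the UA supplies PREV_VALUE, print the current and previous values;
# otherwise read the previous value from the persistent file.

if [ "x$PREV_VALUE" != 'x' ]
then
   echo "$current_value $PREV_VALUE"
elif [ -r /tmp/downtime.txt ]
then
   prev_value=`cat /tmp/downtime.txt`
   echo "$current_value $prev_value"
else
   # First run: nothing saved yet, so report the current time as previous
   echo "$current_value $current_value"
fi

# Finally, store the current timestamp in the persistent file
echo "$current_value" > /tmp/downtime.txt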
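
If the system goes down while the file is being written, the saved timestamp could be left empty or truncated. One way to reduce that risk is to write to a scratch file and rename it into place. This is only a sketch: STATE_FILE and save_timestamp are hypothetical names, and the /var/tmp default is an assumption (that directory is normally kept across reboots, unlike /tmp on some systems).

```shell
#!/bin/sh
# Sketch only: save the timestamp through a scratch file and a rename.
# STATE_FILE and save_timestamp are hypothetical names; the default
# location under /var/tmp is an assumption, adjust it for your system.
STATE_FILE="${STATE_FILE:-/var/tmp/downtime.txt}"

save_timestamp() {
    tmp_file="$STATE_FILE.$$"
    # Write the new value to a scratch file next to the real one ...
    printf '%s\n' "$1" > "$tmp_file" || return 1
    # ... then rename it into place, so readers never see a partial file.
    mv -f "$tmp_file" "$STATE_FILE"
}
```

In the script above, calling save_timestamp "$current_value" would take the place of the final echo.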

Drawbacks

Does this solution provide an accurate downtime estimate? No, it doesn't. For example, you may get false alerts if the UA goes down for some reason. Also, it measures the time between the script's previous run and its latest run, not exactly the time between system reboots. But these are minor drawbacks to live with!

Questions, comments? Please feel free to post them.  

1 comment:

Thilo Mohri said...

Hi,

Nice script. I'm currently working on a script that does the same on Windows boxes via WMI.

Nevertheless, there is one big drawback with your script: depending on the operating system, /tmp is cleared upon reboot, so it is better to change this to a static path or just store the file in the root filesystem.

Regards,
Thilo
blog.tmohri.de