Wednesday, March 12, 2008

Host Availability monitoring: what's the "best" solution?

More often than not, host availability monitoring is more of a philosophy than a solution. Some solutions are simple, some complex, and some are more resource-intensive than others. Here are some of the more popular solutions, based on my experience, with some pros and cons for each.

ICMP
In my experience, when you get right down to it, most companies do nothing more than a ping to check host availability in their environment. An ICMP echo-request/echo-reply exchange tells you that the network allowed an ICMP packet to successfully traverse the network to the monitored device and back, and that something (the network card) is listening on the other end.

Pros

Lightweight - doesn't task the monitored host (if implemented correctly)

Fast - ping can be tuned for speed, and there are other ICMP implementations out there (e.g. fping) that are even faster than a traditional ping.

Easy - ease of implementation is important when considering total cost of development

Cons

Unreliable - ICMP is often blocked by network devices such as routers and firewalls, so you may have to engage your network team to allow ICMP if this is your chosen solution.

"False" alarms - since the check is just an echo from the network device, issues like network latency can often cause your monitor to fire even though the host is still up and available. This can be combated by tuning retries and timeouts. Conversely, the host OS could be unresponsive while the network card still responds, so the monitor never picks up on the problem. It is critical to the success of this solution that folks understand what an ICMP monitor actually checks: whether the attached network device is reachable via ICMP over the network, not whether the host is truly "down".

Network friendliness - ICMP implemented the wrong way can cause network problems by flooding the network with ICMP request/reply packets.
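
To make the retry and timeout tuning mentioned above concrete, here is a minimal sketch. It assumes a Linux-style system ping (the -c and -W flags behave differently on other platforms), and the function names and default values are my own placeholders, not a recommendation.

```python
import subprocess
import time


def ping_once(host, timeout_s=1):
    """Send one echo request with the system ping (Linux iputils flags assumed)."""
    # -c 1: send a single packet; -W: seconds to wait for the reply
    try:
        result = subprocess.run(
            ["ping", "-c", "1", "-W", str(timeout_s), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s + 1,  # safety net in case ping itself hangs
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def host_reachable(host, retries=3, delay_s=0.5, probe=ping_once):
    """Return True as soon as one attempt succeeds, False after all retries fail."""
    for attempt in range(retries):
        if probe(host):
            return True
        if attempt < retries - 1:
            time.sleep(delay_s)  # pause between attempts to stay network friendly
    return False
```

Requiring several failed attempts before alarming trades a slower detection time for fewer latency-induced "false" alarms, and the pause between attempts keeps the monitor from flooding the network.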


Heartbeat
A script or agent running locally on the host sends a "keep alive" to a central server.

Pros

More reliable - a better indicator than ICMP that the host is having a problem, since the script or agent is running on the host OS itself.

More information - a script or agent can collect additional metrics during every monitoring cycle, telling you more about the host's current condition than a ping can.

More control - additional control and tuning is possible through an agent or script that can truly tell you that a host is no longer responding.

Cons

Slower - more complexity takes longer to run at the host. More resources are used locally on the monitored host, and on the central monitoring server that tracks all of the heartbeats. For this reason a heartbeat solution usually can't run at intervals as short as a ping's.

Harder - there is development involved in creating scripts/configuring agents that is not involved with a simple ping. If you are looking for a quick and dirty solution, this one may not fit the bill.
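
As a rough illustration of the central-server side of a heartbeat, the sketch below records the last "keep alive" seen from each host and reports the ones that are overdue. The class name, the 180-second threshold, and the host names in the usage note are all hypothetical; how the heartbeats actually arrive (HTTP, syslog, an agent protocol) is left out.

```python
import time


class HeartbeatTracker:
    """Remember when each host last checked in and flag the silent ones."""

    def __init__(self, max_age_s=180):
        self.max_age_s = max_age_s  # assumed threshold, tune per environment
        self.last_seen = {}

    def record(self, host, now=None):
        # Called whenever a keep-alive arrives from a host
        self.last_seen[host] = time.time() if now is None else now

    def overdue(self, now=None):
        # Hosts whose last keep-alive is older than the allowed age
        now = time.time() if now is None else now
        return sorted(
            host
            for host, seen in self.last_seen.items()
            if now - seen > self.max_age_s
        )
```

For example, after tracker.record("web01") the central server would call tracker.overdue() on a schedule and raise an event for every host name it returns. Note that a host which has never sent a heartbeat is invisible to this sketch, so a real deployment would also need an inventory of expected hosts.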


OS probe
This is something that remotely "probes" the monitored host from a central server to test whether the host is responding, for example by remotely issuing a command and waiting for a response.

Pros
Manageability - This can be managed remotely on a central server without distributing code out to the monitored hosts.

Faster than heartbeat - depending on how it's implemented, this solution will most likely run faster than a traditional heartbeat type solution.

Flexibility - as with a heartbeat monitor, a probe can be customized to gather information from the monitored host that a simple ping cannot.

Cons

Harder to implement - like a heartbeat solution, the probes have to be developed, and depending on how many OS flavors you are monitoring, multiple probes may have to be created.

Slower - like heartbeat, a probe will most likely be slower than a ping because you are waiting for the host OS to respond to a remote command. Host system resources will also impact speed.
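
A probe of the "issue a command and wait for a response" kind could be sketched as below. This is only one possible approach: it assumes key-based SSH access to a Unix-like host and uses uptime as the remote command, and both function names are placeholders of mine.

```python
import subprocess


def probe_command(command, timeout_s=10):
    """Run a command and report (ok, detail): ok means exit code 0 within the timeout."""
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return False, "timed out"
    except OSError as exc:
        # For example, the ssh client is not installed on the monitoring server
        return False, str(exc)
    return result.returncode == 0, result.stdout.strip()


def ssh_probe_command(host):
    """Build a remote 'uptime' command line (assumes key-based SSH login)."""
    # BatchMode=yes makes ssh fail instead of waiting on a password prompt
    return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5", host, "uptime"]
```

A central server would then call probe_command(ssh_probe_command("web01")) for each monitored host. Because the answer comes from a process started by the host OS, a success says more than a ping reply does, and the captured output can carry extra information such as load averages.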


Summary

Beyond these "top three" host availability monitoring solutions, there are also general considerations. None of them tells you whether a network device is down upstream from the host, making the host "unreachable" rather than "unresponsive". To achieve this level of monitoring, events from your network monitoring and host availability monitoring solutions have to be correlated at a higher level, such as TEC. The good news is that this can be done, but not without some major work.
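
The core of that correlation can be shown with a toy example. The sketch below is not how TEC or any other product does it; it simply assumes you already have a host check result, a map from each host to its upstream network device, and the status of those devices, all under hypothetical names.

```python
def classify_hosts(host_status, upstream_of, device_status):
    """Label each host 'up', 'unreachable' (upstream device down) or 'unresponsive'."""
    results = {}
    for host, ok in host_status.items():
        if ok:
            results[host] = "up"
        elif not device_status.get(upstream_of.get(host), True):
            # The host check failed, but so did the device in front of it
            results[host] = "unreachable"
        else:
            # The path looks healthy (or is unknown), so suspect the host itself
            results[host] = "unresponsive"
    return results
```

With host_status = {"web02": False, "db01": False}, upstream_of = {"web02": "sw1", "db01": "sw2"} and device_status = {"sw1": True, "sw2": False}, the function labels web02 as "unresponsive" and db01 as "unreachable". The major work in practice is keeping the topology map accurate and handling more than one hop.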

All of this depends on your specific host availability monitoring requirements. Maybe you find that a combination of these fits your requirements and is most efficient, such as a remote ping followed by a remote OS probe.
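
One way to picture that kind of combination is a two-stage check: the cheap ping runs first, and the more expensive OS probe runs only when the ping succeeds. The function and its return strings are hypothetical, and the ping and probe arguments stand in for whatever checks you actually use.

```python
def check_host(host, ping, probe):
    """Run the cheap check first and the expensive one only if it passes."""
    if not ping(host):
        return "no ping reply"
    if not probe(host):
        return "pings but OS not responding"
    return "available"
```

The second result is the case a ping-only monitor misses: the network side answers while the host OS does not.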

Hope this article helps. Reply to this article with your host availability monitoring thoughts and experiences.

-Justin


