Wednesday 19 January 2011

Too many Servers, Not enough eyes - where did all these servers come from?! (3 of 9)

Recently, I had the opportunity to review a report that is regularly put together for senior management at a large financial institution. Nothing in the report references business metrics – metrics that would be of interest (and make sense) to the management of business units and would be useful to the analysts and capacity planners in predicting future server performance and purchasing needs.
In order for large companies to leverage the sheer quantity of performance information available, a reporting methodology must be developed so that analysts can be alerted quickly and accurately when meaningful performance thresholds are exceeded.
Unlike system administrators, capacity planners and performance analysts do not necessarily need to know about, or respond to, every breach of a real-time alert threshold.
Generally, only three kinds of breach interest the performance analyst: severe breaches, series of breaches that indicate sustained performance problems, and breaches with potential capacity or long-range performance implications.
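To make the distinction concrete, here is a minimal sketch of how one metric on one server might be screened for the analyst. The function name, the 80% and 95% thresholds, and the four-interval run length are all assumptions chosen for illustration, not values from the report discussed above.

```python
from typing import Optional, Sequence


def classify_breaches(samples: Sequence[float],
                      threshold: float = 80.0,   # assumed alert threshold (%)
                      severe: float = 95.0,      # assumed "severe" level (%)
                      min_run: int = 4) -> Optional[str]:
    """Return 'severe', 'sustained', or None for one series of samples.

    A single sample at or above `severe` is reported immediately.
    Otherwise the series is reported only if it stays above `threshold`
    for at least `min_run` consecutive samples.
    """
    longest = run = 0
    for value in samples:
        if value >= severe:
            return "severe"
        # Extend the current run of breaches, or reset it on a normal sample.
        run = run + 1 if value > threshold else 0
        longest = max(longest, run)
    return "sustained" if longest >= min_run else None
```

With these assumed settings, isolated spikes such as `[50, 85, 60, 82, 70]` are ignored, while four or more consecutive intervals above the threshold are flagged as sustained.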
Rather than making analysts wade through hundreds or thousands of reports each day, exception-based performance and trend reporting puts the proper level of attention on the systems and applications that need it.
Hence, performance and capacity alerting need not happen in real time. Capacity planners and performance analysts are not intended to carry pagers and react at a moment’s notice, not if they are truly fulfilling the traditional capacity and performance management function. When a performance analyst also tries to be the system administrator, there is plenty of evidence that they get trapped in fire-fighting and are never given the time to prevent the fires in the first place.
Those real-time alerts may be set with the guidance of the performance analyst, but they are intended to be handled by the system administrator or, ideally, by automatic-correction scripts that kill runaway processes, fail over systems, or load-balance applications so that users get improved response time without manual intervention.
So, ideally, a methodology should be used where performance and capacity alerts are generated at report time and sent to the analysts in such a way that detailed analyses can be targeted and completed – in order to solve current performance problems and prevent future ones.
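As a rough sketch of what report-time alerting could look like, the code below fits a straight line to each server's daily peak utilization and lists only the servers projected to reach a limit within a planning horizon. Everything here is an assumption for illustration: the function names, the 90% limit, the 30-day horizon, the server names, and the choice of a simple least-squares trend. A real capacity tool would likely use more careful forecasting.

```python
from typing import Dict, List, Optional, Sequence


def days_until(samples: Sequence[float], limit: float) -> Optional[float]:
    """Project how many days after the last sample a linear trend hits `limit`.

    Samples are assumed to be one value per day. Returns None when there
    are too few samples or the trend is flat or falling.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing = (limit - intercept) / slope  # day index where the line meets limit
    return max(0.0, crossing - (n - 1))     # days beyond the most recent sample


def exception_report(daily_peaks: Dict[str, Sequence[float]],
                     limit: float = 90.0,          # assumed capacity limit (%)
                     horizon_days: float = 30.0    # assumed planning horizon
                     ) -> List[str]:
    """Return one line per server whose trend reaches `limit` within the horizon."""
    lines = []
    for server in sorted(daily_peaks):
        remaining = days_until(daily_peaks[server], limit)
        if remaining is not None and remaining <= horizon_days:
            lines.append(
                f"{server}: projected to reach {limit:.0f}% "
                f"in about {remaining:.0f} days"
            )
    return lines
```

Run once per reporting cycle, a routine like this produces a short list for the analyst to investigate, and servers with flat or falling trends never appear in it.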
A prerequisite for such a methodology is that the organization understands which reports to generate and which performance metrics should have alert thresholds in place in order to provide quick value to management, administrators, and analysts. In my experience, this prerequisite is missed fairly often, and in a big way.
To download the full version of this blog visit http://www.metron-athene.com/_downloads/_documents/papers/too_many_servers.pdf

Rich Fronheiser
Consultant
