Monday 31 January 2011

Too many servers, not enough eyes - where did all these servers come from?! (8 of 9)

Business-critical and high profile applications typically receive the most attention from the capacity planners and performance analysts. While those applications will continue to receive a lot of analyst and modeler attention, remaining modeler cycles can be targeted to those systems that appear to need detailed analysis.
As a result of building and viewing trend exception reports and trend alerts, data center and business unit management can better prioritize modeling efforts to target the systems that are likely to require upgrades or the shifting of workloads in order to meet acceptable service levels in the future.
The process of analytic modeling includes selecting appropriate modeling intervals, building baseline models, calibrating the models, and developing what-if scenarios. Many existing books and CMG papers have been written on the subject.
Modeling, however, is not simply a matter of applying a favored modeling tool to a randomly chosen set of data. The key pieces of applying modeling techniques are knowing what modeling can (and cannot) provide and having available in-depth knowledge of the application, along with a business metric of interest that will be the focal point of the modeling study. I've trained many modelers using multiple modeling tools and I always have the same message - a tool will never replace your knowledge, your experience, your "feel" for how to complete a study. Performance management and capacity planning are a mindset, and successful modelers have that mindset - knowing what information is important and what isn't, and knowing how much fine-tuning the modeling process really needs. A reasonably accurate answer in a couple of hours is in many instances preferable to a pinpoint-accurate answer that takes a few days or even a few weeks.
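As a rough illustration of the kind of what-if arithmetic that sits behind such a study, here is a minimal sketch (my own hypothetical numbers and a simple single-queue approximation, not any particular vendor tool) showing how estimated response time degrades as transaction volume grows:

# Minimal M/M/1 what-if sketch: how does response time change as
# transaction volume grows?  Numbers are hypothetical and the model is a
# deliberate simplification of what a calibrated analytic model would do.

def mm1_response_time(arrival_rate, service_time):
    """Average response time (seconds) for a single M/M/1 queue."""
    utilization = arrival_rate * service_time
    if utilization >= 1.0:
        return float("inf")           # demand exceeds capacity
    return service_time / (1.0 - utilization)

service_time = 0.020                   # 20 ms of service per transaction (assumed)
baseline_rate = 30.0                   # transactions per second today (assumed)

for growth in (1.00, 1.25, 1.50, 1.60):
    rate = baseline_rate * growth
    rt = mm1_response_time(rate, service_time)
    print(f"{growth:.0%} of baseline volume -> "
          f"utilization {rate * service_time:.0%}, "
          f"response time {rt * 1000:.1f} ms")

Even this toy model shows why trend lines on utilization alone understate the pain: the last few percentage points of utilization cost far more response time than the first.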
Rich Fronheiser
Consultant

Friday 28 January 2011

Too many servers, not enough eyes - or where did all these servers come from?! (7 of 9)

The main purpose of including trending in the reporting and analysis structure is to alert the analysts and the business unit that the potential for performance problems exists, based on prior trends. Analysts must choose metrics where linear trending is appropriate: linear behaviors such as utilizations are great candidates for trending, while non-linear metrics where contention and queuing have a much bigger impact, such as response times and throughput, are better candidates for analytic modeling.
In general, analytic models are far more accurate when looking at an overall system or application than trending, but a staff of a few people cannot adequately build, maintain, and update models for hundreds or thousands of systems and applications. Building, editing, calibrating, and evaluating models is a very interactive process and not one that lends itself well to automation. And because of the size of today’s data centers and the relatively small size of the analyst staffs, a level of automation is necessary in order to reduce the number of servers and applications down to a reasonable size for detailed analysis and modeling.
Alerts should be sent far enough in advance so that the analysts can easily analyze data, refine workload characterizations, and build models that can be used to plan needed upgrades or workload moves.
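As a minimal sketch of that kind of advance warning (invented daily utilization figures and a simple least-squares fit), the following estimates how many days remain before a trended CPU utilization crosses an alert threshold:

# Sketch: fit a linear trend to daily peak CPU utilization and estimate
# when it will cross an alerting threshold.  The data points are invented
# for illustration; real trend sets would span weeks or months.

daily_peak_cpu = [52, 54, 53, 56, 58, 57, 60, 61, 63, 64]   # percent, days 0..9
threshold = 85.0                                            # alert threshold, percent

n = len(daily_peak_cpu)
days = range(n)
mean_x = sum(days) / n
mean_y = sum(daily_peak_cpu) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(days, daily_peak_cpu))
         / sum((x - mean_x) ** 2 for x in days))
intercept = mean_y - slope * mean_x

if slope > 0:
    crossing_day = (threshold - intercept) / slope
    print(f"Trend is {slope:.2f} points/day; {threshold:.0f}% reached around "
          f"day {crossing_day:.0f} - alert the analyst well before then")
else:
    print("No upward trend - no threshold crossing predicted")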
Workloads and hardware configurations can change quite rapidly for both business and technological reasons and leave long-term trend reports quite meaningless. Once trend reports are built, an effort should be made to normalize the metrics trended so that changes to architecture or the addition of large chunks of end-users won’t skew the trended data.
This could be accomplished by reporting performance numbers that incorporate the number of users (CPU seconds per user, for example), or by incorporating a common hardware benchmark that can be used to normalize transactions across systems of different sizes.
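A minimal sketch of the per-user normalization idea (the figures are invented; a published benchmark rating for each box could be used in place of the user count when normalizing across different hardware):

# Sketch: normalize raw CPU consumption by business volume so that the
# trend survives a jump in the user population.  Figures are invented.

samples = [
    # (day, cpu_seconds_consumed, active_users)
    (1, 18_000, 400),
    (2, 18_900, 420),
    (3, 27_500, 610),   # a large block of users migrated onto the system
    (4, 28_200, 625),
]

for day, cpu_seconds, users in samples:
    print(f"day {day}: {cpu_seconds:6d} CPU s, {users} users, "
          f"{cpu_seconds / users:5.1f} CPU s per user")

# Raw CPU seconds jump sharply on day 3, but CPU seconds per user stay
# roughly flat - the application itself has not become more expensive,
# so a trend line on the normalized metric remains meaningful.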
Typically, the best way to normalize workloads from one environment to another is via an analytic modeling tool and the "what-if" analysis it provides. One way to improve the world of trend reporting and alerting would be a vendor tool that offers "intelligent" trending - applying these "what-if" adjustments to trend sets so that many existing trend reports remain valid even when configurations and workload levels change in the system or application.

Rich Fronheiser
Consultant

Wednesday 26 January 2011

Too many servers, not enough eyes - or where did all these servers come from?! (6 of 9)

Choosing the right data to trend
Building trend reports and trend alerts must include a detailed analysis of the business patterns, drawn both from the data collected and from talking with the business units.
Include peak periods in your trend sets, but exclude data that would undermine the benefits of trending - days when the server may sit idle, such as holidays or weekends, and activity that has nothing to do with the application itself. Try to identify and allow for seasonality to derive the underlying trend - a trend that may need to be prepared for separately from other trending efforts.
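A small sketch of that kind of filtering (a hand-maintained holiday list and invented daily figures, assuming one sample per calendar day):

# Sketch: drop weekends and known holidays from a trend set so that idle
# days do not flatten the underlying business-day trend.  The holiday
# list and utilization figures are illustrative only.

from datetime import date

holidays = {date(2011, 1, 17)}           # e.g. a public holiday

samples = {                              # daily peak CPU utilization, percent
    date(2011, 1, 14): 71.0,             # Friday   - busy
    date(2011, 1, 15): 9.0,              # Saturday - idle
    date(2011, 1, 16): 8.5,              # Sunday   - idle
    date(2011, 1, 17): 10.0,             # holiday  - idle
    date(2011, 1, 18): 73.5,             # Tuesday  - busy
}

trend_set = {
    day: value for day, value in samples.items()
    if day.weekday() < 5 and day not in holidays   # Monday=0 .. Friday=4
}
print(sorted(trend_set.items()))         # only the two busy weekdays remain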
Take advantage of workload characterizations – building trend sets of metrics reflecting just the applications running on the system can provide a much more granular view and can show the analyst and management much more quickly when there are changes in application volumes.
And finally, seek out metrics coming from the business units themselves. Many business units (or even the applications that they run) can provide useful business numbers, for example, the number of transactions per hour, the number of web hits, or the number of telephone calls handled by the customer service function.
Trending those values along with performance metrics and making a connection between the numbers the business people can understand and the numbers the analysts understand can potentially reduce some of the disconnect between the data center and the business units.
Instead of reporting that performance will suffer once a certain trended metric or set of metrics hit certain performance values, try to report back to the business units in the business terms they understand, such as: “Performance will start to suffer in the XYZ application once you handle more than 280 calls/hour.” Business unit management will not only understand better, but they will also be able to see the urgency of ensuring that capacity is adequate for that eventual business load.
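As a hedged sketch of how such a statement might be derived (invented measurements and a simple least-squares fit of CPU utilization against call volume, solved for the call rate at which an assumed utilization ceiling is reached):

# Sketch: relate a business metric (calls handled per hour) to CPU
# utilization, then express the capacity limit in business terms.
# Measurements and the 75% ceiling are invented for illustration.

observations = [    # (calls_per_hour, cpu_utilization_percent)
    (120, 34), (150, 41), (180, 49), (210, 57), (240, 65),
]

n = len(observations)
mean_x = sum(c for c, _ in observations) / n
mean_y = sum(u for _, u in observations) / n
slope = (sum((c - mean_x) * (u - mean_y) for c, u in observations)
         / sum((c - mean_x) ** 2 for c, _ in observations))
intercept = mean_y - slope * mean_x

ceiling = 75.0      # utilization above which response time is assumed to suffer
calls_at_ceiling = (ceiling - intercept) / slope
print(f"Performance will start to suffer at roughly "
      f"{calls_at_ceiling:.0f} calls/hour (CPU reaches {ceiling:.0f}%)")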
To download the full version of this blog visit http://www.metron-athene.com/_downloads/_documents/papers/too_many_servers.pdf

Rich Fronheiser
Consultant

Monday 24 January 2011

Too many servers, not enough eyes - where did all these servers come from?! (5 of 9)

Linear trending, when applied to metrics where trending makes sense and in combination with effective filtering, reporting, and alerting, can be used to narrow hundreds or thousands of servers and applications down to the manageable subset where detailed analysis and modeling are required.
Typically, most problems with linear trending have to do with how trending is used by the analyst, and the way that the data is chosen and put into a trend set. It may be beneficial to talk a bit about trending and setting up an effective trend set.
Choosing poor metrics for trending can doom the effort of setting up trend reports and trend alerts.
For example, trending the amount of file space free on a physical or logical disk can be a useful task in many cases (such as on a volume that stores customer records in a database); however, if large chunks of data are moved to and from the disk at irregular intervals, the trended value could be fairly meaningless – the file system could be almost full but because of the irregular nature of the moving of data, the trend might be downward.
For example, at a company I used to work for, the capacity planning group had a file system dedicated to storing performance data. The data collection technology we used was supposed to clean up the file system after the data had been transferred back to a central data repository.
When the routine didn’t clean up the data adequately, the file system would fill and an analyst would have to manually clean up the data from the remote systems.
There would be no way to build a useful trend set for such a process and alert based on trends, as the file system would fill within a week and once cleaned up, would be almost entirely empty.
Many metrics like this, though used by some analysts in trend reporting and analysis, are better served by real-time performance alerts with meaningful thresholds that notify administrators and analysts of potentially poor conditions.
Looking back at the company from a decade ago, trends were built on CPU utilization for all of the systems on the floor. Many of these systems ran applications with batch jobs during the night. Yet the analysts still put together graphs and charts indicating when a trended CPU utilization number exceeded a certain threshold. On systems running large amounts of batch (or backup) work, this exercise is meaningless at best and misleading at worst. Typically, during the batch window, management and analysts are more interested in throughput and the elapsed length of the batch window.
Rather than trending CPU Utilization, trending the length of the batch (or backup) window would be a more useful measure.
Since performance concerns dictate that batch (or backup) jobs run during the night complete before interactive, transaction-processing work begins in the morning, knowing when those batch jobs will interfere with other users is quite useful – and an appropriate use of trending.
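A minimal sketch of that idea (invented nightly window lengths and a simple least-squares fit), projecting when the batch window will run into the start of the online day:

# Sketch: trend the nightly batch window length and estimate when it will
# run past the start of the online day.  The window lengths and the
# seven-hour limit are invented for illustration.

batch_window_hours = [4.1, 4.2, 4.3, 4.5, 4.6, 4.8, 4.9]   # last seven nights
hours_until_online_day = 7.0       # batch kicks off at 01:00, online day at 08:00

n = len(batch_window_hours)
mean_x = (n - 1) / 2
mean_y = sum(batch_window_hours) / n
slope = (sum((i - mean_x) * (y - mean_y) for i, y in enumerate(batch_window_hours))
         / sum((i - mean_x) ** 2 for i in range(n)))
intercept = mean_y - slope * mean_x

nights_until_overrun = (hours_until_online_day - intercept) / slope
print(f"Window is growing by about {slope * 60:.0f} minutes per night; "
      f"it will collide with the online day in roughly "
      f"{nights_until_overrun:.0f} nights")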
To download the full version of this blog visit http://www.metron-athene.com/_downloads/_documents/papers/too_many_servers.pdf

Rich Fronheiser
Consultant

Friday 21 January 2011

Too many servers, not enough eyes - where did all these servers come from?! (4 of 9)

Before leaving the area of performance alerting and delving into trending and analytic modeling, let's consider another important use of alerts: "capacity alerting."
Most analysts and administrators focus intently on applications that are performing poorly or have the potential to do so in the near-term – as they should. But many I/S divisions are focusing on cost-containment and are trying to return as much value to the business units as possible.
Alerts generated when a system is likely underutilized can point management to opportunities for server consolidation, potentially reducing hardware, software, and support costs as well.
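A hedged sketch of what such an alert might look like (the threshold, server names, and figures are all invented), flagging servers whose peak utilization never approaches a consolidation threshold:

# Sketch: flag persistently underutilized servers as consolidation
# candidates.  The threshold and figures are invented; in practice the
# check would run over a sustained period, such as a month of daily peaks.

CONSOLIDATION_THRESHOLD = 20.0        # percent peak CPU

servers = {                           # daily peak CPU utilization per server
    "app01": [12, 14, 11, 9, 13],
    "app02": [55, 61, 58, 63, 60],
    "app03": [7, 8, 6, 9, 7],
}

for name, peaks in servers.items():
    if max(peaks) < CONSOLIDATION_THRESHOLD:
        print(f"{name}: peak CPU never exceeded {max(peaks)}% over "
              f"{len(peaks)} days - consolidation candidate")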
Building such alerts may not be popular with everyone in the company – many professionals find servers that are over-configured and performing well to be an acceptable, if not exactly ideal, situation.
However, many of those professionals are judged on how well the application and system performs, not how well the overall capacity and dollar outlays are managed and controlled.
Since capacity planning is typically defined as providing "consistent and acceptable service at a controlled cost," it is just as important to focus on controlling the cost of too much spare capacity as it is on providing consistent and acceptable levels of service.
If this is not clear within the typical IT department, there is no doubt that other departments within the corporation will resent money being spent on redundant equipment rather than other investments they have on their list.

Rich Fronheiser
Consultant

Wednesday 19 January 2011

Too many servers, not enough eyes - where did all these servers come from?! (3 of 9)

Recently, I had the opportunity to review a report that is regularly put together for senior management at a large financial institution. Nothing in the report references business metrics – metrics that would be of interest (and make sense) to the management of business units and would be useful to the analysts and capacity planners in predicting future server performance and purchasing needs.
In order for large companies to leverage the sheer quantity of performance information available, a reporting methodology must be developed so that analysts can be alerted quickly and accurately when meaningful performance thresholds are exceeded.
Unlike system administrators, capacity planners and performance analysts are not necessarily interested in knowing or responding every time a real-time alert threshold has been breached.
Generally, only severe threshold breaches, series of breaches that indicate sustained performance problems, or breaches with potential capacity or long-range performance implications are of interest to the performance analyst.
And rather than having to wade through hundreds or thousands of reports each day, exception-based performance and trend reporting is necessary to put the proper level of attention on these systems and applications.
Hence, performance and capacity alerting need not happen in real time. Capacity planners and performance analysts are not meant to carry pagers and react at a moment's notice - not if they are truly fulfilling the traditional capacity and performance management function. When a performance analyst also tries to be the system administrator, there is plenty of evidence that he gets trapped into fire-fighting without the time being made available to prevent the fires in the first place.
Those real-time alerts may be set with the guidance of the performance analyst, but they are intended to be responded to by the system administrator or, ideally, by automatic-correction scripts that kill runaway processes, fail over systems, or load-balance applications in a way that brings users improved response time without manual intervention.
So, ideally, a methodology should be used where performance and capacity alerts are generated at report time and sent to the analysts in such a way that detailed analyses can be targeted and completed – in order to solve current performance problems and prevent future ones.
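A small sketch of report-time exception alerting (the thresholds, server names, and data are invented): rather than paging on every spike, the overnight reporting run raises an analyst alert only when breaches look sustained:

# Sketch: report-time (not real-time) exception alerting.  Instead of
# paging on every spike, raise an analyst alert only when a threshold has
# been breached on a sustained basis.  Thresholds and data are invented.

THRESHOLD = 80.0          # percent daily peak CPU
BREACH_DAYS = 3           # breaches required within the window
WINDOW = 5                # most recent days examined

daily_peaks = {           # produced by the overnight reporting run
    "web01": [62, 66, 84, 70, 68],   # a single spike: the administrator's problem
    "db01":  [81, 83, 79, 86, 88],   # sustained breaches: the analyst's problem
}

for server, peaks in daily_peaks.items():
    breaches = sum(1 for p in peaks[-WINDOW:] if p > THRESHOLD)
    if breaches >= BREACH_DAYS:
        print(f"EXCEPTION: {server} exceeded {THRESHOLD:.0f}% on {breaches} of "
              f"the last {WINDOW} days - schedule detailed analysis")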
A prerequisite for such a methodology is that the organization understands which reports to generate and which performance metrics should have alert thresholds in place in order to provide quick value to management, administrators, and analysts. And based on my experiences, this prerequisite is one that is fairly often missed in a big way.
To download the full version of this blog visit http://www.metron-athene.com/_downloads/_documents/papers/too_many_servers.pdf

Rich Fronheiser
Consultant

Monday 17 January 2011

Too many servers, not enough eyes - where did all these servers come from?! (2 of 9)

Many distributed environments have hundreds or thousands of servers, not dozens. And some applications, or suites of applications, have dozens of servers.
Many environments have large clusters, or pools, of servers for added computing power, failover, or load balancing purposes.
Data centers now have hundreds or thousands of employees charged with writing homegrown applications, administering server farms, and managing projects of increasing complexity.
I recently worked in an environment with almost a thousand distributed servers, with about 75% of those in the production tier.
At least 300 of those servers were various flavors of Unix, with the rest running various releases of Microsoft Windows.
Many of the applications in that environment were distributed across multiple tiers, including the mainframe, and many were complex, web-based applications that required a lot of specialized knowledge to administer.
So, considering the huge increase in the number of servers and the increasing complexity of the distributed applications, the number of analysts responsible for daily reporting, analysis, and capacity planning changed as well.
Indeed, the number decreased – from 5 to 4. Instead of each analyst being responsible for about ten servers, each analyst was responsible for hundreds of servers and dozens of applications.
Even with report automation, today’s analysts are not able to dedicate full days to analyzing performance data considering the number of systems and applications. So even though detailed performance reports are automated in large environments, in many cases customer complaints and telephone calls from perplexed system administrators are the catalysts for detailed analyses.
Undetected problems such as memory leaks and looping processes can lurk for weeks or months until they manifest as response time or throughput issues noticeable to end users.
Once called, the performance analyst can drill down and find the problem within minutes, but that doesn’t help keep the problem from occurring.
The disconnect between the data center and the business units still exists in many environments, and is most frequently and best seen in reports that are prepared for management.

More to come on Wednesday - keep following.....
To download the full version of this blog visit http://www.metron-athene.com/_downloads/_documents/papers/too_many_servers.pdf

Rich Fronheiser
Consultant

Friday 14 January 2011

Too many servers, not enough eyes - where did all these servers come from?! (1 of 9)

This series is based on my experiences in formal Capacity Planning and Performance Management departments - starting with a large company that had few servers and many analysts and concluding with an even larger company that better reflects the title. I'm going to outline the methodology that allows analysts and planners to target their limited resources at the systems (and, more importantly, the applications) that require the most attention as the number of servers grows and the number of analysts decreases.
Follow my series, which starts today..........


In the past decade, the size of server farms has increased dramatically, while the numbers of performance analysts and capacity planners have remained steady or decreased. In order to manage this environment and ensure that analysts look at the applications that either have or are likely to exhibit performance problems, there needs to be a plan to reduce the number of applications targeted for analysis that includes:

·         Automation of all processes as much as practical
·         Exploitation of computational intelligence to interpret results and do "smart" trending
·         Automated event management and alerting
·         Interactive analysis, for drill-down and model building
·         Analytic modeling of servers and applications
Building a methodology around these points will help understaffed data centers and overworked analyst staff make their workloads more manageable, provide proactive support to data center management, and more importantly, to the business units they ultimately support.
When business managers used to ask for details on why performance was poor, analysts would write highly technical documents that described servers in terms of resource consumption using metrics and terms that no business manager could possibly be expected to understand. For reasons more related to pride than understanding, the business managers asked few questions and generally accepted the explanations and recommendations of the technical staff, even though their real questions generally went unanswered.
Notice also that they referenced servers, not applications, and certainly not end-to-end user experience or response times. A decade ago, many of the applications running on distributed systems ran on one, or at most a few, servers. Rather than consider applications as a whole and the end-user experience, analysts would look at reports for individual servers and take a very large leap by theorizing that if all the servers had "reasonable" resource consumption numbers, the application likely had acceptable performance.
This server-focused mentality only added to the disconnect that existed between the data center management and the business unit management.


Rich Fronheiser  
Consultant

Monday 10 January 2011

VMware vSphere - what happens now?

So your company has invested in VMware and your Physical to Virtual conversions are either underway or complete.  So what happens now?  Just let them run and pray there are no performance or capacity issues? 

I'm hosting a new VMware vSphere "Now the Dust is Settling" webinar on January 13th, so why not come along to understand how Athene can assist you in answering such important questions as:

·         What CPU and Memory metrics should be monitored at both the Host and Guest level?
·         What does CPU Ready Time actually mean, and how does it affect the performance of a Virtual Machine?
·         VMware can overcommit Memory, but how, and what warning signs should I be looking for?
·         Guest Disk and Datastore Capacity: what can I do if I am running out of space?
·         VM Sprawl: how many of my VMs are Idle, Oversized or Undersized?

plus many more . . .


Hope to see you there

Jamie Baker
Consultant

Thursday 6 January 2011

Changing the face of Capacity Management solutions

Firstly, let me wish you a Happy New Year.... So what does the New Year have in store for Athene?

As ever, we'll be extending the range of metrics we provide. For example, there will be additional performance metrics for Oracle, z/VM and Exchange 2010. The icing on the cake, however, will be the introduction of Integrator.

But what is Integrator? It sounds like something from an Arnold Schwarzenegger movie, doesn't it?

Well, Integrator is part of our Cloud Computing strategy, along with other elements such as our collaboration with end-to-end transaction measurement specialists, Correlsense. With technologies such as virtualization and cloud computing high on every company's agenda, managing capacity has become a necessity. In a recent data center study, Gartner recommended using a single tool for all your capacity management requirements....but does such a tool exist?

Yes it does – our Athene software.  The exciting thing about the addition of Integrator to the Athene suite is that it enables the import of data, in any format, from any source, into the core Athene Performance Database (PDB). Yes, you read that right – you’ll be able to import data in any format, from any source.

The ability to do this is going to prove invaluable to those of you who are trying to effectively capacity manage complex environments spanning a mixture of internal, external and hybrid clouds alongside your physical and virtual estates – you can even add in your mainframe, networks, power.... 

What you can use to gather performance and capacity data will not always be under your control in the Cloud so something that can pull together any available data and integrate it with all your reporting and planning initiatives will be essential.

With Athene you will only need to use one tool for all your capacity management requirements.

Now for those of you who are interested in how it works....

The data to be imported is described by Integrator Templates, in which a hierarchy of keys and the data fields can be specified. This template, created using a browser-based interface, is then automatically converted to Athene Control Center templates, which are in turn processed into database tables, ready for use by applications within the Athene suite.
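Purely as an illustration of the general idea (this is not the Integrator interface itself - the template structure, field names and code below are invented), a key hierarchy plus data fields might map a flat file into records like this:

# Hypothetical sketch of the general "template" idea: key columns say where
# a record belongs in the hierarchy, data columns carry the metrics.
# This is NOT the Athene Integrator API; names and fields are invented.

import csv
import io

template = {
    "keys": ["datacenter", "server"],               # hierarchy, outermost first
    "data_fields": ["cpu_busy_pct", "mem_used_mb"],
}

raw = io.StringIO(
    "datacenter,server,cpu_busy_pct,mem_used_mb\n"
    "LON,web01,47.5,3100\n"
    "LON,db01,63.2,7800\n"
)

records = {}
for row in csv.DictReader(raw):
    key = tuple(row[k] for k in template["keys"])
    records[key] = {f: float(row[f]) for f in template["data_fields"]}

print(records)    # ready to be loaded into a performance database table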

With the data being integrated into the Athene PDB it can also be backed up as part of a standard SQL Server back-up schedule.

More to come on this later in 2011 – I’ll keep you posted....

Dave Watson
SVP, Development

Saturday 1 January 2011

Happy New Year from our CEO

Welcome back after the break.
I hope that you and your families have enjoyed the holidays and that everyone is refreshed and ready to face the year ahead.
We are continuing to expand our Development and Testing teams throughout 2011 and, along with the adoption of agile development, this is enabling the cross-fertilisation of ideas for Athene, resulting in an ever-adapting Capacity Management solution for you.
Which brings me nicely on to the fact that this year will see the release of Athene 8.7, with some exciting new features, and we'll be giving you more details on this soon.
Follow our blog in the coming weeks and months as there will be an array of interesting and engaging topics being discussed and more dates for our free interactive webinars.
It just remains for me to wish you all a happy and prosperous 2011.
Paul Malton
Chief Executive Officer