Friday, 31 October 2014

The Systems Development Life Cycle (SDLC) and ITIL (Mind the Gap series, 2 of 10)

Today I’ll be providing a brief review of the Systems Development Life Cycle (SDLC) and ITIL and their relationship.

Each side tends to view the universe from its own point of view, without due recognition of the other or of the need for coordination.

The SDLC is mentioned in ITIL V3, as is almost every aspect of alternative approaches to IT and Service Management.
The S in SDLC is described as system, software or even service by different authorities.

The description of the precise steps in any project will vary in detail, as will the many approaches to development outlined over the years. Early “waterfall” development was soon improved by increased prototyping and more iterative approaches, with focus enhanced by use of scrum and sprint teamwork.

SDLC introduces its own set of acronyms:

TOR – Terms of Reference

PID – Project Initiation Document (or sometimes PIC, where the C stands for Charter)

SPE – Software Performance Engineering

plus activities such as internal systems testing and external pilot site testing.

For sites that work this way, any project of more than a certain number of days has to be defined in a project-management, project-initiation style.
The deliverables used to establish a project are often called a PID or a PIC, or some other document containing terms of reference.

However, there is a need to map this project view into a matrix of management, since of equal importance to the service is the infrastructure view of applications, which is just as interested in the growth of an existing application as in the arrival of a new one.

The infrastructure view has gained more impact with the momentum of ITIL. The library is now well known, albeit at a superficial level. The history has been discussed in many places with different levels of authority and memory accuracy. The key things to remember are why it was introduced and to what purpose.

The “centre of IT expertise” for the UK government was aware in the 1980s of the increasing skills shortage in the public sector, and of the fact that for each new site they paid significant money to external consultants to provide what was effectively an operations manual for ITSM. So they gathered together a general description of good practice from a number of sources, with a view to publishing it at no profit.

If the same team had tackled it today it would probably be a free download off the web.

It was meant to be just a general description, independent of hardware, operating system,
software, database, network or any other variables.

As such it was a question of “take it or leave it, adopt and adapt at will” without the implied “correct” answers for which of many processes would tackle which activity within the detailed dataflow definitions for any one site.

It does now carry such a large revenue from foundation training and certification that a whole army of false prophets have raised it to a new gospel-like level in order to drive the material into new areas and new markets. Maybe fragmentation of interests will cause fragmentation in the deliverables…

Next Monday I'll start the week with a look at ITIL’s description of the capacity management process. In the meantime, why not sign up to be part of our Community and get access to downloads, white papers and more: http://www.metron-athene.com/_downloads/index.html


Adam Grummitt
Distinguished Engineer


Wednesday, 29 October 2014

Mind the Gap series (1 of 10)


There is an increased awareness of the need for governance of ITSM processes.

This is promulgated by the management consultancies offering extended audits to enterprises to check that their processes are reasonable in comparison with their peers and with good practice within their industry.

Invariably, such audits are more of a major attitude survey, based on extensive interviews with an arbitrary scoring method, than a detailed technical audit. People are given a mini-lecture on Capability Maturity Model Integration (CMMI) and then asked to rate themselves and their peers with an arbitrary score of 1 to 5.
These scores are faithfully recorded and aggregated and analysed and wondrous graphs produced.

Such surveys add to their own bad name by producing Kiviat diagrams showing scores of different aspects to unrealistic accuracy and with a false assumption that the academic attainment of all that is described in ITIL is necessarily a Good Thing in all cases.

Each activity varies for each service during its own life cycle, as well as in the light of changing company circumstances.

Each activity may or may not be appropriate for each service, server or whatever, depending on its importance to the business, its cost and its reliability.

What most sites already well immersed in IT Service Management (ITSM) find more valuable is a consultancy review of their actual processes, why they have made the decisions they have, and an indication of where there are risks, issues or dangers.
So rather than say a process is mature with a snapshot score of 3.571, it is more useful to say that although performance reports are published to the web, there are few hits on that site from outside the performance team members who publish it.
A manager is more inclined to say that, although the process is well in place, there is a need to discuss some of the known unknowns and also to reveal any unknown unknowns.

This blog is based on a number of such gap analyses. Although arbitrary scoring is eschewed, there is still a need to start with an outline template of what is generally considered to be Good Practice and that is usually based on ITIL as a starter with pragmatic experience adding a lot more practical detail.

It’s important that the company culture is recognised and that the objectives of the
service are appreciated before stating that any of the usual practices are vital for that site.

It seems remarkable at first that every site is so different in its detailed implementation of an IT development and infrastructure environment, but then each one is a living organism and has huge opportunities for variation.

For the purposes of this blog series, most of my analyses will finish with a SWOT (strengths, weaknesses, opportunities and threats) analysis and an outline of next steps.

The examples I’ve chosen to talk about are typical of the hard pressed retail sector where the economic downturn is having an impact on everything.
So, rather than capacity management defining what equipment is needed, it is more likely to be told that it will have 10% less money to provide the same services, including double the number of users (due to mergers and acquisitions), with the same hardware and 10% fewer staff!

Be frugal, do things just in time, but make sure that the mission critical services continue to be supported with high availability and good performance.

On Friday I’ll begin with a brief review of the Systems Development Life Cycle (SDLC) and ITIL and their relationship.

Adam Grummitt
Distinguished Engineer

Friday, 24 October 2014

What is wrong with this statement? We don’t need Capacity Management

Not so long ago I was talking with the CIO of a multi-billion financial organization.  They offered me the following statement about Capacity Management at their organization:
‘We don’t need Capacity Management.  We’ve just purchased more than enough capacity from IBM to see us through at least the next three years.’

They were talking about distributed systems: Unix and Windows, physical and virtual, not mainframe. 

As a lifelong worker in the field of Capacity Management, my heart sank (again!).  It was far from the first time I have heard this, and I know that talking a manager out of this approach is never easy. So I thought I would think again about why I thought this statement was so wrong. Here are the questions that came to my mind. 

What is wrong with this statement?

‘don’t need Capacity Management’
-   This could be true; perhaps anyone can get everything right by guesswork, but is ‘finger in the air’ an approach you would want to defend to your management or external audit?

‘purchased’
-   Have you got a ‘get out’, given changing IT models and purchasing scenarios: Cloud, SaaS…?

‘more than enough capacity’
-   How do you know?
-   Why not just buy enough, rather than ‘more than enough’?

‘from IBM’
-   I love ’em, but is selling this much capacity really the ‘partnership’ that larger vendors profess to offer?

‘to see us through’
-   Is this the same as ‘guarantee acceptable service levels aligned to changing business requirements to maximize our contribution to organizational effectiveness over a three-year period’?

‘at least the next three years’
-   How do you know it is enough for three years or more?
-   Could the money spent on capacity you won’t need for one or two years have been spent on something else that would bring more benefit to the business in the interim?
-   How much more kit could you have got for the same money if you had bought it when you required it, rather than up to three years before it is needed?

Of course there are two sides to every coin, so….

What is right with this statement?

-   Well, at least they had thought about capacity.
Making the statement suggests, no matter how alarmed I might be about the logic underpinning it, that they had done ‘Capacity Management’.  Even if it was just asking IBM what was needed or sticking a finger in the air and guessing, they were contradicting themselves: they had done some Capacity Management so they must have perceived a need for it.  My task was just to get them to think about the balance between what was wrong with their statement and what was right.


Are there any other ‘wrongs’ and ‘rights’ with the statement that you would like to offer?

Andrew Smith
CEO

Tuesday, 21 October 2014

How to save BIG MONEY with Capacity Management

In today’s tough economic climate the need for capacity management expertise is ever more important. Even with the relatively low cost of hardware, there are still considerable benefits for businesses in using solid capacity planning. 

I'm running a webinar tomorrow which looks at how a relatively short consultancy engagement was able to save big money in deferred purchases.

Metron Professional Services (MPS) were asked to perform a capacity planning exercise on a business-critical system for a large financial services organization. The organization was planning to replace the hardware with the latest generation (for this, read: very expensive) servers and the latest database technology.

MPS was involved from the very beginning of the project, from reviewing the impact of the upgrade to making recommendations on the hardware specification.

I'm going to be talking you through the project in detail and will be covering the following:
  • Data Capture
  • Problems encountered with load testing
  • How analytical modeling was used
  • The results and conclusion 
There’s still time to register for this event, so why not come along and get some ideas on how you could save money for your organization.

http://www.metron-athene.com/services/training/webinars/index.html 

Jamie Baker 
Principal Consultant

Tuesday, 14 October 2014

The wheat and the chaff – do you really need all those capacity metrics?

For a Capacity Manager or Analyst, there is a mass of data typically collected, but very little that ever needs to go forward to others.  At a component level, many organizations now have Capacity Management for thousands of servers, petabytes of storage and more, vested in the hands of just one Capacity Analyst.  Pulling all this together and selecting the ‘zingers’ to pass forward to management is a monumental challenge. 

One capacity study we were involved in as consultants covered the merger of two organizations each with revenue in excess of $5 billion.  Multiple data centers, thousands of servers and a request from management to evaluate their plans to merge systems and reduce the hardware estate.  Management wanted to know when in the migration plan there were pressure points where resources might be stretched and additional investment required.  Their preference was for this to be communicated within a broader presentation covering other issues, and be communicated in just one PowerPoint slide.  It’s not an unreasonable request: tell me when a decision needs to be made, and what that decision should be.

The quality and skill of those involved in Capacity Management never ceases to amaze me.  Answering demands like the one above is regularly achieved.  For us, the key was understanding the end point.  Once we knew what they wanted on that one PowerPoint slide, we could focus our efforts much more precisely on what data we needed to get and how we needed to use it.

We would of course recommend that the wheat be separated from the chaff through third party products like Metron’s athene®.   For many businesses, that is a step they have not yet taken.  Questions are addressed by patient manipulation of raw data and use of tools like Excel.  One organization we spoke to recently had Excel macros producing over 5,000 individual Excel spreadsheets every week, just for their UNIX systems.  Then there was Linux, VMware, Windows…..  

Whatever works is fine, although each approach has its risks. In particular, extensive home-grown routines are vulnerable when those who created them, and know how to maintain them, move on to pastures new. Third-party solutions at least have the benefit of external support, enabling easier transition of reporting and analysis regimes to new staff.
Books have been written on how to manage all this data.  Whatever the approach, there are a couple of key things that are worth remembering:

·         Always understand what the question is that you need to answer – capacity reporting isn’t an end in itself

·         Don’t waste time on irrelevant data – if you can’t see how something is going to be used, consider
o   Only capturing data that you know you need to use
o   Capturing all data you think you might need but only collecting/processing/reporting on data  you actually need now

One thought might be to take the popular approach to desk/office management.  If you have something on your desk that you haven’t referred to for six months or whatever time period you consider appropriate, get rid of it. 

For Capacity Management, if you have metrics stored that you haven’t looked at, stop processing those metrics into reports. You immediately save the time, effort and cost associated with processing those metrics and maintaining reports based on them. If you at least keep the captured data for a period without processing or reporting on it, as you can easily do with athene®, you can always process that data if and when a need subsequently arises.
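As a rough sketch of that “capture everything, report only what is used” idea, the Python snippet below keeps all captured samples but only builds reports for metrics whose reports someone has actually viewed recently. The metric names, the “last viewed” record and the six-month cut-off are illustrative assumptions, not anything taken from athene®.

from datetime import datetime, timedelta

# Hypothetical inventory: metric name -> captured samples (kept regardless of use).
captured = {
    "unix_cpu_busy_pct": [41.2, 57.8, 63.1],
    "unix_swap_rate": [0.0, 0.1, 0.0],
    "win_disk_queue_len": [1.4, 2.2, 0.9],
}

# Hypothetical record of when each metric's report was last looked at.
last_viewed = {
    "unix_cpu_busy_pct": datetime.now() - timedelta(days=12),
    "unix_swap_rate": datetime.now() - timedelta(days=400),   # nobody looks at this one
    "win_disk_queue_len": datetime.now() - timedelta(days=3),
}

REPORT_IF_VIEWED_WITHIN = timedelta(days=180)  # the "six months or so" rule of thumb

def metrics_worth_reporting(captured, last_viewed, cutoff=REPORT_IF_VIEWED_WITHIN):
    """Keep capturing everything, but only report on metrics with a recent audience."""
    now = datetime.now()
    return [m for m in captured if m in last_viewed and now - last_viewed[m] <= cutoff]

for metric in metrics_worth_reporting(captured, last_viewed):
    samples = captured[metric]
    print(f"{metric}: latest {samples[-1]}, average {sum(samples) / len(samples):.1f}")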


Andrew Smith
CEO

Friday, 10 October 2014

VMware Reports and Summary (17 of 17) Capacity Management, Telling the Story

As promised I’ll show you one final report on VMware, which looks at headroom available in the Virtual Center.

In the example below we’re showing CPU usage. The average CPU usage is illustrated by the green bars, the light blue represents the amount of CPU available across this particular host, and the dark blue line is the total CPU power available.


VMware – Virtual Center Headroom


We have aggregated all the hosts up within the cluster to see this information. 
We can see from the green area at the bottom how much headroom we have up to the blue line at the top, although in this case we will actually be comparing it to the turquoise area, as this is the amount of CPU available for the VMs. The difference between the dark blue line and the turquoise area is the overhead taken by the VMkernel, which has to be taken into consideration.
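To make the arithmetic behind that picture concrete, here is a minimal Python sketch of cluster headroom using entirely made-up host capacities, VMkernel overheads and usage figures; they stand in for the dark blue line, the turquoise area and the green bars respectively, and are not output from any particular tool.

# Each host: (total CPU capacity in MHz, estimated VMkernel overhead in MHz,
#             average CPU usage by the VMs in MHz) - all illustrative numbers.
hosts = [
    (41_600, 1_800, 18_000),
    (41_600, 1_800, 22_500),
    (48_000, 2_000, 30_000),
]

total_capacity = sum(cap for cap, _, _ in hosts)               # dark blue line
available_for_vms = sum(cap - ovh for cap, ovh, _ in hosts)    # turquoise area
average_usage = sum(use for _, _, use in hosts)                # green bars

headroom = available_for_vms - average_usage
print(f"Cluster capacity:    {total_capacity} MHz")
print(f"Available for VMs:   {available_for_vms} MHz")
print(f"Average usage:       {average_usage} MHz")
print(f"Headroom for growth: {headroom} MHz "
      f"({100 * headroom / available_for_vms:.0f}% of what the VMs can use)")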

To summarize my series, when reporting:

  • Stick to the facts
  • Keep it to an elevator talk
  • Show as much information as needs to be shown
  • Display the information appropriate for the audience
  • Talk the language of the audience

….Tell the Story

I hope you've found my series useful and you can get access to this and other white papers by joining our Community http://www.metron-athene.com/_downloads/index.html

Charles Johnson
Principal Consultant

Wednesday, 8 October 2014

VMware Reports (16 of 17) Capacity Management, Telling the Story

Today we'll take a look at some examples of VMware reports.

The first report below looks at the CPU usage of clusters in MHz. It is a simple chart and this makes it very easy for your audience to understand. 

VMware – CPU Usage all Clusters



You can immediately see which cluster is the biggest user of CPU: Core site 01.

The next example is a trend report on VMware resource pool memory usage.

VMware – Resource Pool Memory Usage Trend



The light blue indicates the amount of memory reserved and the dark blue line indicates the amount of memory used within that reservation. This information is then trended going forward, allowing you to see at which point in time the required memory is going to exceed the memory reservation.

A trend report like this is useful as an early warning system: you know when problems are likely to ensue and can do something to resolve them before they become an issue.
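As a hedged illustration of how such an early warning can be derived, the short Python sketch below fits a straight-line trend to some invented weekly memory usage figures and estimates when that trend crosses an assumed reservation; the real reports are produced automatically, so this is only to show the idea.

import numpy as np

# Illustrative weekly samples of resource pool memory usage in GB (not real data).
weeks = np.arange(12)
used_gb = np.array([46, 47, 49, 50, 52, 52, 54, 55, 57, 58, 60, 61], dtype=float)
reservation_gb = 72.0  # the assumed memory reservation for the pool

# Fit a straight-line trend to usage and project it forward.
slope, intercept = np.polyfit(weeks, used_gb, 1)

if slope <= 0:
    print("Usage is flat or falling; no breach is predicted from this trend.")
else:
    breach_week = (reservation_gb - intercept) / slope
    print(f"Usage is growing by about {slope:.2f} GB per week; the trend crosses the "
          f"{reservation_gb:.0f} GB reservation around week {breach_week:.0f}.")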

We need to keep ahead of the game, and setting up simple but effective reports, produced automatically, will help you to do this and to report requirements back to the business well in advance.

On Friday I’ll show you one final report on VMware, which looks at headroom available in the Virtual Center and then summarize my series.

Don't forget to register for our 'How to save big money with capacity management' webinar
http://www.metron-athene.com/services/training/webinars/index.html

Charles Johnson
Principal Consultant

Monday, 6 October 2014

Model – Linux server change & disk change (15 of 17) Capacity Management, Telling the Story

Following on from Friday, we would next want to show the model for a change in our hardware.

In the report below, the top left hand corner shows that once we reach the ‘pain’ point and then make a hardware upgrade, the CPU utilization drops back to within acceptable boundaries for the period going forward.



In the bottom left hand corner you can see from the primary results analysis that the upgrade would mean the distribution of work is now more evenly spread.

The model in the top right hand corner has brought up an issue with device utilization on another disk, so we would have to factor in an I/O change and see what the results of that would be, and so on.

In the bottom right hand corner we can see that the service level has been fine for a couple of periods and then it is in trouble again, caused by the I/O issue. Whilst this hardware upgrade would satisfy our CPU bottleneck, it would not rectify the issue with I/O, so we would also need to upgrade our disks.
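A back-of-the-envelope sketch of that effect is shown below in Python, with entirely assumed figures (baseline utilizations, a notional CPU speed-up factor and 10% workload growth per period); it is not the actual analytical model, but it illustrates why a CPU upgrade alone can leave the busiest disk breaching its threshold a few periods later.

# Assumed figures only: they illustrate the shape of the result, not the real model.
growth_per_period = 0.10        # workload growth per forecast period
cpu_util_baseline = 0.62        # baseline CPU utilization before the upgrade
disk_util_baseline = 0.45       # baseline utilization of the busiest disk
cpu_speedup = 1.8               # notional capacity gain from the hardware upgrade
threshold = 0.70                # alarm threshold used in the reports

for period in range(1, 7):
    growth = (1 + growth_per_period) ** period
    cpu_after_upgrade = cpu_util_baseline * growth / cpu_speedup
    disk_unchanged = disk_util_baseline * growth
    flag = "disk over threshold!" if disk_unchanged > threshold else "ok"
    print(f"period {period}: CPU {cpu_after_upgrade:.0%}, "
          f"busiest disk {disk_unchanged:.0%} ({flag})")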

When forecasting, modeling helps you to make and show recommendations on the changes that will be required and when they will need to be implemented.

On Wednesday I'll take a look at some examples of VMware reports and how we can get our message across using simple reports.

Charles Johnson
Principal Consultant

Friday, 3 October 2014

Modeling Scenario (14 of 17) Capacity Management, Telling the Story

I have talked about bringing your KPIs, resource and business data into a CMIS (Capacity Management Information System), and about using that data to produce reports in a clear, concise and understandable way.

Let’s now take a look at some analytical modeling examples, based on forecasts which were given to us by the business.

Below is an example for an Oracle box. We have been told by the business that it is going to grow at a steady rate of 10% per month for the next 12 months. We can model to see what the impact of that business growth will be on our Oracle system.

In the top left hand corner is our projected CPU utilization and on the far left of that graph is our baseline. You can see that over a few months we begin to go through our alarms and our thresholds pretty quickly.



In the bottom left hand corner we can see where bottlenecks will be reached, indicated by the large red bars which show CPU queuing.

On the top right graph we can see our projected device utilization for our busiest disk and we can see that within 4 to 5 months it is also breaching our alarms and thresholds.

Collectively these models are telling us that we are going to run into problems with CPU and I/O.

In the bottom right hand graph is our projected relative service level for this application. In this example we started the baseline off at 1 second, and this is key. By normalizing the baseline at 1 second it is very easy for your audience to see the effect that these changes are likely to have. In this case, once we’ve added the extra workload, we can see that we go from 1 second to 1.5 seconds (a 50% increase) and then jump from 1 second to almost 5 seconds. From 1 to 5 seconds is a huge increase and one that your audience can immediately grasp and understand the impact of.
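The general shape of those curves can be reproduced with a very simple queueing approximation. The Python sketch below uses the M/M/1 stretch factor 1/(1 - utilization) and assumed starting utilizations; it is only illustrative and is not the analytical model that produced the graphs.

# Assumed starting points; the real model works from measured baseline data.
baseline_cpu_util = 0.45        # starting CPU utilization
baseline_disk_util = 0.40       # starting utilization of the busiest disk
threshold = 0.70                # alarm threshold
monthly_growth = 0.10           # 10% workload growth per month

def relative_response(util, baseline_util):
    """Response time relative to baseline, using the M/M/1 stretch factor 1/(1 - util)."""
    return (1.0 / (1.0 - util)) / (1.0 / (1.0 - baseline_util))

for month in range(13):
    growth = (1 + monthly_growth) ** month
    cpu = baseline_cpu_util * growth
    disk = baseline_disk_util * growth
    if cpu < 1.0:
        level = f"relative service level {relative_response(cpu, baseline_cpu_util):4.1f}x"
    else:
        level = "saturated (queue grows without bound)"
    note = "  <-- over threshold" if max(cpu, disk) > threshold else ""
    print(f"month {month:2d}: CPU {cpu:5.0%}, disk {disk:5.0%}, {level}{note}")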

We would next want to show the model for change in our hardware and I'll deal with that on Monday.

In the meantime sign up to come along to our next free webinar 'How to save big money with capacity management' 
http://www.metron-athene.com/services/training/webinars/index.html

Charles Johnson
Principal Consultant


Wednesday, 1 October 2014

Business Metric Correlation (13 of 17) Capacity Management, Telling the Story

As mentioned previously, it is important to get business information into the CMIS to enable us to perform some correlations.

In the example below we have taken business data and component data, and we can now report on them together to see if there is some kind of correlation.




In this example we can see that the number of customer transactions (shown in dark blue) reasonably correlates with the amount of CPU utilization.

Can we make some kind of judgment based on just what we see here? Do we need to perform some further statistical analysis on this data? What is the correlation coefficient for our application data against the CPU utilization?
A value closer to 1 indicates that there is a very close correlation between the application data and the underlying component data.
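As a minimal sketch of that check, the Python snippet below computes Pearson’s correlation coefficient for some invented hourly figures for customer transactions and CPU utilization; the numbers are illustrative and are not drawn from the CMIS.

import statistics

transactions = [1200, 1850, 2400, 3100, 2900, 2300, 1600, 900]   # customer transactions per hour
cpu_util_pct = [22.0, 31.5, 40.0, 52.5, 49.0, 38.5, 27.0, 16.0]  # CPU utilization (%)

# Pearson's r (statistics.correlation is available from Python 3.10 onwards).
r = statistics.correlation(transactions, cpu_util_pct)
print(f"Correlation coefficient: {r:.2f}")

if r > 0.9:
    print("Very close correlation: CPU demand scales with customer transactions, so planned")
    print("growth in transactions can be translated into a likely CPU upgrade requirement.")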

What can we feed back to the business from this information?

An example would be: This graph indicates that there is a very close correlation between the number of customer transactions and the CPU utilization. Therefore, if we plan on increasing the number of customer transactions in the future we are likely to need to do a CPU upgrade to cope with that demand.

On Friday I'll take a look at some analytical modeling examples, based on forecasts given to us by the business and how we can use modeling to show the business the likely impact of forthcoming initiatives.

Charles Johnson
Principal Consultant