Monday, 24 November 2014

IT Capacity Management


Capacity Management has recently worked its way up the list of concerns that senior IT managers have. 
Gartner says, "By 2016, the availability of capacity and performance management skills for horizontally scaled architectures will be a major constraint or risk to growth for 80 percent of major businesses."

This renewed focus on Capacity Management has come about from the introduction of new technologies such as Virtualization and Cloud Computing. 

To achieve the promised cost objectives of Virtualization and Cloud infrastructure, Capacity Management needs to be in place and working properly. 

With traditional distributed architectures many companies have essentially “winged it” - this now comes with greater risks, to both doing business and achieving budgets, than it ever did before. If you're going to put all your eggs in one basket, it had better be big enough to hold them...

For all of you who have ever wondered what the aim of IT Capacity Management is, what the Capacity Manager actually does and how Capacity Management fits with other IT functions, I'm running a session to answer these questions and more.

The topics I’ll be covering include:
  • Goals of Capacity Management
  • How to implement Capacity Management
  • The Mechanics of Capacity Management
  • Where Capacity and Other Processes interact
Registration for this webinar, 'IT Capacity Management 101', is now open, so sign up to come along: http://www.metron-athene.com/services/training/webinars/index.html

Phil Bell
Principal Consultant

Wednesday, 19 November 2014

SWOT Analysis (Mind the Gap series 10 of 10)


I’ll conclude today by looking at how the reports shown in my previous blogs translate into a SWOT analysis.

The Strengths, Weaknesses, Opportunities and Threats (SWOT) analysis for each site is entirely local.



In this case, there are the skills, expertise and tools available, but resources are stretched across so many machines that matters are essentially ad hoc or reactive at best.  This is compounded by the project culture which does not encourage good infrastructure-wide processes.

The highlights of a study are site specific and need to be aimed at the management to try to establish a path forward. The highlights in this case were assessed as:



The summary below tries to identify the key items to address in the roadmap.  For each strategic target, tactical activities are identified.


The next move is to suggest some immediate next steps, to help focus the management team on practical solutions to address the worst of the gaps in the current process.

AHR Next steps = Review and improve:
  • Current Infrastructure processes and interactions
  • All 16 CMP Good Practice checklists in detail
  • Real benefits of VMware program
  • Pilot prototypes on top 2 services per domain:
      - CDB
      - Performance reports
      - Performance patterns and thresholds
      - Performance analyses and trends
      - Capacity plans
  • Pilot prototypes on top 2 servers per domain:
      - Workload characterization
      - Workload forecast scenarios
      - Resultant resource demand scenarios

This next piece relates back to the original maturity review and suggests current levels of implementation of the main ITIL capacity management objectives.  It then adds a suggested target level after an agreed project to address them, with some notes and caveats.


Over this series I’ve indicated the findings for five sites and discussed two of them in more detail, showing a sample selection of reports for one of them. 



The same approach and extended set of checklists can be used for every site, but the factors that make each site different will apply to a gap analysis project too, so every study will have its own character. 

I hope my couple of examples have given you an appreciation of the approach, and if you’ve got any questions feel free to ask. Don't miss our webinar this afternoon, 'Demystifying z/OS Capacity Management for Distributed Planners'; there's still time to register:
http://www.metron-athene.com/services/training/webinars/index.html

Adam Grummitt
Distinguished Engineer

Footnote: Adam's book "Capacity Management: Best Practice (ITSM Library)" is available online and in book stores http://www.amazon.co.uk/Capacity-Management-Best-Practice-Library/dp/9087535198/ref=sr_1_1?s=books&ie=UTF8&qid=1335950364&sr=1-1




Monday, 17 November 2014

Capacity Management Process deliverables (Mind the Gap series 9 of 10)


Staying with our study of the Retailer, I promised to look in further detail at the tools and Capacity Management Process deliverables and how well the site is using them to meet its needs.

This report relates the deliverables from the capacity management process to the tools available per target domain.


The first column in this case shows the various target domains, the second the tools available.  The last column shows the deliverables.  The arrows indicate which parts of the tools are exploited to meet the required deliverables.

In this site there is a wide mix of tools from different datacenters that have been merged slowly over the years.  So as well as supporting a number of different variants, there is also a central objective of “standardising”. This typically adds an extra dimension in terms of support requirements, which is an overhead until the new standard is fully established and the previous standards are retired.

The next reports reflect a more detailed analysis, starting with this review of the typical metrics available, with those that are collected at this site circled in green.


There is a potential issue with this report in that each platform tends to be somewhat different in its level of detail.  That is, most mainframe servers collect and store a wide array of SMF/RMF record types, whereas most UNIX servers will store a limited selection of statistics and Windows servers likewise.  However, there is often a reasonably consistent attitude to storage and exploitation of metrics across the domains at this level of detail.

This report reflects the level of data required for effective performance engineering and the need for good data flows with development and testing.



Again circles are used to indicate what happens at this mythical site.
This shows a typical list of reports required, with the related key metrics in the second column.  Once again green circles are used to highlight the areas addressed within this site.

A standard capacity plan template is used to highlight those areas that are incorporated in the site's internal capacity plans.


In this case it is clear that there are few formal capacity plans but within the capacity management team a number of key practical areas are assessed.  The lack of a formal report leads to a poor perception of the abilities of those involved which could readily be rectified by a more open and transparent reporting system.

On Wednesday I’ll be concluding by looking at how this all translates into a SWOT analysis.

Adam Grummitt
Distinguished Engineer


Friday, 14 November 2014

Typical report relating capacity management to other main ITSM processes (Mind the Gap series 8 of 10)

Staying with our Ad Hoc Retailer, the second typical report is the one relating capacity management to the other main ITSM processes.





Across the top are the three main sub-processes, resource/component, service and business capacity management.

Down the side is a list of all the related activities.

In each cell there is a standard description of typical relationships and their manifestations.

Again, traffic-light colors are used to indicate where there are issues and risks with appropriate local edits to each description.

Grey or black is used to indicate comments that are not significant at this site.

Red is clearly used for major issues.

Orange is used for issues that are of concern.

Off-yellow is used for areas where things are “on the edge” of becoming an issue.


In this case there are some significant areas in red – though some risks have been left in grey as the site has made conscious decisions as to their lack of local relevance.


A third report shows the data flows across ITSM processes with the effectiveness of interfaces between capacity management and other processes.

 


The first column describes what capacity management expects to receive from other processes.

The second column shows what other processes expect to receive from capacity management.

This is then addressed for availability management, then change management, configuration management, continuity, financial, incident, problem and so on.


These initial reports allow a preliminary view of the capacity management process to be discussed early in the project.  This initial deliverable is a useful test of the water to make sure that there are no surprises later.


At this sample site, an amalgam of real sites, development rules the roost and projects are well coordinated.  The IT infrastructure is viewed largely as an event management facility to ensure availability.  Performance targets are few and are essentially deadlines for batch completion overnight.


Initial findings in the Capacity Management Process were:

Monitoring and event management is in place
 
Performance and Capacity Management less so

Largely living off previous ‘fat’ in headroom

Essentially now reactive, based on firefighting

Ad hoc reactions to meet specific project needs

SDLC development driven rather than ITIL

No infrastructure or IT service delivery view

Project driven rather than process (need both)

Application oriented rather than service

Silo oriented, kingdoms, frontiers, lack of liaison

Little capacity or performance detail in Project Initiation Charter (PIC)

PIC tends to be sized only for new projects

Little workload review of growth or business need

SLA tracking and reporting mostly on availability

Total service and server views missing


The main concerns, identified in the interview process, were the lack of communication or awareness of other platforms or processes.  A silo oriented set of fiefdoms ruled each domain.


Sometimes such observations are already well known to those who asked for the project, but not always, so presentation of this initial tester often yields an interesting meeting.


Next week I’ll look in further detail at the tools and Capacity Management Process deliverables and how well they are used to meet their needs. Don't forget to sign up for our webinar 'IT Capacity Management 101':
http://www.metron-athene.com/services/training/webinars/index.html

Adam Grummitt
Distinguished Engineer


Thursday, 13 November 2014

Demystifying z/OS Capacity for Distributed Planners

Very few people can say that they've been involved with Capacity Management their entire career, but I can.  I was hired by a regional utility company about 20 years ago to be their planner for their new production platform called Unix.

No, Unix wasn't new then, either.  It was just the first time that company had trusted anything important on anything other than the mainframe.

To me, the mainframe has always been a bit mysterious.  I am well aware that the Capacity and Performance Management disciplines, techniques, and overall mindset were developed by people who knew and loved the mainframe.  It's not a secret to me -- companies that rely on mainframes tend to have the most mature Capacity Management processes and understand the need for proper planning on their distributed platforms, as well.

So our webinar on November 19 is exciting to me.  I've learned quite a bit about mainframes and those who love them in the 11 years I've been with Metron.  I know which SMF/RMF records are important to the planner, I'm aware of what athene® ES/1 brings the planner, and I'm well aware of the richness of the data that mainframe folks have always been blessed with.

And yet...

The mainframe is STILL mysterious to me.

Are you a Unix, Windows, or VMware person who would like to hear a bit more about mainframes and what's similar and different between them and distributed platforms?  Block an hour in your calendar on November 19 and come and listen to experienced mainframe and distributed capacity planner and Metron Principal Consultant Charles Johnson.  He'll talk about why mainframes are still around decades after people started predicting their demise and decades after people started comparing those who tend them to dinosaurs.

All I know is that I have a lot more respect for those people I met at the beginning of my career who appreciated the mainframe for what it could do...and still does.  Most of those people are winding down their careers now -- I hope there's a new generation of people willing to step up and learn...our webinar could be the first step for many.

Register for Metron's webinars here!

Rich Fronheiser
Chief Marketing Officer

Wednesday, 12 November 2014

The first of a number of sample reports for a typical retail IT datacenter (Mind the Gap series 7 of 10)

For the retailer, life was dominated by the economic downturn and so everything was ad hoc.


The Ad Hoc Retailer was a large enterprise with major data centers and a large IT staff, including maybe 1000 developers. However, it had a chequered history in recent years with mergers and new investors but an overall drop in market share, revenues and profit.


The net result was that its share value had halved in the previous year and staff morale was low. Every decision was taken in the light of potential job losses and the need to be frugal. Headroom that had been established in the past was now effectively taken up so that capacity was becoming a major issue.


The enterprise had an SDLC culture with PICs (Project Initiation Charters) for any task requiring more than one day's individual effort. However, there was little infrastructure leeway left, and there was little infrastructure management left in place. Previous processes had been dropped as staff had gone and more project work was being done by fewer people with more services on more servers.

There were ideas about a Service Catalogue and Service Level Agreements but they were emergent. There were many domains and also a history from different data centers so that there was a large number of what felt like separate kingdoms with barriers and lack of communication.

Some of the project meetings were the first time that capacity management people in different teams had met.

This is the first of a number of sample reports for a typical (but for the purposes of anonymity, mythical) retail IT datacenter.



The first few columns show the CMMI maturity levels and an indication of the population of sites that are in each; the state it reflects (active/reactive) and some ITSM symptoms that indicate the level attributes.

Then there is a column for the corresponding ITSM activities and the capacity management activities typical of each level.

The cells for each intersection show relevant features, with traffic-light colors used to indicate where there are risks and dangers associated with the current level of maturity. The traffic-light scheme is extended throughout the reports to allow for five colors.

Grey or black is used to indicate comments that are not significant at this site.

Red is clearly used for major issues.

Orange is used for issues that are of concern.

Off-yellow is used for areas where things are “on the edge” of becoming an issue.

The circles indicate where this particular site mostly lies.

I’ll share some more reports with you on Friday....

Adam Grummitt
Distinguished Engineer

Monday, 10 November 2014

The Finance House which was doing effective capacity management but wanted to improve governance (Mind the Gap series 6 of 10)

The Finance House, Triplex Finance House, TFH, was known to have a highly qualified and large team of IT professionals (nearly 2000 with about half involved in development, with up to another 1000 sub-contractors called in on demand for major projects).

The service provided is so critical that downtime has to be minimized and performance optimized.

Although there is a mix of domains, applications were becoming more and more multi-tier, so it was felt that the capacity plan needed to be enterprise wide. However, the degree of metrics and planning in each domain is somewhat variable. Also, the aggregation of a number of separate plans from different authors into a single document takes a lot of time and editorial passes before it is acceptable to all. Such a large document tends to develop a life of its own.

Although the capacity management processes were in place, the coverage was not complete and a lot of reports were out-of-date. Changes to the Configuration Management Database (CMDB) were not advised to the Capacity Management Team (CMT), so a significant number of essentially defunct machines were still being reported on by the various hand-crafted reporting regimes built up over the years.
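
The defunct-machine problem described above comes down to comparing two lists: the servers that reports still cover and the servers that configuration records say are live. A minimal sketch of that comparison follows; the server names and the two input lists are hypothetical, and a real site would pull them from its own reporting and configuration tools.

```python
def find_defunct(reported_servers, active_servers):
    """Servers that still appear in reports but are no longer active."""
    return sorted(set(reported_servers) - set(active_servers))


def find_unreported(reported_servers, active_servers):
    """Active servers that no report covers yet."""
    return sorted(set(active_servers) - set(reported_servers))


# Hypothetical data: names are invented for illustration only.
reported = ["app01", "app02", "db01", "web07"]
active = ["app01", "db01", "web08"]

defunct = find_defunct(reported, active)
unreported = find_unreported(reported, active)

print("Still reported but defunct:", defunct)
print("Active but not reported:", unreported)
```

Running a check like this whenever configuration changes are published would flag both stale reports and gaps in coverage.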

The main areas needing enhancement lay in those of communication with other teams such as development and testing.

Service Level Agreements (SLAs) had some performance criteria, essentially on throughput and often related to what were effectively batch jobs updating the data warehouse.

The initial gap analysis found that:

Services had been effectively categorized but the service catalogue was still emerging.

Resource/component capacity management processes were well established but service capacity management was just being introduced for some category 1 services.

Business capacity management is identified as the next stage and will require more work on business drivers, KPIs and QoS.

An initial dashboard for management on the capacity management process itself was well structured but was completed to show everything as “all green”. This had the unsurprising but unanticipated effect that the next request for expenditure was rejected as there were no current problems.

The reporting process was then designed to incorporate an accurate reflection of some key metrics like coverage of services, servers, deliverables and underpinning metrics.

The main conclusions from the study were the need to enhance:

Quality of info and data flows between CMT and

- service management

- change management

- configuration management

- production monitoring

Measures of business drivers and KPIs

Reporting to track and predict SLA violations

Service-server mapping and configuration changes

Ability to clone reports with automatic trending etc.

Ability to analyse reports and identify bottlenecks

Workload characterization and forecasting

Modeling scenario handling

Essentially it revealed a need for more coverage, better information flow and more automation.

On Wednesday I’ll be looking at a Retailer. In the meantime, don't forget to register for our next webinar, 'Demystifying z/OS Capacity Management for distributed planners': http://www.metron-athene.com/services/training/webinars/index.html

Adam Grummitt
Distinguished Engineer


Friday, 7 November 2014

Gap analysis - five studies in the hard pressed retail sector (Mind the Gap 5 of 10)

As mentioned at the beginning of this series the examples I’ve chosen to share with you are typical of the hard pressed retail sector where the economic downturn is having an impact on everything.

So, rather than capacity management defining what equipment is needed, it’s more likely to be told it will have 10% less money to provide the same services, including double the number of users (due to mergers and acquisitions), with the same hardware and 10% fewer staff!

Be frugal, do things just in time, but make sure that the mission critical services continue to be supported with high availability and good performance.

I’ve chosen five major studies at particular sites. They can be reviewed at a high level to demonstrate the variety of approaches and expertise involved.

A short study at a successful ecommerce site will demonstrate the difficulty of establishing business processes when the business growth was such that capacity planning amounted to deciding how many new machines should be added to each pool of the multi-tier solution every day.

A long study at a telecoms provider will show the situation where a CMT was doing CMP well, ensuring good use of existing capacity and planning well for the future, but without reporting it widely or well.

A short review of a public sector site will show that it had understood the requirement for ITSM processes and documented proposed ITIL processes in some detail but had little resource to actually do it.

A short study at a finance house will show that there was the experience and expertise within the data center for effective and efficient capacity management, but less coverage outside their own domain.

The ad hoc retailer was too busy reacting to ad hoc project demands to establish any processes.

Each of these five sites has a different level of CMP. Reporting on their own CMP activities varied from nil to extensive. But for different reasons, all five required the capacity management process to provide the measurement numbers they needed for all the services provided by IT.

The ecommerce site felt that non-optimal upgrades were fine so long as the growth continued. A short review of the capacity would remove the panics over performance.

The telecoms provider had the process in place but had no demonstrable deliverables or external measures of performance.

The public sector site had limited resources that were absorbed into daily performance issues and project work rather than establishing the framework to avoid such issues.

The finance house had previously worked to a protocol of triplexing everything and making sure that there were always three levels of support for every key component (such as power supply by grid, generators and solar panels with triplexing of each). However, as times changed, the move was towards duplexing and 50% spare capacity based on peak-of-peak predictions.
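
The arithmetic behind duplexing with 50% spare capacity can be sketched as follows. This is one reading of the protocol, not the finance house's own method: it assumes identical copies that share the load evenly, and that the surviving copies must absorb the full predicted peak after a given number of failures. The function names and the peak figure of 900 units are hypothetical.

```python
def max_safe_utilization(copies, failures_tolerated):
    """Highest normal utilization per copy that still survives the failures."""
    if copies <= 0:
        raise ValueError("copies must be positive")
    if not 0 <= failures_tolerated < copies:
        raise ValueError("failures_tolerated must be between 0 and copies - 1")
    return (copies - failures_tolerated) / copies


def required_capacity_per_copy(peak_of_peaks, copies, failures_tolerated):
    """Capacity each copy needs so the survivors can carry the predicted peak."""
    survivors = copies - failures_tolerated
    if copies <= 0 or survivors <= 0 or failures_tolerated < 0:
        raise ValueError("need at least one surviving copy")
    return peak_of_peaks / survivors


peak = 900  # hypothetical peak-of-peak demand, in arbitrary units

# Duplex, tolerating one failure: each copy runs at no more than 50%.
duplex_limit = max_safe_utilization(2, 1)
duplex_total = 2 * required_capacity_per_copy(peak, 2, 1)

# Triplex, tolerating two failures: each copy must carry the whole peak.
triplex_limit = max_safe_utilization(3, 2)
triplex_total = 3 * required_capacity_per_copy(peak, 3, 2)

print("Duplex utilization limit:", duplex_limit, "total capacity:", duplex_total)
print("Triplex utilization limit:", triplex_limit, "total capacity:", triplex_total)
```

Under these assumptions the duplex design buys two-thirds of the hardware of the triplex one, at the cost of tolerating one failure instead of two.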

For the retailer, life was dominated by the economic downturn and so everything was ad hoc with fewer staff and more servers.

In all cases, much the same approach was used for the gap analysis study.

I’ll be looking at the results of two of these studies in more detail, beginning on Monday with the Finance House, which was doing effective capacity management but wanted to improve its governance.

Adam Grummitt
Distinguished Engineer

Thursday, 6 November 2014

CMG 2014

The boxes are packed and a whirlwind week at Performance and Capacity at CMG 2014 is almost history.

Metron's had an active week here -- we presented 2 papers, participated in an APM panel, manned an exhibitor's table in Vendor Alley, and met a lot of people who are interested in implementing or maturing their capacity management processes.  Some of those we met at our table, others we met between sessions or during the coffee breaks, and others we met over evening dinners or even in the hotel bar when all of us were looking to wind down from a busy, but productive day.

Wednesday night we had a vendor presentation where people had the chance to see details about the future of athene® and also details on our mainframe solution athene® ES/1.

As someone who's been at 14 of the last 16 CMG events, I've learned a lot this week and have felt energized by the attendees and the presenters -- all of whom get together annually to exchange knowledge.  The CMG has become a smaller event, but I'm encouraged by the number of first-time attendees this week who realize that performance and capacity management are important disciplines in today's dynamic data centers.

Now the hard work begins for everyone -- taking that knowledge and information gathered this week and using it in the next weeks and months to show a return on investment on their attendance at the event this week.  Metron will be right there to work with many of those attendees and their organizations to implement Capacity Management solutions to help make that return a reality.

Let's get to work.

Rich Fronheiser
Chief Marketing Officer

Wednesday, 5 November 2014

Procedures for a gap analysis (Mind the Gap series 4 of 10)


The procedure for each gap analysis consultancy will vary somewhat in the nature of things. A lot of effort will be put into a Statement of Work or similar document ahead of the assignment. Much of that may well be changed in the light of the emerging realities.

Nonetheless, the approach is usually much as indicated below:

The agreed range of platforms has to be defined - so that the scope of the review is clear from the start. This will in turn help to identify the key areas and hence the key players to be interviewed.

Talk to the Capacity Management Team - each interview, rather than trying to “score” activities on some arbitrary scale, is focused on current activities and deliverables.
Each interviewee is asked to submit sample performance and capacity reports, ideally some that they are proud of and some that they had pain in creating and any that they felt they should be able to generate but can’t.

Assess against relevant good practice checklists - all the discussions are held in the light of the checklists for Good Practice in the areas selected by the CIO.

Readily available reports submitted for analysis - all of the interviews and sample reports are then analysed and a review produced.

Reveal the gaps (known and unknown) – the review is produced revealing gaps and identifying SWOT (Strengths, Weaknesses, Opportunities and Threats) and next steps.

Depending on the site and the number of target systems, this exercise typically takes two to four weeks. It is essentially a short snapshot where the consultant is totally immersed in the material to produce a quick report to management. It is not a vehicle for an extended on-site army of management consultants.

I’ll be sharing some case studies with you on Friday. In the meantime why not join our Community and get access to white papers, on-demand webinars and more: http://www.metron-athene.com/_downloads/index.html

Adam Grummitt
Distinguished Engineer

Monday, 3 November 2014

The capacity management process (Mind the Gap series 3 of 10)



The above is as detailed as ITIL gets in describing the capacity management process, subprocesses and activities. It presents a useful summary of the main activities at three levels: the resource or component level, the service level and the business level. It’s a neat overview and shows the essential nature of the capacity database and capacity management information system at the heart of the process. It is a key part of the 50 pages or so that describe the capacity management process in ITIL. The same topic is also discussed in ISO/IEC 20000 but summarized to just a few pages.

This shows the spectrum of implementation of ITSM and capacity management within it across sites.



There are levels of process maturity suggested which correspond roughly to typical levels of ITIL implementation. However, any one application may move from one level to another during its life cycle, which on average is only around 18 months.
Thus, at the start, it may be subjected to intense performance engineering and QA trials to derive resource requirements. But once it has settled down it may well be little monitored, or perhaps simple utilization trends maintained.

Also, for any site, given limited resources, it's likely that decisions will be made to go further up the maturity grid for servers that are expensive (mainframes and super-servers) and for services that are mission critical. Hence there is a “curly bracket” indicating that most sites will choose to monitor all servers and services, but only trend significant services and only model expensive servers.
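
The “curly bracket” rule above can be sketched in a few lines. This is only an illustration under assumptions: the `significant` and `expensive` flags, the item names and the sample utilization figures are all hypothetical, and the least-squares slope stands in for whatever trending method a site actually uses.

```python
def activities_for(item):
    """Monitor everything; trend only significant items; model only expensive ones."""
    activities = ["monitor"]
    if item.get("significant", False):
        activities.append("trend")
    if item.get("expensive", False):
        activities.append("model")
    return activities


def trend_slope(samples):
    """Least-squares slope of equally spaced samples (change per interval)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


# Hypothetical inventory: names and flags are invented for illustration.
inventory = [
    {"name": "file-server", "significant": False, "expensive": False},
    {"name": "order-service", "significant": True, "expensive": False},
    {"name": "mainframe", "significant": True, "expensive": True},
]

plan = {item["name"]: activities_for(item) for item in inventory}

# Hypothetical weekly average CPU utilization, in percent.
weekly_cpu_percent = [40, 42, 44, 46]
growth_per_week = trend_slope(weekly_cpu_percent)

for name, activities in plan.items():
    print(name, "->", ", ".join(activities))
print("CPU growth per week (percentage points):", growth_per_week)
```

Moving the bracket up the maturity grid then amounts to setting the flags on more of the inventory.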

My experience is that the higher up the grid that curly bracket is implemented, the higher is the regard for IT within the company and its perception externally. These are important considerations for a CIO whose life cycle on average is also around 18 months… so he or she is likely to be very interested in a gap analysis of what is actually going on in their capacity management domain.

On Wednesday I’ll be taking a look at the procedures for gap analysis.

Adam Grummitt
Distinguished Engineer