Wednesday, 27 February 2013

Capacity Management for the mainframe

The mainframe, while retaining its traditional central role in IT organizations, has evolved to also become the primary hub in large distributed networks, making managing its performance as critical as ever.

Many organizations' mainframe experts are coming up to retirement age, and the worry is often about who will replace them. The breadth of the mainframe also means there are specialists in individual areas such as DB2 and CICS. Companies are faced with a dilemma: in the current climate they have barely enough staff to cover everything as it is.

So it's no wonder that companies have wished for automated mainframe systems performance analysis: software that can provide an independent view across all areas.

This, however, has not been so easy to find... until now.

We know what a pivotal role the mainframe plays in many large companies, and we’ve launched our new ES/1 NEO, expert systems performance analysis for z/OS.

It pulls together analysis and expertise across all major mainframe areas such as DB2, CICS, WAS, IMS and zVM, and it automatically highlights potential problem areas.  More to the point, it has a vast library of potential recommended solutions to any problem identified, meaning it can provide independent advice at any time for hard-pressed teams, whether to resolve disputes or when skilled staff are in short supply.

To find out more about ES/1 NEO, come along to my webinar on 28th February: http://www.metron-athene.com/services/training/webinars/index.html 

Charles Johnson
Principal Consultant


 

Monday, 25 February 2013

Capacity Management: Guided Practitioner Satnav – Why should I go there? (17 of 17)

The conclusion of this series is that capacity management is a worthwhile practice when performed well.  It needs clear objectives and good instrumentation. 

So, to answer the question ‘Why should I go there?’: because when Capacity Management is practiced well it saves money.
It does, however, require some expertise and a good attitude from all as regards openness, transparency, avoidance of a silo mentality, good liaison and the right ABC environment.

From my research my key observations are:

          Capacity management GPS has been applied at a number of sites

          Most sites are not where management think they are

          Most sites have people who know the real situation

          It takes openness & technical awareness to reveal the truth

          Demand management is often minimal

          Project management is often ‘über alles’ (it tends to dominate everything else)

          Performance is often an after-thought

          Next steps are often short, medium and long term

          Usually related to liaison as much as process

          Often related to making more use of extant tools

Hopefully not all reports are filed on the shelf; the more business-friendly you make yours, the more likely it is to be read.
Avoid the pitfalls in the above points and CM should pay you back many times over.

To reiterate... Why should I go there?
Because when Capacity Management is practiced well it saves money. All it needs is in-house believers to carry it forward...

Don't forget: there's still time to enter for a chance to win a signed copy of my book; details can be found at http://www.metron-athene.com/socnet/index.html

Adam Grummitt
Distinguished Engineer

Friday, 22 February 2013

Capacity Management: Guided Practitioner Satnav – Capacity Plan (16 of 17)

A longer term goal should be the production of the Capacity Plan.  This is a key deliverable from the Capacity Management process and provides a capacity strategy aligned to the business. 

The capacity plan is written in such a way that business forecasts are translated to technical requirements.  This can include discussion of saturation points of existing architecture (which may be determined via modeling) and recommendations to avoid hitting those saturation points.
The scope of the capacity plan is strategic (more long-term), tactical (short-term future), and operational (discussing day-to-day issues).

The main goal of the capacity plan is the same as the main goal of the process (not coincidentally): To describe the most cost-effective way to meet requirements (SLAs) both now and in the future.
Scenario planning can be a major component of a capacity plan, as potentially many different scenarios could occur in the next business planning cycle and Capacity Management should consider all of these.
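To make the earlier point about determining saturation points via modeling a little more concrete, here is a minimal sketch of the simplest possible approach: a straight-line projection of utilisation growth towards an assumed saturation threshold. The growth rate, the 70% threshold and the figures in the example are all assumptions for illustration, not taken from any particular site or tool.

```python
# Minimal sketch: project a linear utilisation trend to an assumed saturation point.
# The growth rate and the 70% saturation threshold are illustrative assumptions.

def months_to_saturation(current_util_pct, monthly_growth_pct_points, saturation_pct=70.0):
    """Months until utilisation reaches the assumed saturation threshold."""
    if current_util_pct >= saturation_pct:
        return 0.0
    if monthly_growth_pct_points <= 0:
        return float("inf")  # flat or shrinking workload never saturates
    return (saturation_pct - current_util_pct) / monthly_growth_pct_points

# Example: a cluster at 45% peak CPU, growing 2.5 percentage points a month,
# reaches the assumed 70% saturation threshold in about 10 months.
print(round(months_to_saturation(45.0, 2.5), 1))  # -> 10.0
```

Real capacity plans would of course use proper workload forecasts and analytic or simulation models rather than a straight line, and would repeat the calculation for each business scenario being considered.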


The diagram above provides a very high-level view of the process of producing the Capacity Plan; currently the majority of the capacity data collected is at the component level.  This is an essential stage in the process development and provides a solid foundation for moving to the next levels: service and business capacity.  The boxes highlighted in orange indicate that these areas are currently missing with regard to the virtualisation environment, although it is acknowledged that there is a drive within the ESM team to monitor the virtualised servers at the service level. Moving forward, it is expected that the appointment of a Capacity Manager will start to address these absent parts of the process.
Procedures and work instructions are the next step.
The ‘top level’ slides in this and my previous blog need to be underpinned by more detailed tables identifying a skeleton framework for the infrastructure, such as some of the examples indicated in this slide:
Terms of Reference
Process descriptions
Procedures and work instructions
KPIs
VI Metrics
There is too much detail to put into a slide, but hopefully you’ll find this indicative of the type of detail required to take a project forward.
I’ll conclude my series on Monday with ‘Why should I go there?’
Adam Grummitt
Distinguished Engineer

Capacity Management: Guided Practitioner Satnav – Longer term improvements (15 of 17)

CDB/CMIS: The CDB/CMIS is critical in providing an effective capacity management function or process. There are quite a few data sources that provide important information with regards to management of the virtual infrastructure capacity. An important step is to start building the culture where people are exploiting this information and making more informed decisions about capacity.


Demand Management: There needs to be more analysis of the actual resources used by a project and the upgrades it incurs.  This leads to a requirement for workload characterisation.  This can also be used to implement more effective application consolidation (to reduce software costs).  It also paves the way to identify moribund applications that should be retired.

Utility chargeback: Currently all charging undertaken is notional.  The current process makes an initial estimate of the required infrastructure, based on input from the technical architect and, where appropriate, the recommended specifications of the application. The process should be enhanced to capture actual usage associated with a process.  This refinement should lead to an improved specification process and may reduce the need to expand the current estate.
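As a purely illustrative sketch of the enhancement suggested above, comparing the initial sizing estimate with captured actual usage and deriving a notional charge from the actuals might look something like this; the figures, the charging rate and the function names are assumptions for the example.

```python
# Illustrative sketch only: compare a project's initial sizing estimate with
# measured actual usage, and derive a notional charge from the actuals.
# The rate and the example figures are assumptions.

def notional_charge(actual_cpu_hours, rate_per_cpu_hour=0.05):
    """Notional charge based on measured CPU hours (assumed rate)."""
    return actual_cpu_hours * rate_per_cpu_hour

def over_specification(estimated_vcpus, actual_peak_vcpus):
    """Ratio > 1 means the initial estimate was larger than what was actually needed."""
    return estimated_vcpus / actual_peak_vcpus

# Example: a project sized at 16 vCPUs that never uses more than 4 is a
# candidate for right-sizing, and the notional charge reflects actual usage.
print(over_specification(16, 4))              # -> 4.0
print(notional_charge(actual_cpu_hours=720))  # -> 36.0 notional units
```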
Capacity plan: A key part of capacity management is being able to link the needs of the business to the current capacity and determining how those changing needs will affect the underlying infrastructure.  This is covered typically by a formal capacity plan.  This could be considered the hardest part of the process to implement and is reliant on having a sound component and service capacity process in place.

A longer term goal should be the production of the Capacity Plan and I’ll look at this in more detail next week…..
Adam Grummitt
Distinguished Engineer

Wednesday, 20 February 2013

Capacity Management: Guided Practitioner Satnav – Medium term improvements (14 of 17)

Still sticking with the status of capacity management within the Wintel virtual infrastructure (VI) farm at a large enterprise, we can now look at medium-term improvements.
SPM: Currently there is a recruitment exercise for a role to be responsible for Capacity Management.  This is an essential step in providing the appropriate process structure to unify the reporting, data sources and start to move towards a more pro-active approach.
Proactive: The focus is currently on system or component level monitoring.  Whilst this gives an accurate report of how the cluster, host or guest is performing, the next step is to monitor at the process level. Gathering this process-level information is the first step in workload characterization and a requirement for the proposed longer-term step of a further application consolidation activity.  Also at the process level is the analysis known as ‘process pathology’, which can be done even at the VM level to identify ‘rogue’ VMs or processes and ‘flatline’ VMs or applications.  Rogue processes include memory leaks, program loops and other problems that do not show up in an overall system level analysis but yield apparent (but unnecessary) excessive demands for resources.
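A minimal sketch of the kind of ‘process pathology’ check described above might look like the following; the thresholds and sample series are made up for illustration, not taken from any monitoring product. A steadily growing memory series suggests a possible leak (a ‘rogue’ process), while near-zero CPU throughout the window suggests a ‘flatline’ VM.

```python
# Minimal sketch of 'process pathology' style checks on per-VM or per-process samples.
# Thresholds and sample data are illustrative assumptions, not tool defaults.

def looks_like_memory_leak(mem_samples_mb, min_growth_mb=100, dip_tolerance=0.02):
    """Flag a series that grows almost monotonically by a significant amount."""
    growth = mem_samples_mb[-1] - mem_samples_mb[0]
    dips = sum(1 for a, b in zip(mem_samples_mb, mem_samples_mb[1:]) if b < a)
    return growth >= min_growth_mb and dips <= dip_tolerance * len(mem_samples_mb)

def looks_flatline(cpu_samples_pct, max_cpu_pct=1.0):
    """Flag a VM whose CPU never rises above a trivial level: a retirement candidate."""
    return max(cpu_samples_pct) <= max_cpu_pct

leaky_memory = [500 + 10 * i for i in range(48)]   # grows ~470 MB over the window
idle_cpu = [0.2, 0.1, 0.3, 0.2, 0.1, 0.2]

print(looks_like_memory_leak(leaky_memory))  # -> True  (possible rogue process)
print(looks_flatline(idle_cpu))              # -> True  (possible flatline VM)
```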
Services: The next stage in the maturity of the capacity management process is to work with the service level management process in establishing response time thresholds with the SLA and generally linking the performance of a service with the underlying infrastructure. Moving forward this will provide an additional KPI as to the relative performance benefits of the virtualisation project and it will provide a valuable feed into the service level management process.
Portal: The proposed vehicle for a reporting regime is the capacity management portal.
The capacity reporting available and how it should be used are discussed throughout this blog series.  The process diagram below should further clarify both the sources of information and the expected output.
Whilst it is acknowledged that the unification of the various independent databases holding performance data is unlikely, at least in the short-term, the key part of the requirement is the production of a set of capacity reports that provide a ‘pane of glass’ view into exactly how the virtual environment (and longer term the entire estate) is performing and how much capacity is available.

In order to provide that singular usable view it is important that the following steps be taken:
          Provide a common look and feel to the reports, independent of their source

          Ensure an appropriate educational exercise has taken place, i.e. make sure people know where the reports are available and what they mean

          Provide easy access to the capacity portal, usually via the web, and perhaps email for any exception reports
A longer term goal would be to integrate with the Configuration Database to incorporate important relationship and topological information.  This provides a valuable service-level capacity data feed and will allow reporting and planning activities to be undertaken at the service level as well as the component level.

I’ll be discussing longer term improvements on Friday………
Adam Grummitt
Distinguished Engineer

Monday, 18 February 2013

Capacity Management: Guided Practitioner Satnav - When will I get there? Categorization (13 of 17)

Previously we looked at the status of capacity management within the Wintel virtual infrastructure (VI) farm at a large enterprise and will now take a look at the categorization process.

The initial phase of the migration has been designed and managed using the Capacity Planner service provided by VMware.  Whilst this does an admirable job of calculating the physical requirements for virtualizing this list of potential candidates, it doesn’t provide any categorisation for the future allocation of these servers.
Being able to categorise the servers prior to virtualisation will provide:

          Additional verification on the sizing of the infrastructure

          Categorization of existing servers to provide a valuable feed into Capacity Planning, by attaching a resource requirement to an existing workload to create a frame of reference for future server requirements.
The resulting categorization should feed into the power classification used when specifying the VM or resource pool, failover requirements, etc.  The process provides a basic method for categorizing migration candidates and estimating how much capacity they will use.

The first stage is to determine the physical aspects of the categorization process and to calculate the normalised power rating for the physical server, providing a rating that reflects its usage relative to the rest of the estate. The second stage is to determine how this will relate to the virtual infrastructure.  This is done by calculating the power rating of the ESX server, determining the P2V ratio (i.e. how many virtual machines you plan to run per host) and then using this information to calculate the various categories.  In this case the server has a SpecInt of 40 with a peak utilization of 15% and hence a rating of 6.
Categorization example

Physical:         the original server has a peak CPU utilisation of 15% and a SpecInt rating of 40, so the normalised power rating (N) of the original physical server is derived as
N = S x (U/100) = 40 x (15/100) = 6

Virtual:             the new server to support the total VI has a SpecInt value of say 200 and the plan is to run with a P2V ratio of say 1:20 and the tiers to be defined are bronze, silver, gold and platinum:
Bronze             = 200 / 20                    = 10

Silver               = Bronze x 2   = 20
Gold                = Silver x 2                  = 40

Platinum          = uncapped
Thus the initial physical server above would be rated as bronze in P2V, with associated priorities, but could be promoted to a higher grade in the light of any business priority or user-reported problems. Or, to put it a different way, with these definitions it is possible to virtualise 20 bronze servers, 10 silver servers or 5 gold servers per ESX host.  Moving forward, we have an approximate way of determining the virtual requirements of a future migration candidate or a new server by comparing it with an existing installation. Using this process means that instead of saying you expect 100 projects of varying sizes, you may be in a position to put a more specific number on those predictions.
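The arithmetic above is simple enough to script. Here is a minimal sketch of the same categorisation steps, using the SpecInt values, P2V ratio and tier multipliers assumed in the example:

```python
# Minimal sketch of the categorisation arithmetic from the example above.
# The SpecInt values, P2V ratio and tier multipliers are the example's assumptions.

def normalised_power(specint, peak_util_pct):
    """N = S x (U/100): power rating of a physical server at its peak utilisation."""
    return specint * (peak_util_pct / 100.0)

def tier_thresholds(esx_specint=200, p2v_ratio=20):
    bronze = esx_specint / p2v_ratio               # 200 / 20 = 10
    return {"bronze": bronze, "silver": bronze * 2, "gold": bronze * 4}

def categorise(rating, tiers):
    for name in ("bronze", "silver", "gold"):
        if rating <= tiers[name]:
            return name
    return "platinum"                              # uncapped

tiers = tier_thresholds()
rating = normalised_power(specint=40, peak_util_pct=15)   # -> 6.0
print(rating, categorise(rating, tiers))                  # -> 6.0 bronze
```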

I’ll be looking at medium term improvements that can be expected on Wednesday…
Adam Grummitt
Distinguished Engineer

Friday, 15 February 2013

Capacity Management: Guided Practitioner Satnav - When will I get there? (12 of 17)

When will I get there? This year, next year, sometime, never?

SatNav gives a typical answer in hours and minutes, but the detailed time depends on the route selected and the options taken, and the answer is based on accumulated experience gathered over many journeys.
When faced with a vague question like ‘When will I get there?’, the best option is to try to break it down into something that can be more readily answered: perhaps into short term, medium term and long term.  The precise duration in terms of months and years is likely to be influenced by the local culture and appetite for change.

After a typical ‘gap analysis’ consultancy, the main deliverable is often perceived to be a checklist of objectives expressed in these sorts of terms.
The example that follows is one such.  It is based on the status of capacity management within the Wintel virtual infrastructure (VI) farm at a large enterprise.



Assets: The attributes in the VI candidate spreadsheet need to be reviewed in the light of the emerging classification as proposed in this report.  The current entries have been derived from a number of sources in a variety of asset registers and ideally would be standardised to reflect required actions. 
ESM: Currently there is a wide range of metrics captured and certainly a sufficient wealth of information to provision an informed capacity management process.  The current issue is the disconnection between the ESM team and the infrastructure teams.  The work undertaken by the ESM team has ensured that all key system level metrics are available and can be reported on.  The cultural issue means that people either aren’t aware of this availability or haven’t perceived a requirement for this data.
Metrics: There are many metrics available from Virtual Center which, if imported correctly, should provide an adequate range of VI capacity reporting opportunities.  The VM metrics required include any classifications agreed, such as resource capping, priority, backup, and other potential categories such as gold/silver/bronze services or small/medium/large resources.  The current KPIs are covered in the ESM, but need to be extended to add CPU utilisation per server, % reduction in response time, % reduction in physical estate and % reduction in software licence costs.
Reports: A large number of explicit new reports and associated interpretation activities are proposed.
The categorization process is something that I’ll deal with next time………
Adam Grummitt
Distinguished Engineer




Wednesday, 13 February 2013

Capacity Management: Guided Practitioner Satnav – Demand Management (11 of 17)

A typical management edict for IT is to ‘do more with less’. But typically there are more requests for work than resources. Demand management is commonly proposed as a way to understand and throttle demand from customers.

It is important as requests for projects often outstrip the resource capabilities of service providers.
Demand management is described as a capacity management activity within service delivery in ITIL V2, focusing on degradation of service due to unexpected increases in demand or partial interruptions to service. In ITIL V3 it is allocated to service strategy, with a wider view of its scope and its links with capacity management identified, but it is still focused on patterns of business activity and user profiles.

In this blog it’s used to include both of the above as well as establishing longer term practices to deal with handling requests for new services, avoiding un-necessary peaks in workload, provisioning of resources, setting service priorities and quotas, chargeback and related activities.  It covers the entire spectrum, from over-provisioning without regard to cost to under-provisioning such that there is no headroom and hence capacity problems.

          Control demand for resources to meet levels that the business is willing to support

          Optimize and rationalize demand for the use of IT to achieve optimum provision

        At one extreme, over-provisioning without regard to cost

        At the other, under-provisioning so that there is no headroom

          Understand and throttle/smooth peaks, if possible, in customer demand or priority

          Control degradation of service due to peaks in demand or downtime/slowtime

          Use budgets/priorities/chargeback/quotas for workloads and new services

          Use ‘levels of critical’ categorization for workloads (gold/silver/bronze)

          Plans for when business requirements cannot be fulfilled due to:

        HW or SW failure

        Unexpected budgetary constraints/ demand increase

          Decisions based on whether problems are short term or long term:

        Short-term: only mission critical services supported

        Long-term: management of resource constraints

          Need to identify the critical services and the resources they use

        Business plans, Service catalogue, Change requests, SIPs

        Service priorities and their mapping to resources

It is not just about how quickly a new system can be provisioned and how much faster that can be done when everything is virtualised.
You’ll also need to set some time constraints and on Friday I’ll look at ‘When will I get there?’

In the meantime there's a chance for one person to win a signed copy of my Capacity Management book (referred to in the first blog of this series). Simply subscribe to our blog or YouTube Channel, like us on Facebook or follow us on Twitter or LinkedIn between 31st January and 15th March inclusive to be entered into our drawing. 

Like, follow or subscribe to 3 media or more and receive an additional free entry.
Only one entry per person per media is valid and no cash alternative is available.
The winner will be notified and published after the drawing on 29th March 2013. 

Adam Grummitt
Distinguished Engineer

Monday, 11 February 2013

Capacity Management: Guided Practitioner Satnav – Availability (10 of 17)


Availability also needs to be addressed in a practical fashion, not with ideal objectives of 99.99% or similar imposed across the board without reference to the significance of the service or the cost of achieving the level suggested.

Why provide a gold-plated service for low priority discretionary work?

 




The above shows two popular approaches. The first is a blanket ‘99.99’ sort of approach, which is a gross averaging mechanism. Consider a single 8-hour downtime event: this would still leave 99.9% achievable over a year, but is only 93.3% for the week it occurred.

It is better to use a definition that reflects the local situation. Typically, outages of a given duration can be accepted within a given period: say up to 6 minutes of downtime is acceptable once a week, up to 60 minutes once a month, up to 4 hours once a quarter and up to 8 hours once a year. Taken together this equates to roughly 99.5%, but it is meaningful to users and can be monitored and policed effectively.
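To show where those percentages come from, here is a minimal sketch of the arithmetic. Note that the 93.3% weekly figure quoted above corresponds to an assumed 120-hour (5 x 24) service week; measured over a full 168-hour week the same 8-hour outage would be about 95.2%.

```python
# Minimal sketch of the availability arithmetic discussed above.
# The 120-hour service week is an assumption that reproduces the 93.3% figure.

def availability_pct(downtime_hours, window_hours):
    return 100.0 * (1 - downtime_hours / window_hours)

# A single 8-hour outage, measured against different windows:
print(round(availability_pct(8, 365 * 24), 1))  # -> 99.9  over a year
print(round(availability_pct(8, 5 * 24), 1))    # -> 93.3  over a 120-hour service week
print(round(availability_pct(8, 7 * 24), 1))    # -> 95.2  over a full 168-hour week

# The tiered definition: 6 min/week + 60 min/month + 4 h/quarter + 8 h/year
annual_downtime_hours = (6 / 60) * 52 + 1 * 12 + 4 * 4 + 8 * 1
print(round(availability_pct(annual_downtime_hours, 365 * 24), 2))  # -> 99.53
```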

Once that is achieved, other issues arise. What is ‘up’ and what is ‘slow time’?

Continuity and disaster recovery need to be sized to ensure that a reasonable service will continue to be supplied within the defined DR levels.

There are many practical architectural issues to be addressed: 

Data security to reduce the impact of DR:

– Backups made to tape/disk on site and sent off-site regularly

– Data replication to an off-site location so that only system sync is required

– High availability systems to keep both the data & system replicated

Precautionary measures:

– Local mirrors of systems and/or data and use of RAID

– Surge protectors, UPS and/or backup generator, fire prevention

– Antivirus, antibot software and other security measures

Stand-by site at:

– Own site with high availability

– Own remote facilities with SAN

– An outsourced disaster recovery provider

DR service:

– Priority of service determines whether it is included in the DR service

– Reduced performance and reduced traffic constraints under DR, as per the SLA

As well as all these, the performance and capacity of the DR site need to be properly sized. Given that effective capacity management is already in place for the production solution, extrapolation to assess the DR site is comparatively straightforward and may not require a ‘hot test’; models can be used to justify the configuration and cost of the Disaster Recovery site. 

This leads us nicely into my next topic, Demand Management, which I’ll cover on Wednesday... 

In the meantime there's a chance for one person to win a signed copy of my Capacity Management book (referred to in the first blog of this series). Simply subscribe to our blog or YouTube Channel, like us on Facebook or follow us on Twitter or LinkedIn between 31st January and 15th March inclusive to be entered into our drawing. 

Like, follow or subscribe to 3 media or more and receive an additional free entry.
Only one entry per person per media is valid and no cash alternative is available.
The winner will be notified and published after the drawing on 29th March 2013. 

Good Luck!  

Adam Grummitt

Distinguished Engineer

 


Friday, 8 February 2013

Capacity Management: Guided Practitioner Satnav – SLA definitions (9 of 17)

The SLA definition is often expressed in terms of normal and maximum peaks for traffic with mandatory and desirable performance measures.  This seems simple but the devil lies in the detail of the wording of the contract as regards what will happen when different boundaries are crossed. 

This slide shows the typical agreement, with green showing as AOK, red when things are definitely wrong and amber for all the areas of potential confusion.  Grey is used for ‘not applicable’.

The definitions should also cater for the levels of service in situations of disaster recovery or imposed demand management as well as standard production.
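As a purely illustrative sketch of how such a traffic-light assessment might be evaluated for a single metric (the thresholds and the rules below are my assumptions, not the values behind the slide):

```python
# Illustrative sketch: green/amber/red/grey SLA status for one response-time metric.
# Thresholds and rules are assumptions for the example, not the slide's actual values.

def sla_status(response_s, traffic_tps, target_response_s, normal_peak_tps, max_peak_tps):
    if traffic_tps > max_peak_tps:
        return "grey"    # traffic beyond the agreed maximum: target not applicable
    if response_s <= target_response_s:
        return "green"   # performance target met
    if traffic_tps > normal_peak_tps:
        return "amber"   # target missed, but traffic was above the normal peak
    return "red"         # target missed within agreed traffic levels

# Example: 3 s response at 120 tps, against a 2 s target,
# a normal peak of 100 tps and a maximum peak of 150 tps.
print(sla_status(3.0, 120, target_response_s=2.0, normal_peak_tps=100, max_peak_tps=150))  # -> amber
```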

SLAs tend to come in three major flavours: a three-page summary of little practical benefit, a thirty-page identification of all the metrics and values involved, and a three-hundred-page version of the same thing after lawyers have reworked it into a binding contract.

Next on our list to look at is Availability and I’ll deal with this on Monday.

In the meantime there's a chance for one person to win a signed copy of my Capacity Management book (referred to in the first blog of this series). Simply subscribe to our blog or YouTube Channel, like us on Facebook or follow us on Twitter or LinkedIn between 31st January and 15th March inclusive to be entered into our drawing.
Like, follow or subscribe to 3 media or more and receive an additional free entry.

Only one entry per person per media is valid and no cash alternative is available.

The winner will be notified and published after the drawing on 29th March 2013.

Good Luck!
Adam Grummitt
Distinguished Engineer

Wednesday, 6 February 2013

Capacity Management: Guided Practitioner Satnav – What else has to happen at the same time? (8 of 17)

Once we’ve selected our travelling companions we need to take a look at what else needs to be happening:

          SLAs with respect to performance and capacity

          Availability

          Continuity

          Demand Management

          Things done for real not by rote

          Exception reporting leading to actions

          Automated activities

          Proper use of tools
I’d say all of the above, in pragmatic terms, including measurable objectives and instrumented applications; but in their absence, just assume that the current level of service is acceptable and try to assess what level of degradation will prove unacceptable.

Let’s begin by taking a closer look at Service Level Agreements.
A Service Level Agreement is a contract, stored in a portfolio, which provides a yardstick for the service receiver and the service provider. It:

          Quantifies the obligations of provider and receiver and is more important if services are formally charged.

          Identifies functions that the service will provide and when.

          Needs measurable performance indicators:

- availability: down-time and slow-time event rates

- continuity: priority services performance

- performance: response means and percentiles

- capacity: traffic throughput means and percentiles (a measurement sketch for means and percentiles follows these lists)
Remember to keep them measurable, achievable and appropriate, and take into account:

          Service catalogue/portfolio, business needs

          Instrumentation for traffic levels and app counters

          Agreements with teeth that can be monitored & policed

          Normal, peak and exceptional service levels.
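Here is the measurement sketch referred to in the list of performance indicators above: a minimal, self-contained example of turning raw response-time samples into the mean and percentile figures an SLA might quote. The sample data and the 2-second 95th-percentile target are assumptions.

```python
# Minimal sketch: mean and percentile response-time indicators for SLA reporting.
# The sample data and the 2-second 95th-percentile target are assumptions.

def percentile(sorted_samples, pct):
    """Nearest-rank percentile of an already sorted list of samples."""
    rank = max(1, round(pct / 100.0 * len(sorted_samples)))
    return sorted_samples[rank - 1]

samples = sorted([0.8, 1.1, 0.9, 1.4, 2.6, 1.0, 1.2, 3.1, 0.7, 1.3])
mean = sum(samples) / len(samples)
p95 = percentile(samples, 95)

print(f"mean={mean:.2f}s  95th percentile={p95:.2f}s  meets 2s target: {p95 <= 2.0}")
```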
It is in the interests of both sides that the document is clear & measurable. On Friday I’ll cover SLA definitions.

In the meantime there's a chance for one person to win a signed copy of my Capacity Management book (referred to in the first blog of this series). Simply subscribe to our blog or YouTube Channel, like us on Facebook or follow us on Twitter or LinkedIn between now and 15th March inclusive to be entered into our drawing.
Like, follow or subscribe to 3 media or more and receive an additional free entry.

Only one entry per person per media is valid and no cash alternative is available.

The winner will be notified and published after the drawing on 29th March 2013.

Good Luck!
Adam Grummitt
Distinguished Engineer