Metron - Capacity Management: IT automatic reporting

Monday, 20 April 2015

Automatic reporting and alerting (10 of 10)

What do we actually need? What do we actually want to report on? How often do we want to report on it?

If we are applying threshold-based alerting to our reports, we need to ensure that the correct values are set. These values may be the utilizations or response times stated within an SLA, or they may be based on maximum resource usage. Failure to set the correct values may lead to incorrect alerts being produced, leading to unnecessary investigations, stress and panic. By including availability and response time information within your capacity reports, you improve the accuracy of your forecasts and increase confidence in them, whilst providing potential SLA breach information in advance.
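One way to catch badly set values before they generate misleading alerts is a simple sanity check on each threshold. The sketch below is only an illustration: the function name, the metric names and all of the numbers are assumptions, as is the rule that the alarm level should sit at or below the SLA limit so that an alarm is raised before a breach.

```python
def validate_threshold(name, warning, alarm, sla_limit):
    """Return a list of problems found with one metric's threshold settings.

    Assumes higher values are worse (e.g. utilisation or response time).
    """
    problems = []
    if warning >= alarm:
        problems.append(
            f"{name}: warning ({warning}) should be below alarm ({alarm})"
        )
    if alarm > sla_limit:
        problems.append(
            f"{name}: alarm ({alarm}) is above the SLA limit ({sla_limit}), "
            "so a breach could occur before any alarm is raised"
        )
    return problems


# Hypothetical settings: (warning, alarm, SLA limit) per metric.
settings = {
    "cpu_utilisation_pct": (70.0, 85.0, 90.0),
    "response_time_s": (2.5, 2.0, 1.8),  # deliberately wrong
}

for metric, (warning, alarm, sla_limit) in settings.items():
    for problem in validate_threshold(metric, warning, alarm, sla_limit):
        print(problem)
```

Running a check like this whenever thresholds are changed is cheaper than investigating a false alert afterwards.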

When creating forecast models, whether trend or analytical models or both, it is important to make sure that the inputs into your model are as accurate as possible, so that the predictions we make help us avoid costly or unnecessary performance issues and SLA breaches.

So let's ensure we have a Brighter Outlook. It is crucial that we get the information at all levels as described and store it, typically in a centralised database, so that we have it readily available. Producing ad hoc reports on infrastructure usage and the current performance of our applications, along with implementing automated reports based specifically on our SLA thresholds, enables us to produce early warning alerts on potential breaches and take appropriate action as necessary.

Guest and Host consolidation. If you have over-provisioned systems, look at the usage of your virtual machines against the configured resources to see if there is scope to consolidate your guests onto a smaller number of ESX hosts. You may also be able to have multiple applications running within the same VM rather than have many VMs each running a single application.
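As a rough first pass at the consolidation question, you can compare the summed peak demand of your guests with the usable capacity of a host. The sketch below is a back-of-the-envelope estimate only: the function name, the 25% headroom and the demand figures are all assumptions, and it looks at a single resource (CPU).

```python
import math


def hosts_needed(vm_peak_demand, host_capacity, headroom=0.25):
    """Estimate how many hosts are needed to carry the summed peak VM demand.

    headroom is the fraction of each host deliberately kept free.
    """
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in the range [0, 1)")
    usable_per_host = host_capacity * (1 - headroom)
    return math.ceil(sum(vm_peak_demand) / usable_per_host)


# Hypothetical figures: measured peak CPU demand per VM in GHz,
# to be placed on hosts that each provide 40 GHz.
vm_peak_cpu_ghz = [2.1, 0.8, 3.5, 1.2, 0.6, 4.0, 1.9, 2.4]
print(hosts_needed(vm_peak_cpu_ghz, host_capacity=40.0))
```

Summing individual peaks is conservative, because peaks rarely coincide. The same calculation would need repeating for memory and any other constrained resource, and the largest answer wins.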

Plan ahead. Produce trend reports and analytical models to predict the likely impact on your infrastructure and on the performance of the applications running within it. Then make the necessary recommendations on upgrades or configuration changes that prevent you encountering any SLA breaches and the associated impacts on services. All of this information should be included within a Service Capacity Plan, allowing us to make decisions on whether we need to upgrade, whether we need to standardise our hardware and what the associated costs are likely to be, so budgets can be accurately planned.

It can also help us decide on whether a more powerful and expensive server is actually required when maybe a less expensive, slightly less powerful server, will do just as good a job. Creating analytical models gives you the information you require to make those decisions.

Regular consultation and information sharing with Application teams and other Service Delivery teams will assist you in making the best decisions going forward and allows you to explain why you have made the stated recommendations.

I'll leave you with a quick look at monetary savings on Capital Expenditure (CAPEX) and Operational Expenditure (OPEX)

CAPEX

All the way through this series I have mentioned the possibility of reducing the number of servers required to host all of your virtual machines. This may enable you to make savings on the number of licenses you require and on the actual hardware costs. As you start to reduce your physical estate and consolidate ESX hosts, you can start to look at the possibility of reducing the size of your datacenter.

OPEX

Make savings by reducing the amount spent on maintenance and support as you reduce the number of servers required in your infrastructure to host your applications and services. By performing application sizing, we can assist in accurately provisioning resource requirements and help eliminate any potential overspend caused by over-provisioning. Further to this, we can actually reduce the physical server count, leading to a reduction in the size of a datacenter.

The savings from this approach include power usage (servers, cooling and lighting) and a reduction in emissions, achieved through reduced power consumption, through a reduction in components as we consolidate servers and, finally, through optimal sizing and provisioning.

In my series I've covered what Cloud computing is and how it is underpinned by Virtualisation, the benefits it can provide, things we should be aware of and how putting in place effective Capacity Management can save you time and money.

If you'd like further information on Capacity Management, there is a selection of papers available to download at http://www.metron-athene.com/_downloads/index.html and don't forget to register for my webinar 'Understanding VMware Capacity' at http://www.metron-athene.com/services/training/webinars/index.html

Jamie Baker
Principal Consultant

Wednesday, 25 March 2015

What do they really want to know? - Adding value to your reports with automatic interpretation

Probably the best way of adding value to reports is to automatically generate an interpretation of the data that is being presented. This relieves the analyst of the task of modifying the report text so that it matches the information in the charts. The final sections of my blog present the outline of an Automatic Advisor system, intended to facilitate web-based publication of complete performance reports with minimal user intervention.

Interpretation Techniques

Given a chart with its underlying data, it is practical to apply a number of analyses automatically. In most cases, the analysis can result in the automatic generation of an "exception incident", which will be e-mailed to a responsible person or team. Additionally, the performance analyst can specify that reports be generated and published only if certain exception conditions in fact occur. Depending on the circumstances, the results of an automatic analysis can be turned into automatic advice, which gives guidance on actions that should be taken to avoid a potential performance problem or to alleviate an existing problem.

The following list gives examples of types of automatic analysis:

Top N analysis. This analysis can determine the few busiest or most resource-hungry users, devices, Oracle sessions or similar. Simply identifying them is a good start, but it is better to see their pattern of activity over time.

Mean value versus thresholds. This is a simple and straightforward check that the mean value of a measured data item is not too high or too low. Failure to stay within threshold bounds can be made to generate an exception event.

Proportion of time within threshold ranges. Typically the performance analyst will want to set two threshold levels for the value of certain critical data items - a lower, warning threshold and a higher, alarm threshold. It is straightforward to report automatically on the proportion of the measurements that fall into each of the three ranges - below the warning value (and therefore satisfactory), between the warning and the alarm level, and above the alarm level. This gives valuable information about the relationship between peaks and averages.

Variability around the mean value. A given set of measurements will have a mean value, and each individual measurement will typically be some amount higher or lower than the mean value. It is often useful to categorise the measured value as "fairly constant", "rather variable" or "very variable" based on the proportion of time when the measured values are close to or far away from the mean value. Again, if variability is a concern, this analysis can be made to generate an exception event.

Trended value versus thresholds. A very useful automatic analysis is to determine the date at which the value of a particular metric is projected to exceed a certain threshold, or to reach some other predetermined boundary value (e.g. zero, 100% etc.). An exception can be generated on several different attributes of the trend, for example the fact that it will reach a boundary value or will cross a threshold value on or before a predefined date.

Correlation analysis. Used carefully and with a sensible selection of metrics, Correlation Analysis can identify causal as well as statistical relationships between data values. For example, it is easy to identify UNIX users or Windows processes whose activity has a large effect on total CPU utilisation. Similarly, the analysis can identify particular I/O devices that are associated with important warning metrics such as CPU Wait for I/O Completion.
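Three of the analyses above (proportion of time within threshold ranges, variability around the mean and trended value versus thresholds) can be sketched in a few lines of Python. This is a minimal illustration rather than how any particular product does it: the function names, the 10% band, the 90%/60% cut-offs and the sample data are all assumptions, and the trend is a plain least-squares straight line through equally spaced measurements.

```python
from statistics import mean


def proportion_in_ranges(values, warning, alarm):
    """Share of measurements below warning, between the levels, and at/above alarm."""
    n = len(values)
    below = sum(1 for v in values if v < warning)
    above = sum(1 for v in values if v >= alarm)
    return {"ok": below / n, "warning": (n - below - above) / n, "alarm": above / n}


def variability(values, band=0.10, constant_share=0.9, variable_share=0.6):
    """Label a series by the share of measurements within +/- band of the mean."""
    m = mean(values)
    close = sum(1 for v in values if abs(v - m) <= band * abs(m))
    share = close / len(values)
    if share >= constant_share:
        return "fairly constant"
    if share >= variable_share:
        return "rather variable"
    return "very variable"


def periods_until_threshold(values, threshold):
    """Periods after the last measurement at which a fitted line reaches threshold.

    Needs at least two measurements. Returns None if the trend is flat or
    falling (this sketch only handles an upper threshold).
    """
    n = len(values)
    xs = range(n)
    x_bar, y_bar = mean(xs), mean(values)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values)) / sxx
    if slope <= 0:
        return None
    intercept = y_bar - slope * x_bar
    crossing = (threshold - intercept) / slope
    return max(0.0, crossing - (n - 1))


# Hypothetical weekly mean CPU utilisation (%), oldest first.
weekly_cpu = [50, 52, 54, 56, 58]
print(proportion_in_ranges(weekly_cpu, warning=53, alarm=57))
print(variability(weekly_cpu))
print(periods_until_threshold(weekly_cpu, threshold=70))
```

Each result can be compared with a limit to raise an exception event, for example when the projected crossing falls within the next planning period.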

In order for an Automatic Advisor's reports to be accepted, they must be:

Trustworthy - i.e. the conclusions are recognisably correct and are based on firm evidence

Specific - i.e. the recommendations are specific enough to be acted on without the need for further detailed analysis

Understandable - many advice systems in the past have proved more difficult to understand than the relevant technical documentation itself.

Based on the types of interpretation outlined it is possible to offer trustworthy, specific and understandable advice about such things as:

CPU upgrades, for example if utilisation thresholds are currently being exceeded, or if trend analysis shows that they will be exceeded soon

Memory upgrades, for example if paging and swapping rates are (or will soon be) high, or if cache hit rates are low

Upgrades or tuning of the I/O subsystem, for example if particular devices are becoming hotspots, or if queuing is becoming a high proportion of I/O service time.
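To make the idea concrete, the three kinds of advice above could be driven by a small set of rules over the measured and trended values. The following is only a sketch: the dictionary keys, the cut-off values and the wording of the advice are all hypothetical, and a real advisor would draw them from the thresholds agreed for each system.

```python
def advise(m):
    """Turn a dict of measurements into a list of advice strings.

    Keys and cut-off values are illustrative assumptions only.
    """
    advice = []

    # CPU: exceeded now, or projected by trend analysis to be exceeded soon.
    weeks = m.get("weeks_to_cpu_threshold")
    if m["cpu_util_pct"] >= 85:
        advice.append("CPU: utilisation threshold is being exceeded; consider an upgrade.")
    elif weeks is not None and weeks <= 12:
        advice.append(f"CPU: trend reaches the threshold in about {weeks} weeks; plan an upgrade.")

    # Memory: high paging rate or low cache hit rate.
    if m["paging_per_s"] >= 200 or m["cache_hit_pct"] < 90:
        advice.append("Memory: paging is high or cache hit rate is low; consider more memory.")

    # I/O: devices where queuing is a high proportion of service time.
    for device, share in m["io_queue_share"].items():
        if share >= 0.5:
            advice.append(f"I/O: queuing is {share:.0%} of service time on {device}; tune or upgrade.")

    return advice


# Hypothetical measurements for one system.
sample = {
    "cpu_util_pct": 64,
    "weeks_to_cpu_threshold": 8,
    "paging_per_s": 40,
    "cache_hit_pct": 97,
    "io_queue_share": {"disk0": 0.15, "disk3": 0.62},
}
for line in advise(sample):
    print(line)
```

Keeping each rule tied to a named measurement is what lets the advice stay specific: the message can quote the device, the value and the projected date that triggered it.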

Each of the underlying reports will contain detailed information about the selected aspect of the selected system, including all the interpretation and advice described previously. For any item that is not shown as "happy", these drill-down reports will show trustworthy and specific advice for making it so.

Depending on the size of the installation and the number of systems being reported on, this Summary Status report could be produced at regular short intervals, so giving an effectively continuous summary of the installation's health.

In conclusion, producing a good report manually takes a lot of effort and there are a number of psychological factors to consider, in addition to the purely technical ones:

· What are the needs and interests of the intended recipient?

· How can the report be made credible and trustworthy?

A regime of automatic reports with intelligent interpretation can add significant value to the work of a system performance analyst.

The reports can be interesting, credible, trustworthy - and perhaps most important, timely.

The analyst is now free to concentrate on the serious business of maintaining and enhancing the performance that is provided to the people who really matter - the organisation's customers. For details on our Capacity Management solutions and services visit our website http://www.metron-athene.com

Rich Fronheiser
Chief Marketing Officer

Wednesday, 18 March 2015

What do they really want to know? – Automatic reporting

A performance analyst has to carry out the following sequence of actions:

· Write the outline of the report

· Obtain the relevant data from whatever sources are available

· Create graphs and tables of the data for the required period of time

· Insert these graphs and tables into the report document

· Dispatch the finished report to its intended recipient.

These activities can be time-consuming and tedious, especially when the only significant difference between one report and the next is the name of the server that it relates to, or the period of time that it covers.

However, notice that all those different kinds of reports have a number of common features that make them ideal candidates for automation. They are:

A regular production date. For example, a daily report will be produced at 9 am each day to display the previous day's data. A weekly report will be produced every Monday. A monthly report will be produced on the first of every month, and so on.

A consistent analysis period. The analysis period is the period of time that the graphs in a given report cover - a day, a week, a month, the year to date and so on. A common requirement for an analysis period is to go back some number of days, weeks or months from the date of report production.

A known, stable recipient list. Each report will be sent to a named individual or team, or is intended for saving in a "well-known location", for example a particular folder on a Web server.
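The relationship between the production date and the analysis period is easy to compute. Here is a minimal sketch using Python's standard datetime module; the function name and the three report kinds are assumptions, and it treats a week as Monday to Sunday.

```python
from datetime import date, timedelta


def analysis_period(production_date, kind):
    """Return the inclusive (start, end) dates that a report covers."""
    if kind == "daily":
        # The previous day's data.
        day = production_date - timedelta(days=1)
        return day, day
    if kind == "weekly":
        # The previous complete Monday-to-Sunday week.
        this_monday = production_date - timedelta(days=production_date.weekday())
        return this_monday - timedelta(days=7), this_monday - timedelta(days=1)
    if kind == "monthly":
        # The previous complete calendar month.
        end = production_date.replace(day=1) - timedelta(days=1)
        return end.replace(day=1), end
    raise ValueError(f"unknown report kind: {kind}")


start, end = analysis_period(date(2015, 4, 20), "weekly")
print(start.isoformat(), "to", end.isoformat())
```

Because the period is derived from the production date, the same report definition can be run on any schedule without editing dates by hand.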

Reports can be distributed to your audience in a number of ways but try to observe some common rules to engage your audience:

A common format or house style. Most people react well to having the same kind of information presented in the same way each time. This applies to:

- The sequencing of the report contents

- The appearance of the graphs and tables

- The means of transmission (via e-mail, on portal, etc).

If suitable automation is available, it means that in order to produce a regular report, the performance analyst only needs to carry out the following actions once:

Write an outline of the report. Ideally the outline should contain mostly "boilerplate" text that is not going to change from one issue of the report to the next, though of course the analyst may want to edit the text to match the graphs that are actually generated on any particular occasion.

Create "sample" graphs and tables, from existing data, to illustrate the report. The graphs should tell the story in the most understandable way and the next point expands upon this.

Specify a schedule of when the report is to be updated, for what period of time, and who is to receive it by what means of transmission.
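These three one-off actions amount to a report definition that the automation can reuse for every issue. The sketch below shows one possible shape for such a definition; the class, its field names and the sample values are all hypothetical, with the boilerplate outline held as a standard-library string template so that only the server name and period change from one issue to the next.

```python
from dataclasses import dataclass
from string import Template


@dataclass
class ReportDefinition:
    """Everything the analyst specifies once for a regular report."""
    name: str
    outline: Template       # boilerplate text with $server and $period placeholders
    charts: list            # names of the sample graphs and tables to regenerate
    schedule: str           # when the report is produced
    analysis_period: str    # how far back the graphs go
    recipients: list        # named individuals or teams
    delivery: str = "email" # means of transmission

    def render(self, server, period):
        """Fill in the boilerplate for one issue of the report."""
        return self.outline.substitute(server=server, period=period)


weekly = ReportDefinition(
    name="weekly-capacity",
    outline=Template("Weekly capacity report for $server, covering $period."),
    charts=["cpu_utilisation", "memory_usage", "io_response_time"],
    schedule="every Monday 09:00",
    analysis_period="previous week",
    recipients=["capacity-team@example.com"],
)
print(weekly.render(server="web01", period="13-19 April 2015"))
```

The same definition can then be rendered for each server in turn, which is where most of the manual effort is saved.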

Over the years, a great deal of work has been carried out to determine the kinds of graphical presentation that make data most easily understood and I’ll be looking at some practical examples of this on Friday.

Rich Fronheiser
Chief Marketing Officer