Probably the best way of adding value to reports is to generate automatically an interpretation of the data that is being presented. This relieves the analyst from the task of modifying the report text so that it matches the information in the charts. The final sections of my blog present the outline of an Automatic Advisor system, intended to facilitate web-based publication of complete performance reports with minimal user intervention.
Interpretation Techniques
Given a chart with its underlying data, it is practical to apply a number of analyses automatically. In most cases, the analysis can result in the automatic generation of an "exception incident", which will be e-mailed to a responsible person or team. Additionally, the performance analyst can specify that reports be generated and published only if certain exception conditions in fact occur. Depending on the circumstances, the results of an automatic analysis can be turned into automatic advice, which gives guidance on actions that should be taken to avoid a potential performance problem or to alleviate an existing problem.
The following list gives examples of types of automatic analysis:
Top N analysis. This analysis can determine the few busiest or most resource-hungry users, devices, Oracle sessions or similar. Simply identifying them is a good start but better is to see their pattern of activity over time.
Mean value versus thresholds. This is a simple and straightforward check that the mean value of a measured data item is not too high or too low. Failure to stay within threshold bounds can be made to generate an exception event.
Proportion of time within threshold ranges. Typically the performance analyst will want to set two threshold levels for the value of certain critical data items - a lower, warning threshold and a higher, alarm threshold. It is straightforward to report automatically on the proportion of the measurements that fall into each of the three ranges - below the warning value (and therefore satisfactory), between the warning and the alarm level, and above the alarm level. This gives valuable information about the relationship between peaks and averages.
Variability around the mean value. A given set of measurements will have a mean value, and each individual measurement will typically be some amount higher or lower than the mean value. It is often useful to categorise the measured value as "fairly constant", "rather variable" or "very variable" based on the proportion of time when the measured values are close to or far away from the mean value. Again, if variability is a concern, this analysis can be made to generate an exception event.
Trended value versus thresholds. A very useful automatic analysis is to determine the date at which the value of a particular metric is projected to exceed a certain threshold, or to reach some other predetermined boundary value (e.g. zero, 100% etc.) An exception can be generated on several different attributes of the trend, for example the fact that it will reach a boundary value or will cross a threshold value on or before a predefined date.
Correlation analysis. Used carefully and with a sensible selection of metrics, Correlation Analysis can identify causal as well as statistical relationships between data values. For example, it is easy to identify UNIX users or Windows processes whose activity has a large effect on total CPU utilisation. Similarly, the analysis can identify particular I/O devices that are associated with important warning metrics such as CPU Wait for I/O Completion
In order for an Automatic Advisor's reports to be accepted, they must be:
Trustworthy - i.e. the conclusions are recognisably correct and are based on firm evidence
Specific - i.e. the recommendations are specific enough to be acted on without the need for further detailed analysis
Understandable - many advice systems in the past have proved more difficult to understand than reading the relevant technical documentation itself.
Based on the types of interpretation outlined it is possible to offer trustworthy, specific and understandable advice about such things as:
CPU upgrades, for example if utilisation thresholds are currently being exceeded, or if trend analysis shows that they will be exceeded soon
Memory upgrades, for example if paging and swapping rates are (or will soon be) high, or if cache hit rates are low
Upgrades or tuning of the I/O subsystem, for example if particular devices are becoming hotspots, or if queuing is becoming a high proportion if I/O service time.
Each of the underlying reports will contain detailed information about the selected aspect of the selected system, including all the interpretation and advice described previously. For any item that is not shown as "happy", these drill-down reports will show trustworthy and specific advice for making it so.
Depending on the size of the installation and the number of systems being reported on, this Summary Status report could be produced at regular short intervals, so giving an effectively continuous summary of the installation's health.
In conclusion producing a good report manually takes a lot of effort and there are a number of psychological factors to consider, in addition to the purely technical ones:
· What are the needs and interests of the intended recipient?
· How can the report be made credible and trustworthy?
A regime of automatic reports with intelligent interpretation can add significant value to the work of a system performance analyst.
The reports can be interesting, credible, trustworthy - and perhaps most important, timely.
The analyst is now free to concentrate on the serious business of maintaining and enhancing the performance that is provided to the people who really matter - the organisation's customers. For details on our Capacity Management solutions and services visit our website http://www.metron-athene.com
Rich Fronheiser
Chief Marketing Officer