Suppose you choose to capture performance measurements from your servers once every 5 minutes, which is a reasonable compromise between losing some detail and drowning in data. A week or a month later, you can’t go back and ask for more data points during the same time period – the opportunity has come and gone. Any practical data collection exercise involves taking a sample of all the theoretically possible values. The question arises – how confident can you be that the data points you happened to sample are in fact representative of the underlying “truth” (whatever that is)? Do you want to be 90% certain that they are representative (i.e. you will accept a 10% chance that they are not representative), or 95% certain, or 99% certain? This is where Confidence Intervals can help.
A Confidence Interval is represented by a pair of lines, one on each side of the trend line. The detailed mathematical description of a confidence interval is well outside the scope of this article. However, put simply, it shows how sure you can be that the “real” trend falls within a given range, given the finite sample of measurements that you have necessarily made. When you ask for confidence lines to be drawn around a trend, you need to specify the “confidence level”, which mathematically must be under 100% – you can never be completely confident, based on just a sample. As mentioned, common values of confidence level are 90%, 95% and 99%.
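To make this concrete, here is a minimal sketch in Python (assuming NumPy and SciPy are available) of how confidence lines around a linear trend can be computed. The function name trend_with_confidence and all of its inputs are illustrative, not taken from any particular monitoring tool.

```python
import numpy as np
from scipy import stats

def trend_with_confidence(x, y, x_new, level=0.95):
    """Fit a linear trend to (x, y) and return the predicted value
    at x_new together with the lower and upper confidence lines."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    slope, intercept = np.polyfit(x, y, 1)
    y_hat = intercept + slope * x
    # Residual standard error, with n - 2 degrees of freedom because
    # two parameters (slope and intercept) were estimated.
    s = np.sqrt(np.sum((y - y_hat) ** 2) / (n - 2))
    t_crit = stats.t.ppf(0.5 + level / 2, df=n - 2)
    x_bar = x.mean()
    sxx = np.sum((x - x_bar) ** 2)
    pred = intercept + slope * x_new
    # Half-width of the confidence band for the trend at x_new; it grows
    # as x_new moves away from the centre of the data, which is why the
    # confidence lines diverge when you extrapolate.
    half = t_crit * s * np.sqrt(1.0 / n + (x_new - x_bar) ** 2 / sxx)
    return pred, pred - half, pred + half
```

Note that this bounds the trend itself; a band for individual future measurements (a prediction interval) is wider still, with an extra 1 added under the square root.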
Here is a practical example of a projected trend, with confidence intervals set at the 95% level. The data comes from a server supporting an application that was taking on an increased number of users over a period of a few months. The objective was to predict the likely CPU utilization during an average hour at the beginning of December, so that any required upgrade could be planned in advance of the Christmas holiday season.
A simple linear trend, based on one calendar month’s worth of data, leads to a predicted average hourly CPU utilization of 61.4% on 1st December.
Look at the pair of confidence lines. They are diverging rather rapidly. Generally this means that there aren’t enough data points for a reliable prediction. On the evidence available, you can be “confident” that the projected average CPU utilization on 1st December will be somewhere between about 35% and about 90%. This is a pretty wide range, which you could probably have guessed without any need for statistical support.
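To get a feel for why the lines diverge, here is a hypothetical run of the sketch above on synthetic data. The figures below are invented for illustration; they are not the measurements from this server.

```python
import numpy as np

# Illustrative only: synthetic daily-average CPU figures for one month,
# an upward trend plus noise, then an extrapolation 30 days past the end
# of the data, roughly mirroring the 1st December scenario above.
rng = np.random.default_rng(42)
days = np.arange(30.0)                      # one month of daily averages
cpu = 30 + 0.35 * days + rng.normal(0, 8, days.size)

pred, lo, hi = trend_with_confidence(days, cpu, x_new=60, level=0.95)
print(f"Projected CPU on day 60: {pred:.1f}% "
      f"(95% band: {lo:.1f}% to {hi:.1f}%)")
# With only 30 points and a 30-day extrapolation, the band is wide.
# Another month of data both adds points and shortens the extrapolation,
# so the confidence lines pull in noticeably.
```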
To see what happened at the end of the following month, join me again on Friday.
Andy Mardo
Product Manager