Tuesday, 31 May 2016

The perils of the wrong type of aggregation


Looking at data and picking out the “story” it tells is often as much an art as a science when it comes to Capacity Management.  A gentle disbelieving of anything you are told also often makes for a better and quicker outcome than taking everything on face value.  Here’s a recent anecdote.

A customer raised an issue with Metron that after a software upgrade a database re-index job was taking a long time and that the application using the database had been working really slowly.  A hasty conference call / screen-sharing session was set up with us, the customer, his boss, and a SQL Server database administrator.  The conversation started with words to the effect of “this started after the upgrade, what’s going on?” - so we looked and we talked for a little while, then it came out that the database re-index job failed because it ran out of disk space.  The next comment “has the database got bigger because of the upgrade, then?”  That’s not our experience, but you never know….so we looked at a graph of the database size over time with a nice simple trend line over the top – the customer had already had this to hand. It looked a little like this:


With the disk size confirmed as 600 GB what was going on? This clearly shows housekeeping of the database as it grows, is shrunk down, grows again.  Even the trend line appears to be going down slightly.  The upgrade occurred in mid-April, so there was clearly no obvious jump up in database size at that time. 

The clue was the x-axis of the graph.  The dates were just the beginning of each month.  The chart seemed to have a nice neat shape to it – too neat, perhaps?
Looking further into the chart, the data was aggregated from the original 15 minute intervals up to the average for an entire day.  So what happens when we plot a chart of some of the later data points, showing each interval instead of the aggregated ones?

When did the re-index job start?…at 07:00 on May 1st.

That’s what polished off the remaining disk space.  The DBA killed the job and manually shrank the database at about 12:00.  During the time the disk had become full, and the application using this database detected this shortage of disk space and shut itself down to avoid losing data.  Only when it was restarted, which was some time after the space had been freed up, did it carry on.

Looking back at previous weekends the shape of the graph was the same each time - gently rising disk usage with a sharp, usually short increase during the time the re-index was taking place, dropping back to previous levels afterwards.

The previous weeks had survived, just, in those cases, so no-one had noticed how close the limit the disk space had become.  Now some more disk space has been added to cater for these weekly “spikes”, and the customer has a better handle on the growth rate of the database and the effect of the necessary but heavy weekly database maintenance.

Having the ability to aggregate large quantities of data into a simplified overview is useful for some things, but you do need to consider the “story” you are trying to tell with the resulting numbers.  Instead of an average of a large set of numbers that lowered the effective number, perhaps the better aggregation would have been something like “the peak value per day”, or “the aggregated hour containing the peak value”.  A straight average in this case hid the issue from sight, even with a trend line to try and predict how things were moving.

Metron’s athene® makes visualizing data simple, quick and intuitive and helps to keep your systems running by giving you that “over the horizon” view of what’s coming up, helping you run IT systems with no capacity surprises and having time to think about the best solution for the way ahead.
http://www.metron-athene.com/products/athene/index.html
Nick Varley
Chief Services Officer

No comments:

Post a Comment