Looking at data and picking out the “story” it tells is often
as much an art as a science when it comes to Capacity Management. A gentle scepticism about anything you are
told also often makes for a better and quicker outcome than taking everything
at face value. Here’s a recent anecdote.
A customer raised an issue with Metron: after a software
upgrade, a database re-index job was taking a long time and the application
using the database had been running very slowly. A hasty conference call and screen-sharing
session was set up with us, the customer, his boss, and a SQL Server database
administrator. The conversation started
with words to the effect of “this started after the upgrade, what’s going on?”
We looked and talked for a little while, and then it came out that the
database re-index job had failed because it ran out of disk space. The next comment was “has the database got bigger
because of the upgrade, then?” That’s
not our experience, but you never know, so we looked at a graph of the database
size over time with a nice simple trend line over the top, which the customer
already had to hand. It looked a little like this:
With the disk size confirmed as 600 GB, what was going on? The chart
clearly shows housekeeping of the database as it grows, is shrunk down, and grows
again. Even the trend line appears to be
going down slightly. The upgrade
occurred in mid-April, and there was no obvious jump in database size
at that time.
The clue was the x-axis of the graph. The dates were just the beginning of each
month. The chart seemed to have a nice
neat shape to it – too neat, perhaps?
Looking further into the chart, we found that the data had been aggregated from
the original 15-minute intervals up to the average for an entire day. So what happens when we plot
some of the later data points, showing each interval instead of the aggregated values?
When did the re-index job start? At 07:00 on May 1st. That’s
what polished off the remaining disk space.
The DBA killed the job and manually shrank the database at about 12:00. In the meantime the disk had become full, and
the application using this database detected the shortage of disk space and
shut itself down to avoid losing data.
Only when it was restarted, some time after the space had been
freed up, did it carry on.
Looking back at previous weekends the shape of the graph was
the same each time - gently rising disk usage with a sharp, usually short
increase during the time the re-index was taking place, dropping back to
previous levels afterwards.
In previous weeks the system had survived, just, so
no-one had noticed how close to the limit the disk usage had come. Now some more disk space has been added to
cater for these weekly “spikes”, and the customer has a better handle on the
growth rate of the database and the effect of the necessary but heavy weekly
database maintenance.
Having the ability to aggregate large quantities of data
into a simplified overview is useful for some things, but you do need to consider
the “story” you are trying to tell with the resulting numbers. Instead of a daily average over a large set of
samples, which pulled the reported value well below the peak, perhaps the better aggregation would
have been something like “the peak value per day”, or “the aggregated hour
containing the peak value”. A straight
average in this case hid the issue from sight, even with a trend line to try
and predict how things were moving.
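To make the difference concrete, here is a minimal sketch in Python comparing a daily average with a daily peak over 15-minute samples. The sample values, the 400 GB baseline and the function name daily_aggregates are invented for illustration; only the 600 GB disk size and the 07:00 to 12:00 window come from the anecdote. It is not how athene® computes its aggregates.

```python
from statistics import mean

SAMPLES_PER_DAY = 96  # 24 hours of 15-minute intervals
DISK_GB = 600         # disk size quoted in the anecdote


def daily_aggregates(samples, per_day=SAMPLES_PER_DAY):
    """Return one (mean, peak) pair per full day of samples."""
    starts = range(0, len(samples) - per_day + 1, per_day)
    days = [samples[i:i + per_day] for i in starts]
    return [(mean(day), max(day)) for day in days]


# Invented data: two quiet days at 400 GB, then a day on which the
# re-index fills the disk from 07:00 to 12:00 (samples 28 to 47).
quiet_day = [400.0] * SAMPLES_PER_DAY
spike_day = [400.0] * 28 + [600.0] * 20 + [400.0] * 48
usage_gb = quiet_day + quiet_day + spike_day

for day, (avg, peak) in enumerate(daily_aggregates(usage_gb), start=1):
    print(f"day {day}: mean {avg:.1f} GB, peak {peak:.1f} GB, "
          f"headroom at peak {DISK_GB - peak:.1f} GB")
```

On the invented spike day the daily average sits at roughly 442 GB, which looks comfortable on a 600 GB disk, while the daily peak shows the disk completely full. Plotting or trending the peak column rather than the average would have put the weekly spikes on the chart.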
Metron’s athene® makes visualizing data simple, quick and
intuitive and helps to keep your systems running by giving you that “over the
horizon” view of what’s coming up, helping you run IT systems with no capacity surprises
and giving you time to think about the best solution for the way ahead.
http://www.metron-athene.com/products/athene/index.html
Nick Varley
Chief Services Officer