Metron - Capacity Management: disk space

Tuesday, 31 May 2016

The perils of the wrong type of aggregation

Looking at data and picking out the “story” it tells is often as much an art as a science when it comes to Capacity Management. A gentle disbelief of anything you are told also often makes for a better and quicker outcome than taking everything at face value. Here’s a recent anecdote.

A customer raised an issue with Metron: after a software upgrade, a database re-index job was taking a long time and the application using the database had been working really slowly. A hasty conference call / screen-sharing session was set up with us, the customer, his boss, and a SQL Server database administrator. The conversation started with words to the effect of “this started after the upgrade, what’s going on?” We looked and talked for a little while, and then it came out that the database re-index job had failed because it ran out of disk space. The next comment was “has the database got bigger because of the upgrade, then?” That’s not our experience, but you never know, so we looked at a graph of the database size over time with a nice simple trend line over the top, which the customer already had to hand. It looked a little like this:

With the disk size confirmed as 600 GB, what was going on? The chart clearly shows housekeeping of the database: it grows, is shrunk down, and grows again. Even the trend line appears to be going down slightly. The upgrade occurred in mid-April, and there was no obvious jump in database size at that time.

The clue was the x-axis of the graph. The dates were just the beginning of each month. The chart seemed to have a nice neat shape to it – too neat, perhaps?

Looking further into the chart, the data was aggregated from the original 15 minute intervals up to the average for an entire day. So what happens when we plot a chart of some of the later data points, showing each interval instead of the aggregated ones?

When did the re-index job start? At 07:00 on May 1st.

That’s what polished off the remaining disk space. The DBA killed the job and manually shrank the database at about 12:00. During that time the disk had become full, and the application using this database detected the shortage of disk space and shut itself down to avoid losing data. Only when it was restarted, some time after the space had been freed up, did it carry on.

Looking back at previous weekends the shape of the graph was the same each time - gently rising disk usage with a sharp, usually short increase during the time the re-index was taking place, dropping back to previous levels afterwards.

In the previous weeks the disk had survived, just, so no-one had noticed how close to the limit the disk space had come. Now some more disk space has been added to cater for these weekly “spikes”, and the customer has a better handle on the growth rate of the database and the effect of the necessary but heavy weekly database maintenance.

Having the ability to aggregate large quantities of data into a simplified overview is useful for some things, but you do need to consider the “story” you are trying to tell with the resulting numbers. A daily average over a large set of values pulled the reported number well below the short-lived peaks. A better aggregation here would have been something like “the peak value per day”, or “the aggregated hour containing the peak value”. A straight average in this case hid the issue from sight, even with a trend line to try and predict how things were moving.
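To make the difference concrete, here is a minimal Python sketch comparing a daily average with a daily peak over 15-minute samples. The sample values are invented for illustration (a 400 GB baseline with a spike to 600 GB between 07:00 and 12:00); only the 600 GB disk size and the 15-minute interval come from the anecdote, and the function names are hypothetical rather than anything from athene®.

```python
from statistics import mean

DISK_SIZE_GB = 600          # disk size quoted in the anecdote
SAMPLES_PER_DAY = 24 * 4    # one sample every 15 minutes


def synthetic_day(baseline_gb=400.0, spike_gb=600.0, start_hour=7, end_hour=12):
    """Return one day of made-up 15-minute usage samples with a single spike."""
    samples = []
    for i in range(SAMPLES_PER_DAY):
        hour = i / 4  # sample index converted to hour of day
        in_spike = start_hour <= hour < end_hour
        samples.append(spike_gb if in_spike else baseline_gb)
    return samples


def summarise(samples):
    """Aggregate one day of samples two ways: daily average and daily peak."""
    return {"average_gb": mean(samples), "peak_gb": max(samples)}


day = synthetic_day()
summary = summarise(day)
print(f"daily average: {summary['average_gb']:.1f} GB")
print(f"daily peak:    {summary['peak_gb']:.1f} GB")
print("disk full at peak:", summary["peak_gb"] >= DISK_SIZE_GB)
```

With these assumed numbers the daily average sits comfortably below the disk size while the daily peak reaches it, which is the kind of gap the averaged chart concealed.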

Metron’s athene® makes visualizing data simple, quick and intuitive and helps to keep your systems running by giving you that “over the horizon” view of what’s coming up, helping you run IT systems with no capacity surprises and having time to think about the best solution for the way ahead.

http://www.metron-athene.com/products/athene/index.html

Nick Varley

Chief Services Officer

Wednesday, 17 February 2016

Key Metrics for Effective Storage Performance and Capacity Reporting - Two Distinct Aspects of Storage Capacity (2 of 10)

Today let’s take a look at the two distinct aspects of data storage.

Data can come from all different directions to the disk.

Disk occupancy

Disks used to be very expensive but now the costs have come down dramatically and this cost factor has accelerated the growth of storage.

You may have too little storage, resulting in out-of-disk-space problems, but conversely you may have storage over-allocated. Often people put excessive storage space out there to ensure that they never run out, and don’t pay attention to how much they really need and what their growth is really going to be.

Below is a typical service center queuing diagram

Disk Performance Capacity

Response, IOPs

In many cases the requests are being sent out by an application or applications. There is a finite limit on the requests per second that can be satisfied, and beyond it a queue begins to form. Queuing theory comes into play where there are limits on I/O throughput, and at some point these limits will affect response time. The response impact transfers up through the application to the user and results in a slow response time, a performance problem.
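As a rough illustration of how response time climbs as a disk approaches its throughput limit, here is a minimal sketch using the textbook single-server M/M/1 formula, R = S / (1 - U). The choice of M/M/1 and the 5 ms service time are assumptions made for this example; real storage systems are more complex and this is not a model of any particular device.

```python
def mm1_response_time_ms(requests_per_sec, service_time_ms):
    """Mean response time for an M/M/1 queue: R = S / (1 - U)."""
    # Utilisation is arrival rate multiplied by service time (in seconds).
    utilisation = requests_per_sec * service_time_ms / 1000.0
    if utilisation >= 1.0:
        return float("inf")  # demand meets or exceeds capacity; the queue grows without bound
    return service_time_ms / (1.0 - utilisation)


# Assumed 5 ms service time, so the device saturates at 200 requests per second.
for iops in (50, 100, 150, 180, 195):
    print(f"{iops:>4} IOPS -> {mm1_response_time_ms(iops, 5.0):6.1f} ms")
```

Under these assumptions, response time at 100 requests per second is double the bare service time (10 ms against 5 ms), and it rises steeply as the request rate nears 200 per second.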

On Friday I’ll be looking at space utilization. In the meantime, why not sign up to our Community and get access to our great resources, free white papers, on-demand webinars and more. http://www.metron-athene.com/_resources/index.html

Dale Feiste

Principal Consultant