Metron - Capacity Management: May 2016

Tuesday, 31 May 2016

The perils of the wrong type of aggregation

Looking at data and picking out the “story” it tells is often as much an art as a science when it comes to Capacity Management. A gentle disbelief of anything you are told also often makes for a better and quicker outcome than taking everything at face value. Here’s a recent anecdote.

A customer raised an issue with Metron: after a software upgrade, a database re-index job was taking a long time and the application using the database had been running really slowly. A hasty conference call and screen-sharing session was set up with us, the customer, his boss, and a SQL Server database administrator. The conversation started with words to the effect of “this started after the upgrade, what’s going on?” We looked and talked for a little while, and then it came out that the database re-index job had failed because it ran out of disk space. The next comment was “has the database got bigger because of the upgrade, then?” That’s not our experience, but you never know, so we looked at a graph of the database size over time with a nice simple trend line over the top, which the customer already had to hand. It looked a little like this:

With the disk size confirmed as 600 GB, what was going on? The chart clearly shows housekeeping of the database: it grows, is shrunk down, and grows again. Even the trend line appears to be going down slightly. The upgrade occurred in mid-April, and there was no obvious jump in database size at that time.

The clue was the x-axis of the graph. The dates were just the beginning of each month. The chart seemed to have a nice neat shape to it – too neat, perhaps?

Looking further into the chart, the data was aggregated from the original 15 minute intervals up to the average for an entire day. So what happens when we plot a chart of some of the later data points, showing each interval instead of the aggregated ones?

When did the re-index job start?…at 07:00 on May 1st.

That’s what polished off the remaining disk space. The DBA killed the job and manually shrank the database at about 12:00. In the meantime the disk had become full, and the application using this database detected the shortage of disk space and shut itself down to avoid losing data. It only carried on when it was restarted, which was some time after the space had been freed up.

Looking back at previous weekends the shape of the graph was the same each time - gently rising disk usage with a sharp, usually short increase during the time the re-index was taking place, dropping back to previous levels afterwards.

In the previous weeks the disk had survived, just, so no-one had noticed how close to the limit the disk usage had come. Now some more disk space has been added to cater for these weekly “spikes”, and the customer has a better handle on the growth rate of the database and the effect of the necessary but heavy weekly database maintenance.

Having the ability to aggregate large quantities of data into a simplified overview is useful for some things, but you do need to consider the “story” you are trying to tell with the resulting numbers. Instead of an average of a large set of numbers that lowered the effective number, perhaps the better aggregation would have been something like “the peak value per day”, or “the aggregated hour containing the peak value”. A straight average in this case hid the issue from sight, even with a trend line to try and predict how things were moving.
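The difference between the two aggregations can be sketched in a few lines of Python. This is only an illustration: the 400 GB baseline and 598 GB spike are made-up figures, not the customer's data, and `build_day` is a hypothetical helper. Only the 15-minute interval, the 600 GB disk and the 07:00 to 12:00 window come from the story above.

```python
from statistics import mean

SAMPLES_PER_DAY = 96  # 24 hours of 15-minute intervals
DISK_GB = 600         # disk size from the anecdote


def build_day(baseline_gb, spike_gb, spike_start_hour, spike_end_hour):
    """Return one day of 15-minute samples with a flat baseline and one spike."""
    samples = []
    for i in range(SAMPLES_PER_DAY):
        hour = i * 15 / 60
        in_spike = spike_start_hour <= hour < spike_end_hour
        samples.append(spike_gb if in_spike else baseline_gb)
    return samples


# Illustrative values only (assumed, not measured).
day = build_day(baseline_gb=400, spike_gb=598,
                spike_start_hour=7, spike_end_hour=12)

daily_mean = mean(day)  # what the original chart plotted
daily_peak = max(day)   # what would have exposed the problem

print(f"daily average: {daily_mean} GB ({DISK_GB - daily_mean} GB apparently free)")
print(f"daily peak:    {daily_peak} GB ({DISK_GB - daily_peak} GB actually free)")
```

With these assumed numbers the daily average suggests plenty of free space, while the daily peak shows the disk within a couple of GB of full.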

Metron’s athene® makes visualizing data simple, quick and intuitive and helps to keep your systems running by giving you that “over the horizon” view of what’s coming up, helping you run IT systems with no capacity surprises and having time to think about the best solution for the way ahead.

http://www.metron-athene.com/products/athene/index.html

Nick Varley

Chief Services Officer

Monday, 23 May 2016

VMware Capacity Management

VMware is the go-to option for virtualization for many organizations, and has been for some time.
The longer it's been around, the more focus there is on making efficiency savings for the organization. This is where the Capacity Manager really needs to understand the technology, how to monitor it, and how to decide what headroom exists.

I'm running a VMware Capacity Management webinar this Wednesday May 25 (8am PDT, 9am MDT, 10am CDT, 11am EDT, 4pm UK, 5pm CEST) where I'll be taking a look at some of the key topics in understanding VMware Capacity.

Topics will include:

Why OS monitoring can be misleading
5 Key Metrics
Measuring Processor Capacity
Measuring Memory Capacity
Calculating Headroom in VMs

Look forward to seeing you there.

Dale Feiste

Principal Consultant

Friday, 20 May 2016

Top 5 Don'ts for VMware

As promised, today I’ll be dealing with the Top 5 Don’ts for VMware.

DON’T

1) Overcommit CPU (unless ESX Host usage is less than 50%)

I’m sure that most of you have heard of CPU Ready Time. CPU Ready Time is the time (in milliseconds) that a guest’s vCPUs spend waiting to run on the ESX host’s physical CPUs. This wait time can occur due to the co-scheduling constraints of operating systems and a higher CPU scheduling demand due to an overcommitted number of guest vCPUs against pCPUs. If all the ESX hosts within your environment have a low average CPU usage demand, then overcommitting vCPUs to pCPUs is unlikely to cause any significant rise in CPU Ready Time or impact on guest performance.

2) Overcommit virtual memory to the point of heavy memory reclamation on the ESX host.

Memory over-commitment is supported within your vSphere environment by a combination of Transparent Page Sharing, memory reclamation (Ballooning & Memory Compression) and vSwp files (Swapping). When memory reclamation takes place it incurs some memory management overhead and, if DRS is enabled in automatic mode, an increase in the number of vMotion migrations. Performance at this point can degrade due to the increase in overhead required to manage these operations.

3) Set CPU or Memory limits (unless absolutely necessary).

Do you really need to apply a restriction on usage to a guest or set of guests in a Resource Pool? By limiting usage, you may unwittingly restrict the performance of a guest. In addition, maintaining these limits incurs overhead, especially for memory, where the limits are enforced by Memory Reclamation. A better approach is to perform some proactive monitoring to identify usage patterns and peaks, then adjust the amount of CPU (MHz) and Memory (MB) allocated to your guest virtual machine. Where necessary guarantee resources by applying reservations.

4) Use vSMP virtual machines when running single-threaded workloads.

vSMP virtual machines have more than one vCPU assigned. A single-threaded workload running on your guest will not take advantage of those “extra” executable threads. Therefore extra CPU cycles used to schedule those vCPUs will be wasted.

5) Use 64-bit operating systems unless you are running 64-bit applications.

Whilst 64-bit operating systems are near enough the norm these days, do check that you need to use 64-bit, as it requires more memory overhead than 32-bit. Compare the benchmarks of 32/64-bit applications to determine whether it is necessary to use the 64-bit version.
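For item 1 above, the quantities involved can be sketched as follows. The function names are hypothetical and the input figures are made up for illustration. The conversion of a CPU Ready value in milliseconds to a percentage assumes the value was summed over a known sample interval (20 seconds is the interval commonly quoted for vCenter real-time charts) and belongs to a single vCPU.

```python
def vcpu_to_pcpu_ratio(total_vcpus, physical_cores):
    """Overcommit ratio: vCPUs allocated to guests per physical core on the host."""
    return total_vcpus / physical_cores


def cpu_ready_percent(ready_ms, interval_s):
    """Express a CPU Ready value (ms) as a percentage of the sample interval."""
    return 100 * ready_ms / (interval_s * 1000)


def overcommit_is_low_risk(host_cpu_usage_pct, threshold_pct=50):
    """Rule of thumb from item 1: overcommit only if host usage is under 50%."""
    return host_cpu_usage_pct < threshold_pct


# Illustrative values only (assumed, not measured).
ratio = vcpu_to_pcpu_ratio(total_vcpus=48, physical_cores=16)
ready_pct = cpu_ready_percent(ready_ms=1000, interval_s=20)
low_risk = overcommit_is_low_risk(host_cpu_usage_pct=35)

print(f"vCPU:pCPU ratio {ratio}:1, CPU Ready {ready_pct}%, low risk: {low_risk}")
```

Tracking the ratio alongside CPU Ready and host usage shows whether an overcommitted host is actually making guests wait.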

We're running a webinar on VMware Capacity Management on May 25th. Visit our website and sign up to come along.

http://www.metron-athene.com/services/webinars/index.html

Jamie Baker

Principal Consultant

Wednesday, 18 May 2016

Top 5 Do’s for VMware

I’ve put together a quick list of the Top 5 Do’s and Don’ts for VMware which I hope you’ll find useful. I’ll start with the Top 5 Do’s.

1) Select the correct operating system when creating your Virtual Machine. Why? The operating system type determines the optimal monitor mode and the optimal devices to use, such as the SCSI controller and the network adapter. It also specifies the correct version of VMware Tools to install.

2) Install VMware Tools on your Virtual Machine. Why? VMware Tools installs the Balloon Driver (vmmemctl.sys), which is used for virtual memory reclamation when an ESX host becomes imbalanced on memory usage, along with optimized drivers. It can also enable Guest to Host Clock Synchronization to prevent Guest clock drift (Windows Only).

3) Keep vSwp files in their default location (with VM Disk files). Why? vSwp files are used to support overcommitted guest virtual memory on an ESX host. When a virtual machine is created, the vSwp file is created and its size is set to the amount of Granted Memory given to the virtual machine. Within a clustered environment, the files should be located within the shared VMFS datastore located on a FC SAN/iSCSI NAS. This is because of vMotion and the ability to migrate VM Worlds between hosts. If the vSwp files were stored on a local (ESX) datastore, when the associated guest is vMotioned to another host the corresponding vSwp file has to be copied to that host and can impact performance.

4) Disable any unused Guest CD or USB devices. Why? Because CPU cycles are being used to maintain these connections and you are effectively wasting these resources.

5) Select a guest operating system that uses fewer “ticks”. Why? To keep time, most operating systems count periodic timer interrupts or “ticks”. Counting these ticks can be a real-time issue: ticks may not always be delivered on time and, if a tick is lost, time falls behind. If this happens, ticks are backlogged and then the system delivers ticks faster to catch up. However, you can mitigate these issues by using guest operating systems which use fewer ticks. Typical tick rates are Windows (66Hz to 100Hz) and Linux (250Hz). It is also recommended to use NTP for Guest to Host Clock Synchronization (KB1006427).

On Friday I’ll go through the Top 5 Don'ts. In the meantime, don't forget to register for our free 'VMware Capacity Planning' webinar.

http://www.metron-athene.com/services/webinars/index.html

Jamie Baker

Principal Consultant

Monday, 16 May 2016

Idle VMs - Why should we care? (3 of 3)

Last week I looked at the impact idle VMs can have on CPU utilization and memory overhead. Today I’m going to look at the amount of Disk or Datastore space used per idle VM.

Each idle VM will have associated VMDK (disk) files. The files are stored within a Datastore, which in most cases is hosted on SAN or NAS storage and shared between the cluster host members. If VMDKs are provisioned as "Thick Disks" then the provisioned space is locked out within the Datastore for those disks.

To illustrate this, a conservative scenario would be: 100 idle Windows VMs have been identified across the Virtual Infrastructure and each VM has a single "Thick" VMDK of 20GB used to house the operating system. This equates to 2TB of Datastore space being locked for use by VMs that are idle. You can expand this further by assuming that some, if not all, VMs are likely to have more disks, and of differing sizes.

The simple math will show you how much Datastore space is being wasted.
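That math can be written down as a small Python sketch. The function name is hypothetical; the 100 VMs and 20GB per disk are the figures from the scenario above, and 1TB is taken as 1000GB.

```python
def thick_space_locked_gb(vm_disks_gb):
    """Total Datastore space (GB) locked by thick-provisioned disks.

    vm_disks_gb is a list with one entry per VM, each entry being a list
    of that VM's thick VMDK sizes in GB.
    """
    return sum(sum(disks) for disks in vm_disks_gb)


# Scenario from the text: 100 idle VMs, each with one 20GB thick disk.
idle_vms = [[20] for _ in range(100)]
locked_gb = thick_space_locked_gb(idle_vms)

print(f"{locked_gb} GB locked ({locked_gb / 1000} TB)")
```

Because each VM is a list of disk sizes, the same function covers the expanded case where VMs have several disks of differing sizes.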

There is a counter to this, known as Thin Provisioning. By using Thin disks, in which the provisioned disk size is reserved but not locked, you would not waste the same amount of space as you would by using Thick Disks. Thin Provisioning also has the added benefit of letting you over-allocate disk space, which reduces the amount of up-front storage capacity required while incurring only minimal overhead.

Idle VMs - why you should care.

Identifying Idle VMs, questioning whether they are required, finding out who owns them and removing them completely will reduce or help eliminate VM sprawl and help to improve the performance and capacity of the Virtual Infrastructure by:

· reducing unnecessary timer interrupts

· reducing allocated vCPUs

· reducing unnecessary CPU and Memory overhead

· reducing used Datastore space

· making more efficient use of your Virtual Infrastructure, including improved VM to Host ratios and a reduction in additional hardware.

Hope that has helped you and don't forget to sign up for our VMware vSphere Capacity & Performance online workshop.

http://www.metron-athene.com/services/online-workshops/index.html

Jamie Baker

Principal Consultant

Friday, 13 May 2016

Idle VMs - Why should we care? (2 of 3)

In my previous blog I mentioned the term VM Sprawl and this is where Idle VMs are likely to factor.

Often VMs are provisioned to support short term projects, for development/test processes or for applications which have now been decommissioned. Now idle, they’re left alone, not bothering anyone and therefore not on the Capacity and Performance teams’ radar.

Which brings us back to the question. Idle VMs - Why should we care?

We should care for a number of reasons, but let's start with the impact on CPU utilization.

When VMs are powered on and running, timer interrupts have to be delivered from the host CPU to the VM. The total number of timer interrupts being delivered depends on the following factors:

· VMs running symmetric multiprocessing (SMP) hardware abstraction layers (HALs)/kernels require more timer interrupts than those running uniprocessor HALs/kernels.

· How many virtual CPUs (vCPUs) the VM has.

Delivering many virtual timer interrupts can negatively impact the performance of the VM and can also increase host CPU consumption. This can be mitigated, however, by reducing the number of vCPUs, which reduces the timer interrupts and also the amount of co-scheduling overhead (check CPU Ready Time).
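As a rough illustration of why the vCPU count matters, the sketch below uses a deliberately simplified model in which every vCPU receives timer interrupts at the guest's tick rate. That model, the function name and the figures (100 idle VMs, 2 vCPUs each, a 100Hz tick rate) are assumptions for illustration only; real interrupt rates depend on the guest operating system and HAL/kernel.

```python
def timer_interrupts_per_second(vms):
    """Estimate timer interrupts per second the host must deliver.

    vms is a list of (tick_rate_hz, vcpus) tuples, one per powered-on VM.
    Simplified model: each vCPU is interrupted at the guest's tick rate.
    """
    return sum(tick_hz * vcpus for tick_hz, vcpus in vms)


# Illustrative values only (assumed, not measured).
idle_vms = [(100, 2)] * 100

before = timer_interrupts_per_second(idle_vms)
# Same VMs after cutting each one back to a single vCPU.
after = timer_interrupts_per_second([(hz, 1) for hz, _ in idle_vms])

print(f"before: {before}/s, after reducing to 1 vCPU each: {after}/s")
```

Under this model, halving the vCPUs halves the interrupts the host has to deliver to VMs that are doing no useful work.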

Then there's the Memory management of Idle VMs. Each powered on VM incurs Memory Overhead. The Memory Overhead includes space reserved for the VM frame buffer and various virtualization data structures, such as Shadow Page Tables (using Software Virtualization) or Nested Page Tables (using Hardware Virtualization). This also depends on the number of vCPUs and the configured memory granted to the VM.

We’ll have a look at a few more reasons to care on Monday. In the meantime, why not complete our Capacity Management Maturity Survey and find out where you fall on the maturity scale? http://www.metron-athene.com/_capacity-management-maturity-survey/survey.asp

Jamie Baker

Principal Consultant

Wednesday, 11 May 2016

Idle VMs - Why should we care? (1 of 3)

The re-emergence of Virtualization technologies, such as VMware, Microsoft's Hyper-V, Xen and Linux KVM has provided organizations with the tools to create new operating system platforms ready to support the services required by the business, in minutes rather than days.

Indeed IT itself is a service to the business.

In more recent times, Cloud computing, which is itself underpinned by Virtualization, makes use of the functionality provided to satisfy:

on-demand resources
the ability to provision faster
rapid elasticity (refer to NIST's description of Cloud Computing)

Cloud computing makes full use of the underlying clustered hardware. Constant strides are being made by Virtualization vendors to improve the Virtual Machine (VM) to Host ratio, without affecting the underlying performance.

But, you may ask "What's this got to do with Idle VMs?"

Well, as I described earlier, Virtualization provides the means to easily and quickly provision virtual systems. Your CTO/CIO is going to demand a significant ROI once an investment in both the hardware and virtualization software has been made, possibly switching the focus to an increase in the VM to Host ratio.

“What's wrong with that?” I hear you say. Nothing at all, as long as you keep track of what VMs you are provisioning and:

what resources you have granted
what they are for

Failure to do so will mean that your quest for a good ROI and a satisfied Chief will be in jeopardy, as you’ll encounter a term most commonly known as VM Sprawl.

More about this on Friday.

In the meantime, why not register for our webinar 'VMware Capacity Management', taking place on May 25?

http://www.metron-athene.com/services/webinars/index.html

Jamie Baker

Principal Consultant