Metron - Capacity Management: July 2014

Thursday 31 July 2014

Top 5 Don’ts for VMware

As promised, today I’ll be dealing with the Top 5 Don’ts for VMware.

DON’T

1) Overcommit CPU (unless overall ESX Host usage is <50%).

Why? I’m sure that most of you will have heard of CPU Ready Time. CPU Ready Time is the time (in milliseconds) that a guest’s vCPUs spend waiting to run on the ESX Host’s physical CPUs. This wait time can occur due to the co-scheduling constraints of operating systems and a higher CPU scheduling demand caused by overcommitting guest vCPUs against pCPUs. If all the ESX hosts within your environment have, on average, a low CPU usage demand, then overcommitting vCPUs to pCPUs is unlikely to cause any significant rise in CPU Ready Time or impact on guest performance.
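
As a rough illustration of the two numbers discussed above, here is a minimal Python sketch that works out a host's vCPU to pCPU ratio and turns a Ready Time figure in milliseconds into a percentage of the sample interval. The function names, the example host and the 20 second default interval are assumptions made for the example, so substitute the interval your own monitoring tool uses.

```python
def overcommit_ratio(total_vcpus: int, physical_cpus: int) -> float:
    """Return the vCPU:pCPU ratio for one ESX host."""
    if physical_cpus <= 0:
        raise ValueError("physical_cpus must be positive")
    return total_vcpus / physical_cpus


def ready_percent(ready_ms: float, interval_s: float = 20.0) -> float:
    """Express a CPU Ready figure (ms) as a percentage of the sample interval.

    interval_s is an assumption: pass the interval your monitoring tool
    actually samples at.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return ready_ms * 100.0 / (interval_s * 1000.0)


# Hypothetical host: 48 guest vCPUs scheduled on 16 physical CPUs.
ratio = overcommit_ratio(48, 16)
# Hypothetical sample: one vCPU reported 1,000 ms of ready time in 20 seconds.
pct = ready_percent(1000.0)
print(f"overcommit {ratio:.1f}:1, ready {pct:.1f}%")
```

Tracking both values side by side over time makes it easier to see whether a rising overcommit ratio is actually being followed by a rise in Ready Time.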

2) Overcommit virtual memory to the point of heavy memory reclamation on the ESX host.

Why? Memory over-commitment is supported within your vSphere environment by a combination of Transparent Page Sharing, memory reclamation (Ballooning & Memory Compression) and vSwp files (Swapping). When memory reclamation takes place it incurs some memory management overhead and, if DRS is set to automatic, an increase in the number of vMotion migrations. Performance at this point can degrade due to the increase in overhead required to manage these operations.

3) Set CPU or Memory limits (unless absolutely necessary).

Why? Do you really need to apply a restriction on usage to a guest or set of guests in a Resource Pool? By limiting usage, you may unwittingly restrict the performance of a guest. In addition, maintaining these limits incurs overhead, especially for memory, where the limits are enforced by Memory Reclamation. A better approach is to perform some proactive monitoring to identify usage patterns and peaks, then adjust the amount of CPU (MHz) and Memory (MB) allocated to your guest virtual machine. Where necessary guarantee resources by applying reservations.

4) Use vSMP virtual machines when running single-threaded workloads.

Why? vSMP virtual machines have more than one vCPU assigned. A single-threaded workload running on your guest will not take advantage of those “extra” executable threads. Therefore extra CPU cycles used to schedule those vCPUs will be wasted.

5) Use 64-bit operating systems unless you are running 64-bit applications.

Why? 64-bit virtual machines require more memory overhead than 32-bit ones. Compare the benchmarks of 32/64-bit applications to determine whether it is necessary to use the 64-bit version.

I recently ran a webinar on VMware memory, ‘VMware vSphere Taking a Trip down Memory Lane’. Visit our website and sign up to be part of our community to download it on demand: http://www.metron-athene.com/_downloads/on-demand-webinars/index_2.asp

Jamie Baker

Principal Consultant

Tuesday 29 July 2014

Top 5 Do’s for VMware

I’ve put together a quick list of the Top 5 Do’s and Don’ts for VMware which I hope you’ll find useful.

Today I’m starting with the Top 5 Do’s

1) Select the correct operating system when creating your Virtual Machine.

Why? The operating system type determines the optimal monitor mode and the optimal devices to use, such as the SCSI controller and the network adapter. It also determines the correct version of VMware Tools to install.

2) Install VMware Tools on your Virtual Machine.

Why? VMware Tools installs the Balloon Driver (vmmemctl.sys), which is used for virtual memory reclamation when an ESX host becomes imbalanced on memory usage, alongside optimized drivers. It can also enable Guest to Host Clock Synchronization to prevent Guest clock drift (Windows only).

3) Keep vSwp files in their default location (with VM Disk files).

Why? vSwp files are used to support overcommitted guest virtual memory on an ESX host. When a virtual machine is created, the vSwp file is created and its size is set to the amount of Granted Memory given to the virtual machine. Within a clustered environment, the files should be located within the shared VMFS datastore located on a FC SAN/iSCSI NAS. This is because of vMotion and the ability to migrate VM Worlds between hosts. If the vSwp files are stored on a local (ESX) datastore, then when the associated guest is vMotioned to another host the corresponding vSwp file has to be copied to that host, which can impact performance.

4) Disable any unused Guest CD or USB devices.

Why? Because CPU cycles are being used to maintain these connections and you are effectively wasting these resources.

5) Select a guest operating system that uses fewer “ticks”.

Why? To keep time, most operating systems count periodic timer interrupts or “ticks”. Counting these ticks can be a real-time issue, as ticks may not always be delivered on time, and if a tick is lost, time falls behind. If this happens, ticks are backlogged and the system then delivers ticks faster to catch up. However, you can mitigate these issues by using guest operating systems which use fewer ticks, such as Windows (66Hz to 100Hz) or Linux (250Hz). It is also recommended to use NTP for Guest to Host Clock Synchronization (see KB1006427).

On Thursday I’ll go through the Top 5 Don’ts.

If you want more detailed information on performance and capacity management of VMware, why not visit our website and sign up to be part of our community? Being a community member provides you with free access to my on-demand webinars on VMware: http://www.metron-athene.com/_downloads/index.html

Jamie Baker

Principal Consultant

Monday 21 July 2014

Idle VMs - Why should we care? (3 of 3)

Earlier in the week I looked at the impact idle VMs can have on CPU utilization and memory overhead. Today I’m going to look at the amount of Disk or Datastore space used per Idle VM.

Each one will have associated VMDK (disk) files. The files are stored within a Datastore, which in most cases is hosted on SAN or NAS storage and shared between the cluster host members. If VMDKs are provisioned as "Thick Disks" then the provisioned space is locked out within the Datastore for those disks.

To illustrate this, an example of a least-worst-case scenario would be: 100 idle Windows VMs have been identified across the Virtual Infrastructure and each VM has a single "Thick" VMDK of 20GB used to house the operating system. This equates to 2TB of Datastore space being locked for use by VMs that are idle. You can expand this further by making the assumption that some, if not all, VMs are likely to have more disks of differing sizes.

The simple math will show you how much Datastore space is being wasted.
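
That simple math can be sketched in a few lines of Python. This is only an illustration: the function name and the list-of-lists layout (one inner list of thick VMDK sizes per idle VM) are assumptions made for the example, and the figures are the ones from the scenario above.

```python
def locked_datastore_gb(idle_vm_disks_gb):
    """Sum the provisioned size (GB) of every thick VMDK across the idle VMs.

    idle_vm_disks_gb holds one list of disk sizes per idle VM.
    """
    return sum(sum(disks) for disks in idle_vm_disks_gb)


# The scenario above: 100 idle VMs, each with one 20GB thick disk.
single_disk_vms = [[20]] * 100
total_gb = locked_datastore_gb(single_disk_vms)
print(f"{total_gb}GB locked by idle VMs")

# Hypothetical variation: two idle VMs with extra disks of differing sizes.
mixed_gb = locked_datastore_gb([[20, 50], [40]])
```

Feeding in the real disk sizes of the idle VMs you identify gives a quick estimate of how much Datastore space could be handed back.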

There is a counter to this, known as Thin Provisioning. By using Thin disks, in which the provisioned disk size is reserved but not locked, you would not waste the same amount of space as you would by using Thick Disks. Thin Provisioning also has the added benefit of allowing you to over-allocate disk space, reducing the amount of up-front storage capacity required while incurring only minimal overhead.

Idle VMs - why you should care.

Identifying Idle VMs, questioning whether they are required, finding out who owns them and removing them completely will reduce or help eliminate VM sprawl and help to improve the performance and capacity of the Virtual Infrastructure by:

· reducing unnecessary timer interrupts

· reducing allocated vCPUs

· reducing unnecessary CPU and Memory overhead

· reducing used Datastore space

· leading to more efficient use of your Virtual Infrastructure, including improved VM to Host ratios and reduction in additional hardware

So start looking at yours now.

I'll be discussing VMware Memory at my webinar this Wednesday. It's free to attend, so come along.

http://www.metron-athene.com/services/training/webinars/index.html

Jamie Baker

Principal Consultant

Friday 18 July 2014

Idle VMs - Why should we care? (2 of 3)

In Wednesday's blog I mentioned the term VM Sprawl and this is where Idle VMs are likely to factor.

Often VMs are provisioned to support short term projects, for development/test processes or for applications which have now been decommissioned. Now idle, they’re left alone, not bothering anyone and therefore not on the Capacity and Performance teams’ radar.

Which brings us back to the question. Idle VMs - Why should we care?

We should care, for a number of reasons but let's start with the impact on CPU utilization.

When VMs are powered on and running, timer interrupts have to be delivered from the host CPU to the VM. The total number of timer interrupts being delivered depends on the following factors:

· VMs running symmetric multiprocessing (SMP) hardware abstraction layers (HALs)/kernels require more timer interrupts than those running Uniprocessor HALs/kernels.

· How many virtual CPUs (vCPUs) the VM has.

Delivering many virtual timer interrupts can negatively impact the performance of the VM and can also increase host CPU consumption. This can be mitigated, however, by reducing the number of vCPUs, which reduces both the timer interrupts and the amount of co-scheduling overhead (check CPU Ready Time).
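
To get a feel for the numbers, the sketch below uses a deliberately rough model in which every vCPU receives timer interrupts at its guest's tick rate. That model, the function name and the example VMs with their tick rates are all assumptions made for illustration, and it ignores the extra interrupts that SMP HALs/kernels can require.

```python
def timer_interrupts_per_second(vms):
    """Sum vCPUs x tick rate (Hz) over an iterable of (vcpus, tick_hz) pairs."""
    return sum(vcpus * tick_hz for vcpus, tick_hz in vms)


# Hypothetical idle VMs: (vCPUs, guest tick rate in Hz).
idle_vms = [(1, 100), (2, 100), (4, 250)]
before = timer_interrupts_per_second(idle_vms)

# The same VMs resized to a single vCPU each.
resized = [(1, tick_hz) for _, tick_hz in idle_vms]
after = timer_interrupts_per_second(resized)

print(f"{before} interrupts/s before, {after} after resizing")
```

Even in this simplified form, the estimate shows how quickly the interrupt load from idle, multi-vCPU VMs adds up across a host.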

Then there's the Memory management of Idle VMs. Each powered on VM incurs Memory Overhead. The Memory Overhead includes space reserved for the VM frame buffer and various virtualization data structures, such as Shadow Page Tables (using Software Virtualization) or Nested Page Tables (using Hardware Virtualization). This also depends on the number of vCPUs and the configured memory granted to the VM.

We’ll have a look at a few more reasons to care on Monday. In the meantime sign up and come along to my VMware Memory webinar next week http://www.metron-athene.com/services/training/webinars/index.html

Jamie Baker

Principal Consultant

Wednesday 16 July 2014

Idle VMs - Why should we care? (1 of 3)

The re-emergence of Virtualization technologies, such as VMware, Microsoft's Hyper-V, Xen and Linux KVM has provided organizations with the tools to create new operating system platforms ready to support the services required by the business, in minutes rather than days.

Indeed IT itself is a service to the business.

In more recent times, Cloud computing, which is itself underpinned by Virtualization, makes use of the functionality provided to satisfy:

· on-demand resources

· the ability to provision faster

· rapid elasticity (refer to NIST's description of Cloud Computing)

Cloud computing makes full use of the underlying clustered hardware. Constant strides are being made by Virtualization vendors to improve the Virtual Machine (VM) to Host ratio, without affecting the underlying performance.

But, you may ask "What's this got to do with Idle VMs?"

Well, as I described earlier Virtualization provides the means to easily and quickly provision virtual systems. Your CTO/CIO is going to demand a significant ROI once an investment in both the hardware and virtualization software has been made, possibly switching the focus to an increase in the VM to Host ratio.

“What's wrong with that?” I hear you say. Nothing at all, as long as you keep track of what VMs you are provisioning and:

· what resources you have granted

· what they are for

Failure to do so will mean that your quest for a good ROI and a satisfied Chief will be in jeopardy, as you’ll encounter a term most commonly known as VM Sprawl.

More about this on Friday. In the meantime why not register for my webinar 'Taking a trip down VMware vSphere Memory Lane'

http://www.metron-athene.com/services/training/webinars/index.html

Jamie Baker

Principal Consultant

Monday 14 July 2014

Trending Clusters - 5 Top Performance and Capacity Concerns for VMware

I tend to trend on Clusters the most.

VMs and Resource Pools have soft limits so they are the easiest and quickest to change.

Want to know when you’ll run out of capacity?

– The hardware is the limit

– Trend hardware utilization

The graph below shows a trend on 5 minute data for average CPU and shows a nice flat trend.

If I take the same data and trend on the peak hour then I see a difference.

You can see that the trend has a steady increase, the peaks are getting larger.

When trending, ensure that you trend on the peaks, as these are what you need to cope with; this is what delivers immediate value.
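
The difference between trending on the average and trending on the peak hour can be sketched in Python. This is a minimal illustration, not the method used to produce the graphs: the data is synthetic, the function names are hypothetical, and a plain least-squares slope stands in for whatever trending algorithm your tooling provides.

```python
def linear_slope(ys):
    """Least-squares slope of ys plotted against 0, 1, 2, ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two points to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


def daily_average_and_peak_hour(days, samples_per_hour=12):
    """Return (daily averages, daily peak-hour averages) from 5 minute samples."""
    averages, peaks = [], []
    for day in days:
        averages.append(sum(day) / len(day))
        hours = [day[i:i + samples_per_hour]
                 for i in range(0, len(day), samples_per_hour)]
        peaks.append(max(sum(hour) / len(hour) for hour in hours))
    return averages, peaks


def synthetic_day(day_index, samples_per_hour=12):
    """Made-up data: 23 quiet hours at 20% CPU, one busy hour growing 5 points a day."""
    quiet = [20.0] * (23 * samples_per_hour)
    busy = [40.0 + 5.0 * day_index] * samples_per_hour
    return quiet + busy


days = [synthetic_day(d) for d in range(7)]
averages, peaks = daily_average_and_peak_hour(days)
avg_slope = linear_slope(averages)
peak_slope = linear_slope(peaks)
print(f"average trend {avg_slope:.2f} points/day, peak trend {peak_slope:.2f} points/day")
```

With this made-up data the average barely moves while the peak hour climbs steadily, which is the same pattern described above: a trend on the average looks flat even though the peaks are getting larger.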

Next let’s look at aggregating the data. Previously we looked at Ready Time and, as I said, Ready Time is accumulated against a virtual machine, but you can aggregate this data to see what is going on in the Cluster as a whole.

In the example below CPU utilization is not that busy but there is a steady increase in Ready Time.

The dynamic may be changing: new VMs that are being created have more CPUs, which could eventually cause a problem.

That concludes my series and I hope that you'll sign up and come along to our free VMware memory webinar on July 23rd
http://www.metron-athene.com/services/training/webinars/index.html

Phil Bell

Consultant

Friday 11 July 2014

Storage Latency - Top 5 Performance and Capacity concerns for VMware

As mentioned on Wednesday, if it isn’t memory that people are talking to us about then it is storage.

It's not advisable to look at KB per second or I/O at the OS level because, as previously outlined, time slicing can skew these figures, so it is better to look at them from VMware.

In terms of latency there is a lot more detail in VMware, where you can look at latency from the device and from the kernel.

Kernel

The graph below looks at the individual CPUs on the hosts.

You might ask why I am looking at CPU. If there is a spike in latency on the kernel side it is worthwhile to look at CPU on processor 0. ESX runs certain processes on Processor 0 and this can have an impact on anything going through that kernel.

Latency

With Latency, to get a good picture, it is worthwhile to look at:

– Device latency

– Kernel latency

– Total latency

These are shown together on the graph below.
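
As a small illustration of how the three figures relate, the Python sketch below assumes that total latency is simply device latency plus kernel latency, and reports how much of the total is spent in the kernel. The function name and the sample values are hypothetical.

```python
def latency_breakdown(device_ms: float, kernel_ms: float) -> dict:
    """Combine device and kernel latency and report the kernel's share.

    Assumes total latency = device latency + kernel latency.
    """
    total_ms = device_ms + kernel_ms
    kernel_share = kernel_ms / total_ms if total_ms else 0.0
    return {"total_ms": total_ms, "kernel_share": kernel_share}


# Hypothetical sample: 8 ms at the device and 2 ms in the kernel.
sample = latency_breakdown(8.0, 2.0)
print(f"total {sample['total_ms']:.1f} ms, kernel share {sample['kernel_share']:.0%}")
```

A rising kernel share is the cue to go and look at CPU on processor 0, as described in the Kernel section above.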

I'll be concluding my series on Monday with a look at Trending Clusters. In the meantime there's still time to register and come along to Jamie's webinar on VMware memory on July 23rd http://www.metron-athene.com/services/training/webinars/index.html

Phil Bell

Consultant

Wednesday 9 July 2014

Cluster memory - Top 5 Performance and Capacity concerns for VMware

Continuing with VMware memory the next place to look at for memory issues is at the Cluster.

It is useful to look at:

– Average memory usage of total memory available

– Average amount of memory used by memory control

– Average memory shared across the VMs

– Average swap space in use

In the graph below we can see that when the shared memory drops the individual memory usage increases.

In addition to that swapping and memory control increased at the same time.

If it isn’t memory that people are talking to us about then it is storage. I'll be looking at number 4 of the Top 5 performance and capacity concerns for VMware on Friday when we take a look at Storage latency.

Phil Bell

Consultant

Monday 7 July 2014

Monitoring Memory - 5 Top Performance and Capacity Concerns for VMware

Memory still seems to be the item that prompts most upgrades, with VMs running out of memory before running out of vCPU.

It’s not just a question of how much of it is being used as there are different ways of monitoring it. Some of the things that you are going to need to consider are:

– Reservations

– Limits

– Ballooning

– Shared Pages

– Active Memory

– Memory Available for VMs

VM Memory Occupancy

In terms of occupancy the sorts of things that you will want to look at are:

– Average Memory overhead

– Average Memory used by the VM (active memory)

– Average Memory shared

– Average amount of host memory consumed by the VM

– Average memory granted to the VM

In the instance below we can see that the pink area is active memory and we can note that the average amount of host memory used by this VM increases at certain points in the chart.

VM Memory Performance

It is useful to produce a performance graph for memory where you can compare:

– Average memory reclaimed

– Average memory swapped

– Memory limit

– Memory reservation

– Average amount of host memory consumed.

An illustration of this is shown below.

In this instance we can see that this particular VM had around 2.5GB of memory ‘stolen’ from it by the balloon driver (vmmemctl). At the same time swapping was occurring, and this could cause performance problems.

On Wednesday I'll be discussing cluster memory and don't forget to sign up for Jamie's free webinar on VMware memory Taking a Trip down VMware vSphere Memory Lane http://www.metron-athene.com/services/training/webinars/index.html

Phil Bell

Consultant

Friday 4 July 2014

Ready Time - 5 Top Performance and Capacity Concerns for VMware

On Wednesday we looked at the 3 states which a VM can be in.

Threads – being processed and allocated to a thread.

Ready – in a ready state where they wish to process but aren’t able to.

Idle – where they exist but don’t need to be doing anything at this time.

It makes sense to reduce contention by having as few vCPUs as possible in each VM.

In the diagram below you can see that work has moved over the threads to be processed and there is some available headroom. Work that is waiting to be processed requires 2 CPUs, so it is unable to fit, creating wasted space that we are unable to use at this time.

We need to remove a VM before we can put a 2 CPU VM on to a thread and remain 100% busy.

In the meantime other VMs are coming along and we now have a 4 vCPU VM accumulating Ready Time.

2 VMs move off but the waiting 4 vCPU VM cannot move on as there are not enough threads available.

It has to wait and other work moves ahead of it to process.

Even when 3 threads are available it is still unable to process and will be ‘queue jumped’ by other VMs that require fewer vCPUs.

Hopefully that is a clear illustration of why it makes sense to reduce contention by having as few vCPUs as possible in each VM.

Ready Time impacts on performance and needs to be monitored. On Monday I'll be looking at monitoring memory - have a safe July 4th.

Phil Bell

Consultant

Wednesday 2 July 2014

Ready Time - Top 5 performance & capacity concerns for VMware

Imagine you are driving a car, and you are stationary, there could be several reasons for this. You may be waiting to pick someone up, you may have stopped to take a phone call, or it might be that you have stopped at a red light. The first two of these (pick up, phone) you have decided to stop the car to perform a task. In the third instance the red light is stopping you doing something you want to do. In fact you spend the whole time at the red light ready to move away as soon as the light turns to green. That time is ready time.

When a VM wants to use the processor but is stopped from doing so, it accumulates ready time, and this has a direct impact on performance.

For any processing to happen all the vCPUs assigned to the VM must be running at the same time. This means if you have a 4 vCPU VM, all 4 vCPUs need available cores or hyperthreads to run. So the fewer vCPUs a VM has, the more likely it is to be able to get onto the processors.
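
A toy probability model makes the "more likely" part concrete. The Python sketch below assumes that each thread on the host is busy independently with the same probability, which is a simplification rather than how the ESX scheduler really behaves, and then works out the chance that enough threads are free at one instant for a VM of a given size. The function name and the 4 thread, 50% busy host are made up for the example.

```python
from math import comb


def chance_of_free_threads(needed: int, threads: int, busy_prob: float) -> float:
    """Probability that at least `needed` of `threads` are free at one instant.

    Assumes each thread is busy independently with probability busy_prob.
    """
    if not 0 <= needed <= threads:
        raise ValueError("needed must be between 0 and threads")
    if not 0.0 <= busy_prob <= 1.0:
        raise ValueError("busy_prob must be between 0 and 1")
    free_prob = 1.0 - busy_prob
    return sum(
        comb(threads, free) * free_prob ** free * busy_prob ** (threads - free)
        for free in range(needed, threads + 1)
    )


# Hypothetical host: 4 threads, each busy half of the time.
for vcpus in (1, 2, 4):
    chance = chance_of_free_threads(vcpus, threads=4, busy_prob=0.5)
    print(f"{vcpus} vCPU VM can be scheduled {chance:.1%} of the time")
```

Under these assumptions the chance of finding enough free threads falls away sharply as the vCPU count rises, which is the effect described above.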

To avoid Ready Time

You can reduce contention by having as few vCPUs as possible in each VM. If you monitor CPU Threads, vCPUs and Ready Time you’ll be able to see if there is a correlation between increasing vCPU numbers and Ready Time in your systems.

Proportion of Time: 4 vCPU VM

Below is an example of a 4 vCPU VM, with each vCPU doing about 500 seconds’ worth of real CPU time and about 1,000 seconds’ worth of Ready Time.

For every 1 second of processing the VM is waiting around 2 seconds to process, so it’s spending around twice as long waiting to process as it is processing. This is going to impact the performance experienced by the end user who is reliant on this VM.
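
The proportions are easy to check with a couple of lines of Python. The function names are hypothetical, and the inputs are the approximate figures from the 4 vCPU example: around 500 seconds of CPU time and around 1,000 seconds of Ready Time.

```python
def ready_per_cpu_second(cpu_s: float, ready_s: float) -> float:
    """Seconds spent waiting (ready) for every second spent processing."""
    if cpu_s <= 0:
        raise ValueError("cpu_s must be positive")
    return ready_s / cpu_s


def ready_share(cpu_s: float, ready_s: float) -> float:
    """Fraction of the combined CPU + ready time that was spent waiting."""
    total = cpu_s + ready_s
    if total <= 0:
        raise ValueError("cpu_s + ready_s must be positive")
    return ready_s / total


wait_ratio = ready_per_cpu_second(500.0, 1000.0)
wait_share = ready_share(500.0, 1000.0)
print(f"{wait_ratio:.1f}s waiting per second of work, {wait_share:.0%} of the time spent waiting")
```

Running the same calculation for each VM is a quick way to rank which ones are suffering most from Ready Time.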

Now let’s compare that to the proportion of time spent processing on a 2 vCPU VM. The graph below shows a 2 vCPU VM doing the same amount of work, around 500 seconds worth of real CPU time and as you can see the Ready Time is significantly less.

There are 3 states which the VM can be in:

Threads – being processed and allocated to a thread.

Ready – in a ready state where they wish to process but aren’t able to.

Idle – where they exist but don’t need to be doing anything at this time.

On Friday I'll look at how those threads process. In the meantime, sign up for our VMware capacity & performance workshop http://www.metron-athene.com/services/training/online-workshops/index.html

Phil Bell

Consultant