Monday 22 June 2015

Top 5 Don’ts for VMware

As promised, today I’ll be dealing with the Top 5 Don’ts for VMware.

DON’T

1)       Overcommit CPU  (unless ESX Host usage is less than 50%)

I’m sure that most of you have heard of CPU Ready Time. CPU Ready Time is the time (in milliseconds) that a guest’s vCPUs spend waiting to run on the ESX host’s physical CPUs.  This wait can occur due to the co-scheduling constraints of multi-vCPU guests and the higher CPU scheduling demand that results from overcommitting guest vCPUs against pCPUs.  If the ESX hosts in your environment have low average CPU usage, however, overcommitting vCPUs to pCPUs is unlikely to cause any significant rise in CPU Ready Time or any impact on guest performance.
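As a quick back-of-the-envelope check (the function and figures below are my own illustration, not from this post), you can compare the total vCPUs assigned on a host against its physical cores:

```python
# Sketch: estimate a host's vCPU-to-pCPU overcommit ratio from inventory data.
# Inputs (per-guest vCPU counts, physical core count) are illustrative values.

def overcommit_ratio(guest_vcpus, physical_cores):
    """Return the ratio of total assigned vCPUs to physical cores."""
    return sum(guest_vcpus) / physical_cores

# Example: ten 2-vCPU guests on a 16-core host.
print(f"{overcommit_ratio([2] * 10, 16):.2f}:1")  # 1.25:1
```

A ratio above 1:1 means you are overcommitted; whether that matters depends on how busy the host actually is, as noted above.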

2)       Overcommit virtual memory to the point of heavy memory reclamation on the ESX host. 
Memory over-commitment is supported within your vSphere environment by a combination of Transparent Page Sharing, memory reclamation (ballooning and memory compression) and vSwp files (swapping).  Memory reclamation incurs some memory management overhead and, if DRS is enabled in automatic mode, an increase in the number of vMotion migrations.  Performance can degrade at this point due to the increased overhead required to manage these operations.

3)       Set CPU or Memory limits (unless absolutely necessary). 
Do you really need to apply a restriction on usage to a guest or set of guests in a Resource Pool?  By limiting usage, you may unwittingly restrict the performance of a guest.  In addition, maintaining these limits incurs overhead, especially for memory, where the limits are enforced by memory reclamation.  A better approach is to perform some proactive monitoring to identify usage patterns and peaks, then adjust the amount of CPU (MHz) and memory (MB) allocated to your guest virtual machine.  Where necessary, guarantee resources by applying reservations.

4)       Use vSMP virtual machines when running single-threaded workloads. 
vSMP virtual machines have more than one vCPU assigned.  A single-threaded workload running on your guest cannot take advantage of those “extra” execution threads, so the extra CPU cycles used to schedule those idle vCPUs are wasted.

5)       Use 64-bit operating systems unless you are running 64-bit applications. 
Whilst 64-bit operating systems are near enough the norm these days, do check that you need 64-bit, as it carries more memory overhead than 32-bit.  Compare benchmarks of the 32-bit and 64-bit versions of an application to determine whether the 64-bit version is necessary.

I'm running a webinar on VMware memory, ‘Taking a Trip down vSphere Memory Lane’, this Wednesday. Visit our website and sign up to come along:
http://www.metron-athene.com/services/webinars/index.html

Jamie Baker

Principal Consultant

Friday 19 June 2015

Top 5 Do’s for VMware

I’ve put together a quick list of the Top 5 Do’s and Don’ts for VMware which I hope you’ll find useful.

Today I’m starting with the Top 5 Do’s

DO

1)   Select the correct operating system when creating your Virtual Machine.  Why?  The operating system type determines the optimal monitor mode and the optimal virtual devices to use, such as the SCSI controller and the network adapter.  It also specifies the correct version of VMware Tools to install.
 
2)   Install VMware Tools on your Virtual Machine.  Why?  VMware Tools installs the balloon driver (vmmemctl.sys), which is used for virtual memory reclamation when an ESX host comes under memory pressure, along with optimized device drivers.  It can also enable guest-to-host clock synchronization to prevent guest clock drift (Windows only).



3)   Keep vSwp files in their default location (with VM Disk files).  Why?  vSwp files are used to support overcommitted guest virtual memory on an ESX host.  When a virtual machine is powered on, its vSwp file is created, sized to the virtual machine's configured memory minus any memory reservation.  Within a clustered environment, the files should be located on a shared VMFS datastore (FC SAN, iSCSI SAN or NAS) because of vMotion and the ability to migrate VM worlds between hosts.  If the vSwp files were stored on a local (ESX) datastore, the corresponding vSwp file would have to be copied to the destination host whenever the associated guest is vMotioned, which can impact performance.
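The sizing rule can be sketched as follows (a hypothetical helper for illustration, not a VMware API):

```python
def vswp_size_mb(configured_mb, reservation_mb=0):
    """Default .vswp size: the guest's configured memory minus any reservation."""
    return configured_mb - reservation_mb

# A 4 GB guest with a 1 GB memory reservation gets a 3 GB vSwp file.
print(vswp_size_mb(4096, 1024))  # 3072
```

This is also why reservations reduce the shared-datastore space consumed by swap files.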


4)   Disable any unused Guest CD or USB devices.  Why?  CPU cycles are used to maintain these connections, so leaving them connected effectively wastes resources.


5)   Select a guest operating system that uses fewer “ticks”.  Why?  To keep time, most operating systems count periodic timer interrupts, or “ticks”.  Counting these ticks can be a real-time issue because ticks may not always be delivered on time; if a tick is lost, time falls behind, ticks are backlogged, and the system then delivers ticks faster to catch up.  You can mitigate these issues by using guest operating systems that use fewer ticks, such as Windows (66Hz to 100Hz) or Linux (250Hz).  It is also recommended to use NTP for guest-to-host clock synchronization (see VMware KB 1006427).

On Monday I’ll go through the Top 5 Don'ts.
If you want more detailed information on performance and capacity management of VMware why not visit our website and sign up to be part of our community? Being a community member provides you with free access to our library of white papers and on-demand webinars. http://www.metron-athene.com/_resources/index.html
Jamie Baker
Principal Consultant

Wednesday 17 June 2015

Top 10 VMware Metrics to help pinpoint bottlenecks

Here is my Top 10 VMware metrics list to assist you in pinpointing performance bottlenecks within your VMware vSphere virtual infrastructure.  I hope you find these useful. 
CPU
1. Ave CPU Usage in MHz - this metric should be reported at both host and guest levels.  Because a guest (VM) has to run on an ESX host, and that ESX host has a finite amount of resource, high CPU usage at the host level could indicate a bottleneck; break the figure down across all hosted guests to get a clear indication of who is using the most.  If you have enabled DRS on your cluster, you may see a rise in the number of vMotions as DRS attempts to load balance.
2. CPU Ready Time - this is an important metric that gives a clear indication of CPU overcommitment within your VMware virtual infrastructure.  CPU overcommitment can lead to significant CPU performance problems due to the way in which the ESX CPU scheduler places virtual CPU (vCPU) work onto physical CPUs (pCPUs).  This is reported at the guest level.  Values reported in whole seconds can indicate that you have provisioned too many vCPUs for this guest.  Compare the total vCPUs assigned to all hosted guests against the number of physical CPUs available on the host(s) to see whether you have overcommitted the CPU.
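To make raw Ready Time (ms) comparable across guests, it is common to express it as a percentage of the sample interval.  A minimal sketch (the helper name and the 20-second vCenter real-time interval are my assumptions):

```python
def ready_percent(ready_ms, interval_s=20, num_vcpus=1):
    """Convert summed CPU Ready (ms) in one sample interval to a per-vCPU percentage."""
    return (ready_ms / (interval_s * 1000 * num_vcpus)) * 100

# 2,000 ms of Ready Time in a 20 s sample on a 1-vCPU guest:
print(ready_percent(2000))  # 10.0
```

A common rule of thumb treats sustained values above roughly 5% per vCPU as a warning sign, though your own baselines should take precedence.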
Memory
3. Ave Memory Usage in KB - similar to Average CPU Usage, this should be reported at both host and guest levels.  It gives an indication of who is using the most memory, but high usage does not necessarily indicate a bottleneck.  If memory usage is high, look at the values reported for memory ballooning and swapping.
4. Balloon KB - values reported for the balloon indicate that the host cannot meet its memory requirements and are an early warning sign of memory pressure on the host.  The balloon driver is installed via VMware Tools onto Windows and Linux guests, and its job is to force the operating systems of lightly used guests to page out unused memory back to ESX so that it can satisfy the demand from hungrier guests.
5. Swap Used KB - if you see values reported at the host for swap, this indicates that memory demands cannot be satisfied and guest memory is being swapped out to the vSwp file.  This is bad.  Guests may have to be migrated to other hosts, or more memory added to this host, to satisfy the memory demands of the guests.
6. Consumed - Consumed memory is the amount of Memory Granted on a Host to its guests minus the amount of Memory Shared across them.  Memory can be over-allocated, unlike CPU, by sharing common memory pages such as Operating System pages.  This metric displays how much Host Physical Memory is actually being used (or consumed) and includes usage values for the Service Console and VMkernel.
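The Consumed arithmetic can be sketched like this (hypothetical helper, illustrative values):

```python
def consumed_kb(granted_kb, shared_savings_kb):
    """Host memory consumed: memory granted to guests minus page-sharing savings."""
    return granted_kb - shared_savings_kb

# Two guests granted 4 GB each, sharing 1 GB of common OS pages:
print(consumed_kb(2 * 4 * 1024 * 1024, 1024 * 1024))  # 7340032 KB, i.e. 7 GB
```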
7. Active - this metric reports the amount of physical memory recently used by the guests on the Host and is displayed as “Guest Memory Usage” in vCenter at Guest level.

Disk I/O
8. Queue Latency - this metric measures the average amount of time taken per SCSI command in the VMkernel queue.  This value should always be zero; anything higher indicates that the workload is too high and the storage array cannot process the data fast enough.
9. Kernel Latency - this metric measures the average amount of time, in milliseconds, that the VMkernel spends processing each SCSI command.  For best performance, the value should be between 0 and 1 ms.  If the value is greater than 4 ms, the virtual machines on the host are trying to send more throughput to the storage system than the configuration supports.  If this is the case, check CPU usage and consider increasing the queue depth or adding storage capacity.
10. Device Latency - this metric measures the average amount of time, in milliseconds, to complete a SCSI command from the physical device.  Depending on your hardware, a number greater than 15 ms indicates there are probably problems with the storage array.  If this is the case, move the active VMDK to a volume with more spindles or add more disks to the LUN.
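The three latency thresholds above can be folded into a single checker.  This is a sketch built on the rule-of-thumb values from this list (the function and message strings are my own):

```python
def assess_disk_latency(queue_ms, kernel_ms, device_ms):
    """Flag average SCSI latencies against the rule-of-thumb thresholds above."""
    findings = []
    if queue_ms > 0:
        findings.append("queue latency > 0 ms: array cannot process commands fast enough")
    if kernel_ms > 4:
        findings.append("kernel latency > 4 ms: more throughput than the configuration supports")
    if device_ms > 15:
        findings.append("device latency > 15 ms: probable storage array problem")
    return findings or ["latency within expected ranges"]

print(assess_disk_latency(0, 0.8, 5))  # ['latency within expected ranges']
print(assess_disk_latency(2, 6, 20))   # three findings, one per threshold breached
```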
Note:  Please be aware, when reporting usage values, that you should take into consideration any child resource pools configured with CPU/memory limits and report accordingly. 
I'm running a webinar, 'Taking a trip down vSphere Memory Lane', on June 24. Don't forget to register and come along: http://www.metron-athene.com/services/webinars/index.html
Jamie Baker
Principal Consultant 

Monday 15 June 2015

5 Top Performance and Capacity Concerns for VMware - Trending Clusters

I tend to trend on Clusters the most.

VMs and Resource Pools have soft limits so they are the easiest and quickest to change.
Want to know when you’ll run out of capacity?
       The hardware is the limit
       Trend hardware utilization

The graph below trends 5 minute data for average CPU and shows a nice flat line.


If I take the same data and trend on the peak hour then I see a difference.


You can see that the trend shows a steady increase: the peaks are getting larger.

When trending, ensure that you trend against the peaks, as these are what you need to cope with.
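One way to see the difference between a flat average and rising peaks is to fit a simple least-squares slope to each series (the data below is illustrative, not the values behind the graphs above):

```python
def linear_trend(samples):
    """Ordinary least-squares slope and intercept for evenly spaced samples."""
    n = len(samples)
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    den = sum((x - mean_x) ** 2 for x in range(n))
    slope = num / den
    return slope, mean_y - slope * mean_x

avg_cpu = [40, 41, 40, 42, 41, 40]    # 5-minute average: roughly flat
peak_cpu = [55, 58, 62, 66, 71, 75]   # peak hour: climbing steadily
print(linear_trend(avg_cpu)[0])   # slope near zero
print(linear_trend(peak_cpu)[0])  # clearly positive slope
```

Projecting the peak-hour slope forward against the hardware limit gives an estimate of when capacity runs out.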

Next let’s look at aggregating the data. Previously we looked at Ready Time and as I said Ready Time is accumulated against a virtual machine but you can aggregate this data to see what is going on in the Cluster as a whole.

In the example below CPU utilization is not that busy but there is a steady increase in Ready Time. 


The dynamic may be changing: new VMs that are being created have more vCPUs, which could eventually cause a problem.

I hope the series has been helpful and I conclude with a last reminder to register for Jamie's webinar 'Taking a trip down vSphere Memory Lane' 
http://www.metron-athene.com/services/webinars/index.html

Phil Bell
Consultant



Friday 12 June 2015

5 Top Performance and Capacity Concerns for VMware - Storage Latency

If it isn’t memory that people are talking to us about then it's storage.

Again, it's not advisable to look at KB per second or I/O at the OS level.  As previously outlined, time slicing can skew these figures, so it is better to take them from VMware.

In terms of latency there is a lot more detail in VMware.  You can look at latency from the device and from the kernel.

Kernel

The graph below looks at the individual CPUs on the hosts. 


You might ask why I'm looking at CPU.  If there is a spike in latency on the kernel side, it is worthwhile looking at CPU on processor 0.  ESX runs certain processes on processor 0, and this can have an impact on anything going through the kernel.

Latency

It is worthwhile to look at:
       Device latency
       Kernel latency
       Total latency

Shown on the graph below. 

On Monday I’ll finish my series by looking at Trending Clusters.

Phil Bell
Consultant


Wednesday 10 June 2015

5 Top Performance and Capacity Concerns for VMware - Monitoring Memory

Memory still seems to be the item that prompts most upgrades, with VMs running out of memory before running out of vCPU.

It’s not just a question of how much of it is being used as there are different ways of monitoring it. Some of the things that you are going to need to consider are:
       Reservations
       Limits
       Ballooning
       Shared Pages
       Active Memory
       Memory Available for VMs

VM Memory Occupancy

In terms of occupancy the sorts of things that you will want to look at are:
       Average Memory overhead
       Average Memory used by the VM (active memory)
       Average Memory shared
       Average amount of host memory consumed by the VM
       Average memory granted to the VM


In this instance we can see that the pink area is active memory and we can note that the average amount of host memory used by this VM increases at certain points in the chart.

VM Memory Performance
It is useful to produce a performance graph for memory where you can compare:
       Average memory reclaimed
       Average memory swapped
       Memory limit
       Memory reservation
       Average amount of host memory consumed.
As illustrated below.

In this instance we can see that this particular VM had around 2.5GB of memory ‘stolen’ from it by the balloon driver (vmmemctl); at the same time swapping was occurring, which could cause performance problems.

Cluster Memory

The next place to look at for memory issues is at the Cluster.

It is useful to look at:
       Average memory usage of total memory available
       Average amount of memory used by memory control
       Average memory shared across the VMs
       Average swap space in use

In the graph below we can see that when the shared memory drops the individual memory usage increases. 


In addition to that swapping and memory control increased at the same time.

On Friday join me as I discuss storage latency. In the meantime we've got some great white papers and webinars on VMware Capacity Management - join our Community and get free access to them http://www.metron-athene.com/_resources/index.html

Phil Bell
Consultant


Monday 8 June 2015

Top 5 Performance and Capacity Concerns for VMware - Time Slicing and Ready Time

The effect we saw between the OS and VMware, in my blog on Friday, is caused by time slicing.  

In a typical VMware host we have more vCPUs assigned to VMs than we do physical cores. The processing time of the cores has to be shared among the vCPUs. Cores are shared between vCPUs in time slices, 1 vCPU to 1 core at any point in time.



More vCPUs lead to more time slicing.  The more vCPUs we have, the less time each can be on a core, and therefore the slower time passes for that VM.  To keep the VM in time, extra timer interrupts are sent in quick succession.  So time passes slowly and then very fast.


More time slicing equals less accurate data from the OS. 

Anything that doesn’t relate to time, such as disk occupancy, should be OK to use.


Ready Time
Imagine you are driving a car, and you are stationary, there could be several reasons for this.  You may be waiting to pick someone up, you may have stopped to take a phone call, or it might be that you have stopped at a red light.  The first two of these (pick up, phone) you have decided to stop the car to perform a task.  In the third instance the red light is stopping you doing something you want to do.  In fact you spend the whole time at the red light ready to move away as soon as the light turns to green.  That time is ready time.
When a VM wants to use the processor but is stopped from doing so, it accumulates Ready Time, and this has a direct impact on performance.
For any processing to happen, all the vCPUs assigned to the VM must be running at the same time.  This means that a 4-vCPU VM needs 4 available cores or hyperthreads to run.  So the fewer vCPUs a VM has, the more likely it is to get onto the processors. 
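That co-scheduling constraint boils down to a simple test.  This is a deliberately simplified sketch (real ESX uses relaxed co-scheduling, which is less strict than this):

```python
def can_schedule(free_threads, vm_vcpus):
    """A strictly co-scheduled VM runs only if a free core/thread exists per vCPU."""
    return free_threads >= vm_vcpus

print(can_schedule(3, 4))  # False: the 4-vCPU VM waits, accruing Ready Time
print(can_schedule(3, 2))  # True: the 2-vCPU VM fits and runs
```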

To avoid Ready Time
You can reduce contention by having as few vCPUs as possible in each VM.  If you monitor CPU Threads, vCPUs and Ready Time you’ll be able to see if there is a correlation between increasing vCPU numbers and Ready Time in your systems.

Proportion of Time: 4 vCPU VM

Below is an example of a 4-vCPU VM, each vCPU doing about 500 seconds’ worth of real CPU time and about 1,000 seconds’ worth of Ready Time.

For every 1 second of processing the VM is waiting around 2 seconds to process, so it spends almost twice as long waiting to process as it does processing.  This is going to impact the performance experienced by any end user who is reliant on this VM.
Now let’s compare that to the proportion of time spent processing on a 2 vCPU VM. The graph below shows a 2 vCPU VM doing the same amount of work, around 500 seconds worth of real CPU time and as you can see the Ready Time is significantly less.
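Putting numbers on the 4-vCPU example (a trivial helper of my own, using the approximate figures above):

```python
def ready_run_ratio(ready_s, cpu_s):
    """Seconds spent waiting (Ready) per second of actual processing."""
    return ready_s / cpu_s

# ~1,000 s Ready Time against ~500 s real CPU time, as in the 4-vCPU chart:
print(ready_run_ratio(1000, 500))  # 2.0 -> waiting twice as long as processing
```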

There are 3 states which the VM can be in:


Threads – being processed and allocated to a thread.
Ready – in a ready state where they wish to process but aren’t able to.
Idle – where they exist but don’t need to be doing anything at this time.

In the diagram below you can see that work has moved over to the threads to be processed and there is some available headroom.  Work that is waiting to be processed requires 2 vCPUs, so it is unable to fit, creating wasted space that we are unable to use at this time.


We need a VM to move off before we can put the 2-vCPU VM onto the threads and remain 100% busy.


In the meantime other VMs are coming along, and we now have a 4-vCPU VM accumulating Ready Time.



2 VMs move off, but the waiting 4-vCPU VM cannot move on as there are not enough threads available.


It has to wait and other work moves ahead of it to process.




Even when 3 threads are available it is still unable to process, and will be ‘queue jumped’ by other VMs that require fewer vCPUs.


Hopefully that is a clear illustration of why it makes sense to reduce contention by having as few vCPUs as possible in each VM.
Ready Time impacts performance and needs to be monitored.

On Wednesday I'll be looking at Memory performance, in the meantime don't forget to register for our 'Taking a trip down vSphere Memory Lane' webinar taking place on June 24th
http://www.metron-athene.com/services/webinars/index.html

Phil Bell
Consultant

Friday 5 June 2015

5 Top Performance and Capacity Concerns for VMware

Jamie will be hosting our webinar, ‘Taking a Trip down VMware vSphere Memory Lane’, on June 24th (http://www.metron-athene.com/services/webinars/index.html), so I thought it would be pertinent to take a look at the Top 5 Performance and Capacity Concerns for VMware in my blog series.

I’ll begin with Dangers with OS Metrics.

Almost every time we discuss data capture for VMware, we’ll be asked by someone if we can capture the utilization of specific VMs by monitoring the OS.  The simple answer is no.
In the example below, the operating system sees that VM1 is busy 50% of the time, but VMware sees that VM1 was only scheduled on the core for half of that time and accordingly reports it as 25% busy.
Looking at the second VM running, VM2, the operating system and VMware agree, and both report it as 50% busy.
This is a good example of the disparity that can sometimes occur.
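The arithmetic behind that disparity is simply multiplicative (an illustrative sketch of my own, not a VMware formula):

```python
def host_view_busy(os_busy_fraction, scheduled_fraction):
    """Physical-core utilization = OS-reported busy time x share of real time scheduled."""
    return os_busy_fraction * scheduled_fraction

# VM1 believes it is 50% busy but was only on the core half the time:
print(host_view_busy(0.50, 0.50))  # 0.25 -> VMware reports 25% busy
```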

OS vs VMware data

Here is data from a real VM.



The (top) dark blue line is the data captured from the OS, and the (bottom) light blue line is the data from VMware.  While there is clearly some correlation between the two, at the start of the chart there is about a 1.5% CPU difference.  Given we’re only running at about 4.5% CPU, that is an overestimation by the OS of about 35%.  At about 09:00 the difference is ~0.5%, so the gap doesn’t remain stable either.  This is a small system, but if you scaled it up it would not be unusual to see the OS reporting 70% CPU utilisation and VMware reporting 30%.
This large difference between what the OS thinks is happening and what is really happening all comes down to time slicing.

I'll be looking at time slicing on Monday.

Phil Bell
Consultant

Monday 1 June 2015

The changing face of Capacity Management - Private Clouds (5 of 5)

With a private cloud, a key output of any capacity management process must be information to the internal customers.  In order to get this information, capacity and performance data must be captured and stored.


As an example, let's consider a VMware vSphere implementation that was put in place to replace an organization's Windows and Linux estate.


First of all, this data must be at the right granularity and at the right levels -- as I mentioned earlier, it's not enough to know what's happening inside the virtual machine or even just within the service itself.


Important data includes availability information, utilizations and allocations, service level agreements (how often are they violated) and financial data (costs, charges, and pricing) as well.


On top of data that's specific to that group, it's probably a good idea with a private cloud to include some "macro" level data.  How much overall capacity is there within the private cloud?  What are the overall utilizations?  How much available capacity is there in the entire environment?


Again, it's easy to over-allocate or under-allocate by a small amount for each internal group or application, but it's just as important to show the "overall" view because it is incredibly costly to an organization if the overall environment is over-built (too much money spent on hardware, software, etc.) or under-built (lost business, unhappy customers).


So it stands to reason that any capacity management solution for a private cloud should capture data from a VMware environment at the datacenter, cluster, resource pool, host, and VM level, provide data capture at a very granular level, and have the ability to roll it up into multiple levels of summaries over time.  


It’s important to be able to incorporate business statistics, financial and costing information into the database.
Reports and alerts (performance and trending) including these types of data help you to communicate effectively with your internal customers and your organization's management, in terms they understand.


It will come as no surprise that we have expertise in producing and implementing capacity management processes or that athene®, along with our capture packs, provides everything you need to successfully capacity manage your private cloud environment.

If you’d like more info call us or visit our website www.metron-athene.com



Rich Fronheiser
Chief Marketing Officer