So as I mentioned before, the question I get asked most often when discussing VMware capacity is "How many more VMs can I fit in this cluster?" Which is a bit like asking how many balls, drawn from a variety of sports, it takes to fill an Olympic swimming pool. Unfortunately, "It depends" is not an acceptable answer for a CIO.
The business wants a number, so as a business-focused IT department we must give one. The key is that it's OK to estimate. Anybody who has compared the average business forecast to what eventually happens in reality knows the business is fine with estimates.
So how do we figure out the number to tell the business?
If we calculate the size of our average VM and the size of the cluster, then divide one by the other, we get the total number of VMs the cluster can hold. Now just take off the current number of VMs, right?
Sounds simple. Except we need to define the size of our cluster. Are we allowing for one or more hosts to fail? Can we identify the size of the largest host(s)?
We also need to decide which metrics we are going to size on. Do you want to size on the vCPU-to-core ratio, on MHz of CPU and MB of memory, or on some other limitation?
Can you then calculate the size of your average VM at every point during the day and pick the peak, or a percentile?
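As a sketch of that percentile idea (the per-interval figures below are hypothetical; in practice you'd export average-VM-size samples from your monitoring tool):

```python
# Sketch: choose a sizing value for the "average VM" from a day's samples.
# Each value is the average VM memory demand (MB) in one interval; the data
# here is made up for illustration.
avg_vm_mb_per_interval = [1450, 1500, 1480, 1620, 4100, 1550, 1590, 1510]

peak = max(avg_vm_mb_per_interval)  # 4100 - a blip we may not want to size on

# 90th percentile via the sorted-index method (no external libraries needed)
samples = sorted(avg_vm_mb_per_interval)
idx = min(len(samples) - 1, round(0.90 * (len(samples) - 1)))
p90 = samples[idx]

print(f"peak={peak} MB, p90={p90} MB")
```

Sizing on the percentile rather than the raw peak keeps a single blip from inflating the whole calculation.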
Would you instead agree on an average size for Small, Medium, and Large VMs, count how many of each you currently have, and extrapolate using the existing ratios?
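That Small/Medium/Large approach can be sketched like this (the sizes and counts are hypothetical examples, not recommendations):

```python
# Sketch: derive an average VM size from agreed S/M/L sizes and the
# current mix of the estate. All figures are hypothetical.
sizes_mb = {"Small": 2048, "Medium": 4096, "Large": 8192}
current_counts = {"Small": 600, "Medium": 300, "Large": 100}

total_vms = sum(current_counts.values())
# Existing ratios, e.g. Small is 60% of the estate
ratios = {t: n / total_vms for t, n in current_counts.items()}

# Weighted average VM size implied by those ratios; extrapolating new
# demand at the same mix keeps this figure valid.
avg_vm_mb = sum(ratios[t] * sizes_mb[t] for t in sizes_mb)
print(f"average VM = {avg_vm_mb:.0f} MB")
```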
You have to be able to answer these questions before you can start the calculations.
Clearly you need data to work with. You can manually read the information out of the vSphere Client and note it down, but I'd suggest you find a tool to automate the data collection.
You'll need to review the data and make sure it covers a good period for the exercise: not, for example, during Windows updates and a reboot of every VM!
You should also try to include known projects. You might have 1,000 VMs currently, but if another 250 are planned for implementation in the next 6 months, you'll want to take them into account.
Here's an example of a good peak (circled). The actual peak is a blip that we don't want to size on, but the circled peak is a nice clean example that's in line with other days.
Given the size of the cluster in MB of memory and MHz of CPU, the number of current VMs, the size of an average VM, and the size of the largest host, I put together a spreadsheet.
There's a calculation that takes the size of the largest host off the size of the cluster, then takes 90% of the result. It then calculates the number of average VMs that will fit, and the space available in average VMs, for both memory and CPU. The smaller of the two values is displayed, along with either Memory or CPU as the "Bound By" metric.
Conditional formatting on the cell displaying the number of VMs available sets a Red, Amber, or Green status.
By including a sheet containing the number of VMs needed for future projects, I can calculate a second value that takes those into account.
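A minimal sketch of that spreadsheet logic, with every input figure hypothetical (substitute your own cluster's numbers):

```python
# Sketch of the headroom calculation: remove the largest host, keep 90%
# of what's left, then see how many average VMs fit in the smaller of the
# memory and CPU headroom. All figures below are made up for illustration.
cluster_mb, cluster_mhz = 1_048_576, 400_000       # total cluster capacity
largest_host_mb, largest_host_mhz = 262_144, 100_000
avg_vm_mb, avg_vm_mhz = 2_048, 500                 # size of the average VM
current_vms = 300
future_project_vms = 25                            # known upcoming demand

usable_mb = (cluster_mb - largest_host_mb) * 0.9   # survive one host failure
usable_mhz = (cluster_mhz - largest_host_mhz) * 0.9

fit_mem = int(usable_mb // avg_vm_mb)              # VMs that fit by memory
fit_cpu = int(usable_mhz // avg_vm_mhz)            # VMs that fit by CPU

total_fit = min(fit_mem, fit_cpu)
bound_by = "Memory" if fit_mem <= fit_cpu else "CPU"

available = total_fit - current_vms
available_after_projects = available - future_project_vms
print(f"{available} VMs available ({bound_by} bound), "
      f"{available_after_projects} after known projects")
```

The "Bound By" value tells you which resource to buy more of first; the second figure is the one to trend once known projects are folded in.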
Exporting some of the calculated values on a regular basis enables me to trend, over time, the number of VMs available in the cluster, still accounting for the largest host failing and with 90% of the remaining capacity as the maximum.
In this case, activity was actually falling over time, and as such the number of VMs available in the cluster was increasing in terms of CPU capacity.
On Friday I'll do a round-up of this series, and I hope to see some of you at my webinar today.
Phil Bell
Consultant