Dangers with OS Metrics
Almost every time we discuss data capture for VMware, someone asks whether we can capture the utilization of specific VMs by monitoring the OS. The simple answer is no.
The more complex answer is that we can capture the data from the OS, but it may not be reliable. Here’s an example of why.
We have 2 VMs. Within the 1-second interval we are looking at, the first VM was only allocated the CPU for ½ a second. In that ½ second the VM used 50% of its possible CPU time, so from the OS perspective it was running at 50% CPU utilization. If we look at the data from VMware, we’ll see that VMware knows the VM only used ½ of the CPU available to it, and only for ½ a second. That is 25% of the full interval.
The 2nd VM was running on CPU for the entire second, and again it used 50% of its possible CPU time. So to the OS it appears it was running at 50% CPU utilization, and VMware reports the same result.
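The arithmetic for the two VMs above can be sketched in a few lines of Python. This is only an illustration of the example, not how VMware or any guest OS actually calculates its counters. The function name `cpu_utilization_views` and its parameters are made up for this sketch, and it assumes the OS only measures time while the VM is scheduled on a CPU.

```python
def cpu_utilization_views(busy_seconds, scheduled_seconds, interval_seconds=1.0):
    """Return (OS view, hypervisor view) of CPU utilization as fractions.

    Hypothetical helper for illustration only.
    Assumption: the OS only "sees" the time it was scheduled on a CPU,
    while the hypervisor measures against the full wall-clock interval.
    """
    os_view = busy_seconds / scheduled_seconds
    hypervisor_view = busy_seconds / interval_seconds
    return os_view, hypervisor_view


# VM 1: on CPU for 0.5 s of the 1 s interval, busy for half of that (0.25 s).
vm1 = cpu_utilization_views(busy_seconds=0.25, scheduled_seconds=0.5)

# VM 2: on CPU for the whole second, busy for half of it (0.5 s).
vm2 = cpu_utilization_views(busy_seconds=0.5, scheduled_seconds=1.0)

print(vm1)  # (0.5, 0.25): the OS says 50%, the hypervisor says 25%
print(vm2)  # (0.5, 0.5): both agree on 50%
```

Both VMs look identical from inside the OS, yet the first one did only half the work of the second.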
The more contention there is for CPU time, the more time VMs will spend dormant or idle, waiting to be scheduled, and the further apart the two values will be. The OS has no visibility of the time it was not running, so any metric that has an element of time in its calculation cannot be relied upon to be accurate.
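To get a feel for how quickly the two views drift apart, here is a rough sketch using the same simplified model as the 2 VM example: the OS divides busy time by the time it was scheduled, and the hypervisor divides it by the whole interval. Under that assumption the OS value is the hypervisor value multiplied by 1 / (fraction of the interval the VM was scheduled). The function name `os_overstatement` is invented for this illustration.

```python
def os_overstatement(scheduled_fraction):
    """Factor by which the OS view exceeds the hypervisor view.

    Hypothetical helper, assuming the simplified model in which the OS
    only accounts for the time the VM was actually scheduled on a CPU.
    """
    if not 0 < scheduled_fraction <= 1:
        raise ValueError("scheduled_fraction must be in the range (0, 1]")
    return 1.0 / scheduled_fraction


# Less scheduled time (more contention) means a bigger gap between the views.
for fraction in (1.0, 0.8, 0.5, 0.25):
    factor = os_overstatement(fraction)
    print(f"scheduled for {fraction:.0%} of the interval: "
          f"OS reads {factor:.2f}x the hypervisor value")
```

Real schedulers and guest clocks are more complicated than this, so treat the numbers as a guide to the direction of the error rather than its exact size.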
Here is data from a real VM.
The (top) dark blue line is the data captured from the OS, and the (bottom) light blue line is the data from VMware. There clearly is some correlation between the two. At the start of the chart there is a difference of about 1.5 percentage points of CPU. Given we’re only running at about 4.5% CPU, that is an overestimation by the OS of about 35%. But at about 09:00 the difference is ~0.5 percentage points, so the gap doesn’t remain stable either.
Historically it’s not been unusual to see situations where the OS metric is reporting 70% CPU utilization and VMware is reporting 30%.
More on Wednesday. In the meantime, don't forget to register for our next webinar, 'Top 5 VMware tips for performance and capacity'.