Here is my top 10 list of VMware metrics to help you pinpoint performance bottlenecks within your VMware vSphere virtual infrastructure. I hope you find them useful.
CPU
1. Average CPU Usage in MHz - this metric should be reported at both host and guest levels. A guest (VM) has to run on an ESX host, and that host has finite resources, so high CPU usage at the host level could indicate a bottleneck. Break the usage down across all hosted guests to get a clear indication of which ones are using the most. If you have enabled DRS on your cluster, you may see a rise in the number of vMotions as DRS attempts to load balance.
2. CPU Ready Time - this is an important metric that gives a clear indication of CPU overcommitment within your VMware virtual infrastructure. CPU overcommitment can lead to significant CPU performance problems because of the way the ESX CPU scheduler places virtual CPU (vCPU) work onto physical CPUs (pCPUs). This metric is reported at the guest level. Ready time values large enough to be measured in seconds can indicate that you have provisioned too many vCPUs for this guest. Compare the total number of vCPUs assigned to all hosted guests with the number of physical CPUs available on the host(s) to see whether you have overcommitted the CPU.
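As a rough illustration of the overcommitment check described above, here is a minimal Python sketch. The function names, guest names and figures are all hypothetical, and nothing here calls a VMware API. It assumes you have already exported the vCPU count per guest and the physical CPU count for the host. The second helper assumes the ready value is a summation in milliseconds over a known sampling interval (20 seconds is used on the assumption that the value comes from a real-time chart), which lets you express it as a percentage.

```python
def vcpu_to_pcpu_ratio(guest_vcpus, physical_cpus):
    """Total vCPUs assigned to all guests divided by the host's physical CPUs."""
    if physical_cpus <= 0:
        raise ValueError("physical_cpus must be positive")
    return sum(guest_vcpus.values()) / physical_cpus


def ready_percent(ready_ms, interval_s):
    """Convert a CPU ready summation (ms) into a percentage of the sampling interval."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return ready_ms * 100 / (interval_s * 1000)


# Illustrative values only: guest name -> vCPUs assigned.
guests = {"web01": 2, "db01": 4, "app01": 2}

print(f"vCPU:pCPU ratio = {vcpu_to_pcpu_ratio(guests, 4):.1f}:1")
# Assumes a 20-second sampling interval for the ready summation.
print(f"CPU ready = {ready_percent(1000, 20):.1f}%")
```

A ratio well above 1:1 is not a problem in itself, but read together with rising ready time it points to the guests that are worth reviewing first.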
Memory
3. Average Memory Usage in KB - similar to Average CPU Usage, this should be reported at both host and guest levels. It can indicate who is using the most memory, but high usage does not necessarily indicate a bottleneck. If memory usage is high, look at the values reported for memory ballooning and swapping.
4. Balloon KB - values reported for the balloon indicate that the host cannot meet its memory requirements, and they are an early warning sign of memory pressure on the host. The balloon driver is installed via VMware Tools onto Windows and Linux guests. Its job is to force the operating system of lightly used guests to page out unused memory so that ESX can reclaim it and satisfy the demand from hungrier guests.
5. Swap Used KB - if you see values being reported at the host for swap, this indicates that memory demands cannot be satisfied and guest memory is being swapped out to the .vswp file. This is ‘Bad’. Guests will have to be migrated to other hosts, or more memory will need to be added to this host, to satisfy the memory demands of the guests.
6. Consumed - Consumed memory is the amount of Memory Granted on a Host to its guests minus the amount of Memory Shared across them. Memory can be over-allocated, unlike CPU, by sharing common memory pages such as Operating System pages. This metric displays how much Host Physical Memory is actually being used (or consumed) and includes usage values for the Service Console and VMkernel.
7. Active - this metric reports the amount of physical memory recently used by the guests on the Host and is displayed as “Guest Memory Usage” in vCenter at Guest level.
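To make the memory metrics above a little more concrete, here is a minimal Python sketch. It assumes you already have the counter values in KB from your own reporting. The function names and sample figures are hypothetical. It applies the relationship described above (consumed is granted minus shared) and treats any balloon value as an early warning and any swap value as the more serious condition.

```python
def consumed_kb(granted_kb, shared_kb):
    """Consumed memory as described above: memory granted minus memory shared."""
    return granted_kb - shared_kb


def memory_pressure(balloon_kb, swap_used_kb):
    """Classify host memory pressure from balloon and swap values (both in KB)."""
    if swap_used_kb > 0:
        return "swapping"      # demand cannot be met from physical memory
    if balloon_kb > 0:
        return "ballooning"    # early warning sign of memory pressure
    return "ok"


# Illustrative values only.
print(consumed_kb(granted_kb=8_000_000, shared_kb=1_500_000))
print(memory_pressure(balloon_kb=250_000, swap_used_kb=0))
```

Swap is checked first because a host that is swapping is usually ballooning as well, and swap is the more urgent of the two to report.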
Disk I/O
8. Queue Latency - this metric measures the average amount of time taken per SCSI command in the VMkernel queue. This value must always be zero. If not, it indicates that the workload is too high and the storage array cannot process the data fast enough.
9. Kernel Latency - this metric measures the average amount of time, in milliseconds, that the VMkernel spends processing each SCSI command. For best performance, the value should be between 0 and 1 millisecond. If the value is greater than 4ms, the virtual machines on the host are trying to send more throughput to the storage system than the configuration supports. If this is the case, check the CPU usage, and increase the queue depth or storage.
10. Device Latency - this metric measures the average amount of time, in milliseconds, to complete a SCSI command from the physical device. Depending on your hardware, a number greater than 15ms indicates there are probably problems with the storage array. If this is the case, move the active VMDK to a volume with more spindles or add more disks to the LUN.
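The three disk latency thresholds above can be combined into one simple check. This is a minimal Python sketch that assumes all three values are averages in milliseconds that you have already collected. The function name and the wording of the messages are my own, and the thresholds are simply the ones quoted in items 8 to 10.

```python
def check_disk_latency(queue_ms, kernel_ms, device_ms):
    """Return a list of findings based on the thresholds quoted in items 8 to 10."""
    findings = []
    if queue_ms > 0:
        findings.append("queue latency above zero: workload too high for the array")
    if kernel_ms > 4:
        findings.append("kernel latency above 4ms: check CPU usage and queue depth")
    if device_ms > 15:
        findings.append("device latency above 15ms: probable storage array problem")
    return findings


# Illustrative values only.
for finding in check_disk_latency(queue_ms=0.0, kernel_ms=5.2, device_ms=18.0):
    print(finding)
```

An empty list means none of the three thresholds was breached for that sample.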
Note: When reporting usage values, take into consideration any child resource pools specified with CPU/Memory limits and report accordingly.
I'm running a two-part webinar series, starting this Thursday, on VMware vSphere Performance Management Challenges and Best Practices. Register and come along: http://www.metron-athene.com/services/training/webinars/index.html
Jamie Baker
Principal Consultant
Metron-Athene Inc.
That's a great post and really has saved me a lot of work. Do you have any numbers around this monitoring? ie; acceptable CPU/memory/latency numbers?
Thanks for your question. I've posted another blog today to answer this.