Thursday, 24 December 2015

Some end of 2015 thoughts from our CEO....

Gartner identified their Top 10 Strategic Technology Trends for 2015 at their October Symposium in Orlando.

Knowing how capacity management people just love statistics, I thought I’d offer a percentage relevance/impact on capacity management of the 10 items from Metron’s perspective. What can’t be denied is that dependent on your industry, any or all of these will impact you at some stage if you are a capacity manager. 

Top 10 Item                                     Capacity Management Impact 2015        Trend

Computing Everywhere                                        100%           →

Whether it is providing support for mobile devices with our athene® software or helping people handle the mass of data being generated as their own industry makes increasing use of them, this is definitely a major direction for Metron. 

Internet of Things                                                    20%           ↑ 

Some impact so far, as our clients have had to deal with increasing volumes of data, predominantly due to Cloud-based implementations. This will only increase as the Internet of Things becomes more pervasive, for example the predicted rapid growth in pay-per-use applications. 

3D Printing                                                                   5%           ↑ 

With shipments of 3D printers expected to double in 2016, this is clearly a major growth area. For Metron, we are starting to see the impact on applications with embedded links to 3D printing increasing the processing load on environments further. 

Advanced, pervasive and invisible analytics        50%         ↑

More and more systems, more and more data, means analytics needs to get smarter. Research and development for Metron continues to focus on new ways to intuitively analyze and use the ever-increasing pool of data available to capacity managers. 

Context Rich Systems                                                20%         ↑

New research and development in the works includes context-sensitive planning techniques to enable capacity planning to evolve with the emerging world of rapid-deployment, rapid-change systems. 

Smart machines                                                         10%         ↑

The smarter the machine the greater the impact if it fails. Even with ever cheaper commoditized components, manufacturing will continue to want to optimize costs. Ensuring just the right capacity is available to each component of your smart car or helpful household robot will be essential to enable producers to maximize profits and keep feeding that next generation of R&D.

Cloud/Client Computing                                        100%         →

Already a significant element of life for a Capacity Manager or Capacity Management software provider. Stories already circulate of overspend and uncontrolled spend on Cloud systems. Over time, businesses will increasingly need capacity management to prevent and remove the waste. For Metron, ever more development effort goes on enhancing the tools available to do this.

Software Defined Application and Infrastructure 50%      ↑

Metron moved to the agile principles underpinning SDA and SDI some years ago for our own in-house development, seeing how our software would need to evolve ever quicker in response to changing demands. Work underway now looks to provide the same agile approach to rules and models in our future software to deliver rapid response, self-learning capacity planning and management techniques.

Web-Scale IT                                                                   20%     ↑

Many of our clients are now starting to deploy Cloud-style systems, pioneered by the likes of Google and Facebook, within their own data centers. Looking well ahead, Metron has for some time been concentrating software enhancements on ensuring that our software evolves to support changing data center architectures, while still supporting the traditional implementations that will provide core organization systems for some time to come.

Risk-Based Security and Self-Protection                    5%      ↑

With all of Metron’s athene® facilities being increasingly provided as web-based applications, work continues to provide ever-increasing application security, to supplement the sophisticated perimeter security now deployed by all our clients.

We hope all of our clients and friends have a wonderful holiday and we'll see you in 2016.

Andrew Smith
CEO, Metron

Tuesday, 8 December 2015

VMware Capacity Planning

VMware is the go-to option for virtualization for many organizations, and has been for some time. The longer it’s been around, the more focus there is on making efficiency savings for the organization. This is where the Capacity Manager really needs to understand the technology, how to monitor it, and how to decide what headroom exists.

I'll be running a webinar 'VMware Capacity Planning' on December 16, where I'll take a look at some of the key topics in understanding VMware capacity. 

      •     Why OS monitoring can be misleading

      •     5 key VMware metrics for understanding VMware capacity

      •     How VMware processor scheduling impacts CPU capacity measurements

      •     Measuring memory capacity

      •     Measuring disk storage latency

      •     Calculating headroom in VMs

Register for your place now

Phil Bell

Friday, 4 December 2015

Top 10 VMware Metrics to help pinpoint bottlenecks

Here is a Top 10 list of VMware metrics to assist you in pinpointing performance bottlenecks within your VMware vSphere virtual infrastructure.  I hope you find these useful. 

1. Ave CPU Usage in MHz - this metric should be reported at both host and guest levels.  Because a guest (VM) has to run on an ESX host, and that ESX host has a finite amount of resource, high CPU Usage at the host level could indicate a bottleneck; a breakdown across all hosted guests will give a clear indication of who is using the most.  If you have enabled DRS on your cluster, you may see a rise in the number of vMotions as DRS attempts to load balance.
2. CPU Ready Time - This is an important metric that gives a clear indication of CPU overcommitment within your VMware Virtual Infrastructure.  CPU overcommitment can lead to significant CPU performance problems due to the way in which the ESX CPU scheduler places Virtual CPU (vCPU) work onto Physical CPUs (pCPUs).  This is reported at the guest level.  Values reported in seconds can indicate that you have provisioned too many vCPUs for this guest.  Compare the total vCPUs assigned to all hosted guests with the number of Physical CPUs available on the host(s) to see whether you have overcommitted the CPU.
3. Ave Memory Usage in KB - similar to Average CPU Usage, this should be reported at both Host and Guest levels.  It can give you an indication in terms of who is using the most memory but high usage does not necessarily indicate a bottleneck.  If Memory Usage is high, look at the values reported for Memory Ballooning/Swapping.
4. Balloon KB - values reported for the balloon indicate that the Host cannot meet its Memory requirements and is an early warning sign of memory pressure on the Host.  The Balloon driver is installed via VMware Tools onto Windows and Linux guests and its job is to force the operating system, of lightly used guests, to page out unused memory back to ESX so it can satisfy the demand from hungrier guests.
5. Swap Used KB - if you see values reported at the Host for Swap, this indicates that memory demands cannot be satisfied and processes are swapped out to the vSwp file.  This is ‘Bad’.  Guests may have to be migrated to other hosts, or more memory added to this host, to satisfy the memory demands of the guests.
6. Consumed - Consumed memory is the amount of Memory Granted on a Host to its guests minus the amount of Memory Shared across them.  Memory can be over-allocated, unlike CPU, by sharing common memory pages such as Operating System pages.  This metric displays how much Host Physical Memory is actually being used (or consumed) and includes usage values for the Service Console and VMkernel.
7. Active - this metric reports the amount of physical memory recently used by the guests on the Host and is displayed as “Guest Memory Usage” in vCenter at Guest level.
Disk I/O
8. Queue Latency - this metric measures the average amount of time taken per SCSI command in the VMkernel queue. This value should always be zero. If it is not, the workload is too high and the storage array cannot process the data fast enough.
9. Kernel Latency - this metric measures the average amount of time, in milliseconds, that the VMkernel spends processing each SCSI command. For best performance, the value should be between 0-1 milliseconds. If the value is greater than 4ms, the virtual machines on the Host are trying to send more throughput to the storage system than the configuration supports. If this is the case, check the CPU usage, and increase the queue depth or storage.
10. Device Latency - this metric measures the average amount of time, in milliseconds, to complete a SCSI command from the physical device. Depending on your hardware, a number greater than 15ms indicates there are probably problems with the storage array.   Again if this is the case, move the active VMDK to a volume with more spindles or add more disks to the LUN.
Note:  when reporting usage values, please take into consideration any child resource pools specified with CPU/Memory limits and report accordingly. 
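As a rough illustration of how two of these metrics turn into numbers, here is a hedged Python sketch. The 20-second sample interval is the vCenter real-time default; the function names and figures are my own for illustration, not part of any VMware API.

```python
# Hedged sketch, not a VMware API: convert a raw CPU Ready summation (ms)
# into a percentage of the sample interval, and compute a simple
# vCPU:pCPU overcommitment ratio.

def cpu_ready_percent(ready_ms, vcpus=1, sample_interval_s=20):
    """CPU Ready as a percentage of the sample interval, per vCPU."""
    return ready_ms * 100 / (sample_interval_s * 1000 * vcpus)

def overcommit_ratio(total_vcpus, physical_cpus):
    """vCPU:pCPU ratio across a host; above 1.0 the CPU is overcommitted."""
    return total_vcpus / physical_cpus

# 4000 ms of ready time in a 20 s interval on a 2-vCPU guest:
print(cpu_ready_percent(4000, vcpus=2))                    # 10.0 (% per vCPU)
print(overcommit_ratio(total_vcpus=48, physical_cpus=16))  # 3.0
```

A sustained ready percentage in double digits, as here, is usually the point at which the vCPU allocation deserves a closer look.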
We're running a webinar, 'VMware Capacity Planning'. Register and come along.
Jamie Baker
Principal Consultant

Wednesday, 2 December 2015

Idle VMs - Why should we care? (3 of 3)

Earlier in the week I looked at the impact idle VMs can have on CPU utilization and memory overhead; today I’m going to look at the amount of Disk or Datastore space used per idle VM. 

Each one will have associated VMDK (disk) files.  The files are stored within a Datastore, which in most cases is hosted SAN or NAS storage and shared between the cluster host members.  If VMDKs are provisioned as "Thick Disks" then the provisioned space is locked out within the Datastore for those disks.

To illustrate this, a least-worst-case example would be:  100 idle Windows VMs have been identified across the Virtual Infrastructure, and each VM has a single "Thick" VMDK of 20GB used to house the operating system.  This equates to 2TB of Datastore space being locked for use by VMs that are idle.  You can expand this further by assuming that some, if not all, VMs are likely to have more disks, and of differing sizes.

The simple math will show you how much Datastore space is being wasted.
There is a counter to this, known as Thin Provisioning.  By using Thin disks, in which the provisioned disk size is reserved but not locked, you would not waste the same amount of space as you would with Thick disks.  Thin Provisioning also has the added benefit of letting you over-allocate disk space, reducing the amount of up-front storage capacity required while incurring only minimal overhead.
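The simple math can be sketched as follows; the figures mirror the 100 x 20GB example above, and the function name is illustrative only, not from any toolset.

```python
# Sketch of the thick-provisioning math: space locked in the Datastore
# by idle VMs with thick-provisioned VMDKs.

def locked_space_gb(vm_count, vmdk_gb_each):
    """Datastore space locked by thick-provisioned disks, in GB."""
    return vm_count * vmdk_gb_each

locked = locked_space_gb(100, 20)
print(f"{locked} GB locked by idle VMs ({locked / 1000} TB)")  # 2000 GB (2.0 TB)
```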

Idle VMs - why you should care.

Identifying Idle VMs, questioning whether they are required, finding out who owns them and  removing them completely will reduce or help eliminate VM sprawl and help to improve the performance and capacity of the Virtual Infrastructure by:

·       reducing unnecessary timer interrupts

·       reducing allocated vCPUs

·       reducing unnecessary CPU and Memory overhead

·       reducing used Datastore space

·       making more efficient use of your Virtual Infrastructure, including improved VM-to-Host ratios and a reduction in additional hardware.
Don't forget to sign up for our Capacity Management Maturity online workshop.

Jamie Baker

Principal Consultant

Monday, 30 November 2015

Idle VMs - Why should we care? (2 of 3)

In my previous blog I mentioned the term VM Sprawl and this is where Idle VMs are likely to factor. 

Often VMs are provisioned to support short-term projects, for development/test processes, or for applications which have since been decommissioned.  Now idle, they’re left alone, not bothering anyone and therefore not on the Capacity and Performance team’s radar.

Which brings us back to the question.  Idle VMs - Why should we care? 
We should care, for a number of reasons but let's start with the impact on CPU utilization.

When VMs are powered on and running, timer interrupts have to be delivered from the host CPU to the VM.  The total number of timer interrupts being delivered depends on the following factors:

·       VMs running symmetric multiprocessing (SMP) hardware abstraction layers (HALs)/kernels require more timer interrupts than those running uniprocessor HALs/kernels.

·       How many virtual CPUs (vCPUs) the VM has.

Delivering many virtual timer interrupts can negatively impact on the performance of the VM and can also increase host CPU consumption.  This can be mitigated however, by reducing the number of vCPUs which reduces the timer interrupts and also the amount of co-scheduling overhead (check CPU Ready Time). 
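Purely as an illustration of why the vCPU count matters here: if each vCPU receives timer interrupts at its guest kernel's tick rate, the host-side delivery load scales with both. The 100 Hz and 1000 Hz figures below are typical defaults for older uniprocessor and SMP Linux kernels respectively; actual rates vary by OS and version.

```python
# Illustrative heuristic only: host-side timer-interrupt load for a guest
# scales with vCPU count times the guest kernel's tick rate.

def timer_interrupts_per_sec(vcpus, tick_hz):
    """Estimated timer interrupts/sec the host must deliver to one guest."""
    return vcpus * tick_hz

# An idle 4-vCPU guest with a 1000 Hz SMP kernel vs a 1-vCPU, 100 Hz guest:
print(timer_interrupts_per_sec(4, 1000))  # 4000 interrupts/sec
print(timer_interrupts_per_sec(1, 100))   # 100 interrupts/sec
```

Even sitting idle, the 4-vCPU guest costs the host forty times the interrupt delivery work, which is why trimming vCPUs on idle VMs pays off.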

Then there's the Memory management of Idle VMs.  Each powered on VM incurs Memory Overhead.   The Memory Overhead includes space reserved for the VM frame buffer and various virtualization data structures, such as Shadow Page Tables (using Software Virtualization) or Nested Page Tables (using Hardware Virtualization).  This also depends on the number of vCPUs and the configured memory granted to the VM.

We’ll have a look at a few more reasons to care on Wednesday, in the meantime why not complete our Capacity Management Maturity Survey and find out where you fall on the maturity scale.
Jamie Baker
Principal Consultant

Friday, 27 November 2015

Idle VMs - Why should we care? (1 of 3)

The re-emergence of Virtualization technologies, such as VMware, Microsoft's Hyper-V, Xen and Linux KVM has provided organizations with the tools to create new operating system platforms ready to support the services required by the business, in minutes rather than days. 
Indeed IT itself is a service to the business.

In more recent times, Cloud computing, which is itself underpinned by Virtualization, makes use of the functionality provided to satisfy:
  • on-demand resources
  • the ability to provision faster
  • rapid elasticity (refer to NIST's description of Cloud Computing)

Cloud computing makes full use of the underlying clustered hardware. Constant strides are being made by Virtualization vendors to improve the Virtual Machine (VM) to Host ratio, without affecting the underlying performance. 
But, you may ask "What's this got to do with Idle VMs?"

Well, as I described earlier, Virtualization provides the means to easily and quickly provision virtual systems. Your CTO/CIO is going to demand a significant ROI once an investment in both the hardware and virtualization software has been made, possibly switching the focus to an increase in the VM to Host ratio. 

“What's wrong with that?” I hear you say.  Nothing at all, as long as you keep track of what VMs you are provisioning and:

  •          what resources you have granted
  •          what they are for

Failure to do so will mean that your quest for a good ROI and a satisfied Chief will be in jeopardy, as you’ll encounter a term most commonly known as VM Sprawl.
More about this on Monday.
In the meantime why not register for my webinar 'VMware Capacity Planning'.

Jamie Baker

Principal Consultant

Monday, 23 November 2015

VMware – Virtual Center Headroom (17 of 17) Capacity Management, Telling the Story

Today I’ll show you one final report on VMware, which looks at headroom available in the Virtual Center.

In the example below we’re showing CPU usage. The average CPU usage is illustrated by the green bars, the light blue represents the amount of CPU available across this particular host and the dark blue line is the total CPU power available.

VMware – Virtual Center Headroom
We have aggregated all the hosts up within the cluster to see this information.
We can see from the green area at the bottom how much headroom we have up to the blue line at the top, although in this case we will actually compare it to the turquoise area, as this is the amount of CPU available to the VMs.
The difference between the dark blue line and the turquoise area is the capacity consumed by the VMkernel, which has to be taken into consideration.
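A minimal sketch of that headroom calculation, assuming invented numbers: three hosts of 20 GHz each, and a purely illustrative 6% VMkernel share. None of these figures come from the report itself.

```python
# Hedged sketch: aggregate host CPU capacity across the cluster, subtract
# an assumed VMkernel share, then compare to average usage. All in MHz.

def cluster_headroom_mhz(host_capacities_mhz, avg_usage_mhz, vmkernel_fraction=0.06):
    total = sum(host_capacities_mhz)                    # the dark blue line
    available_to_vms = total * (1 - vmkernel_fraction)  # the turquoise area
    return available_to_vms - avg_usage_mhz             # headroom above the green bars

print(cluster_headroom_mhz([20000, 20000, 20000], avg_usage_mhz=25000))
# roughly 31400 MHz of headroom for new or growing VMs
```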

To summarize my blog series, when reporting:

        Stick to the facts

        Keep it to an elevator talk

        Show as much information as needs to be shown

        Display the information appropriate for the audience

        Talk the language of the audience

….Tell the Story

Sign up to our Community and get access to all our Resources, on-demand webinars, white papers and more....

Charles Johnson
Principal Consultant

Friday, 20 November 2015

VMware Reports (16 of 17) Capacity Management, Telling the Story

Let’s take a look at some examples of VMware reports.

The first report below looks at the CPU usage of clusters in MHz. It is a simple chart and this makes it very easy to understand.

VMware – CPU Usage all Clusters

You can immediately see who the biggest user of the CPU is: Core site 01.
The next example is a trend report on VMware resource pool memory usage.
The light blue indicates the amount of memory reserved and the dark blue line indicates the amount of memory used within that reservation. This information is then trended going forward, allowing you to see at which point in time the required memory is going to exceed the memory reservation.
VMware – Resource Pool Memory Usage Trend
A trend report like this is useful as an early warning system: you know when problems are likely to ensue and can do something to resolve them before they become an issue.
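The trend logic behind such a report can be sketched with a simple least-squares fit; the monthly usage figures and the 32 GB reservation below are invented for illustration and are not taken from the chart.

```python
# Hedged sketch: fit a straight line to historic memory usage and project
# when it crosses the reservation, which is what the trend report shows.

def linear_fit(xs, ys):
    """Least-squares straight line; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

months = [1, 2, 3, 4, 5, 6]
used_gb = [20, 21.5, 23, 24.5, 26, 27.5]   # steady 1.5 GB/month growth
slope, intercept = linear_fit(months, used_gb)

reservation_gb = 32
crossing = (reservation_gb - intercept) / slope
print(f"usage projected to exceed the reservation around month {crossing:.1f}")
# month 9.0 in this invented series
```

In practice you would fit against real samples from the CMIS and re-run the projection automatically as new data arrives.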

We need to keep ahead of the game, and setting up simple but effective reports, produced automatically, will help you to do this and to report back to the business regarding requirements well in advance.

On Monday I’ll show you one final report on VMware, which looks at headroom available in the Virtual Center. In the meantime take a look at our Capacity Management Maturity workshop.

Charles Johnson
Principal Consultant

Wednesday, 18 November 2015

Model – Linux server change & disk change (15 of 17) Capacity Management, Telling the Story

Following on from Monday's blog today I'll show the model for change in our hardware.

In the top left hand corner we are showing that once we reach the ‘pain’ point and then make a hardware upgrade the CPU utilization drops back to within acceptable boundaries for the period going forward.

In the bottom left hand corner you can see from the primary results analysis that the upgrade would mean that the distribution of work is more evenly spread now.

The model in the top right hand corner has brought up an issue on device utilization with another disk, so we would have to factor in an I/O change and see what the results of that would be, and so on.

In the bottom right hand corner we can see that the service level has been fine for a couple of periods and then it is in trouble again, caused by the I/O issue.

Whilst this hardware upgrade would satisfy our CPU bottleneck it would not rectify the issue with I/O, so we would also need to upgrade our disks.

When forecasting, modeling helps you to make recommendations on the changes that will be required and when they will need to be implemented.

On Friday I'll take a look at some examples of VMware reports.

Charles Johnson
Principal Consultant


Monday, 16 November 2015

Modeling Scenario (14 of 17) Capacity Management, Telling the Story

I have talked about bringing your KPIs, resource and business data in to a CMIS and about using that data to produce reports in a clear, concise and understandable way.
Let’s now take a look at some analytical modeling examples, based on forecasts which were given to us by the business.
Below is an example of an Oracle box. We have been told by the business that we are going to grow at a steady rate of 10% per month for the next 12 months. We can model to see what the impact of that business growth will be on our Oracle system. 

In the top left hand corner is our projected CPU utilization and on the far left of that graph is our baseline. You can see that over a few months we begin to go through our alarms and our thresholds pretty quickly.

Model – oracleq000 10% growth – server change

In the bottom left hand corner we can see where bottlenecks will be reached indicated by the large red bars which indicate CPU queuing.
On the top right graph we can see our projected device utilization for our busiest disk and we can see that within 4 to 5 months it is also breaching our alarms and thresholds.
Collectively these models are telling us that we are going to run into problems with CPU and I/O.
In the bottom right hand graph is our projected relative service level for this application. In this example we started the baseline off at 1 second; this is key.

By normalizing the baseline at 1 second it is very easy for your audience to see the effect that these changes are likely to have. In this case, once we’ve added the extra workload we can see that we go from 1 second to 1.5 seconds (a 50% increase) and then jump to almost 5 seconds. From 1 to 5 seconds is a huge increase, and one whose impact your audience can immediately grasp.
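The normalization itself is trivial to compute; the projected values in this sketch are illustrative, not taken from the model output above.

```python
# Sketch of expressing projected response times relative to a 1 s baseline,
# so the audience sees multiples rather than raw timings.

baseline_s = 1.0
projected_s = [1.0, 1.5, 2.4, 4.8]   # response time per forecast period (invented)

relative = [p / baseline_s for p in projected_s]
increase_pct = [(r - 1) * 100 for r in relative]

print(relative)  # [1.0, 1.5, 2.4, 4.8] - multiples of the baseline
# the final period is roughly a 380% increase over the baseline
```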

We would next want to show the model for change in our hardware and I'll be looking at this on Wednesday.

Wednesday is also the day of our 'Essential Reporting' webinar, if you haven't registered for your place there's still time to.

Charles Johnson
Principal Consultant

Friday, 13 November 2015

Business Metric Correlation (13 of 17) Capacity Management, Telling the Story

As mentioned previously it is important to get business information in to the CMIS to enable us to perform some correlations.

As in the example below, we have taken business data and component data, and we can now report on them together to see if there is some kind of correlation.

Business Transactions vs. CPU Utilization
In this example we can see that the number of customer transactions (shown in dark blue) correlates reasonably well with the amount of CPU utilization.
Can we make some kind of judgment based on just what we see here? Do we need to perform some further statistical analysis on this data? What is the correlation coefficient for our application data against the CPU utilization?
A value closer to 1 indicates that there is a very close correlation between the application data and the underlying component data.
How can we take this information back to the business? An example would be: this graph indicates that there is a very close correlation between the number of customer transactions and the CPU utilization; therefore, if we plan on increasing the number of customer transactions in the future, we are likely to need a CPU upgrade to cope with that demand.
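The coefficient in question is Pearson's r, which can be computed directly. A hedged sketch follows; the transaction and CPU figures are invented for illustration, not real application data.

```python
# Hedged sketch: Pearson's r between a business metric and CPU utilization.
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

transactions = [1200, 1500, 1800, 2600, 3100, 3600]   # invented business data
cpu_util_pct = [22, 28, 33, 49, 58, 67]               # invented component data

r = pearson_r(transactions, cpu_util_pct)
print(f"r = {r:.3f}")  # close to 1.0, so CPU demand tracks transactions
```

With an r this close to 1, the statement to the business above is well supported; a weak r would instead send you looking for another driver of CPU usage.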
On Monday I'll be looking at a Modeling scenario.
Charles Johnson
Principal Consultant

Wednesday, 11 November 2015

Linux Server – Disk Utilization (12 of 17) Capacity Management, Telling the Story

Below is an example of a report on disk utilization of a Linux server. The reason I chose to share this report is that it is an instance based report, displaying the top 5 disks and their utilization on this system.

You can pick out the top 5 or the bottom 5 to display to your audience, because we don’t want too much ‘noise’ on the chart.

We want to keep things clear and concise, don’t flood reports with meaningless data and keep it relevant to our audience.

On Friday I'll be discussing Business Metric Correlation and why it's important to view business and component data together.

Don't miss out on our 'Essential Reporting' webinar, register now.

Charles Johnson
Principal Consultant

Monday, 9 November 2015

Unix Reports - Capacity Management – Telling the Story (11 of 17)

As promised, today we'll be looking at a Unix report. Let’s begin with an example which has been created for a Linux box.

Below is a simple utilization report for a Linux box running an application, it is for a day and it shows us that there are a couple of spikes where it has breached our CPU threshold.

Linux Server - CPU Utilization

Looking at this report we can see that these peaks take place during the middle of the day. Is that normal behavior? Do you know enough about the system or application that you are reporting on to make that judgment? Do we need to perform a root cause analysis? If we believe the peaks to be normal then maybe we need to adjust the threshold settings or if we are unsure then we need to carry out further investigation. Has some extra load been added to the system? Has there been a change? Are there any other anomalies that you need to find out about?

Remember when reporting don’t make your reports over complicated.

Don’t put too much data on to one chart, it will look messy and be hard to understand.

On Wednesday I'll be talking about Disk Utilization on a Linux server, in the meantime sign up for our 'Essential reporting' webinar on Nov 18

Charles Johnson
Principal Consultant

Friday, 6 November 2015

Dashboard (10 of 17) Capacity Management, Telling the Story

Dashboard – Overview Scorecard

In the following example of a dashboard we can immediately see that we have a green, 2 reds and some greys. Based on the green, amber and red status we can immediately see that we have an issue with a couple of these categories, memory and I/O.

Is this enough information? Who is viewing this information, and does it tell them enough? If management were looking at it, they would be worried as soon as they saw red in the status. A red status does scare senior management, mainly because they do not have the time or inclination to see what is behind the issue. They would immediately be on the phone to their capacity management team asking why there are issues, which then causes more pressure further down the tree.
It may be that this particular issue is not an immediate problem, maybe one of the thresholds was breached during a certain time period and needs investigation.
Dashboard – Overview Scorecard Detail
We can drill down and find out some further information on the issue in this case.
In the report below there is still some red showing so it is going to have to be investigated fully and we would need to drill down even further to find out what applications are involved here.
In the further drill down report below we can see that we have some paging activity on Unix that has breached the threshold.

These red, amber and green scorecards have to be based on thresholds.
Where the grey is shown this simply means that there is no threshold data attached to that.
We need to get in to the details to understand what the root cause of the issue is and understand whether the issue is serious or not.
On Monday I'll be taking a closer look at Unix reports. In the meantime why not take a few minutes and complete our online Capacity Management Maturity Survey to find out where you fall on the Maturity Scale and receive a 20 page report for free.
Charles Johnson
Principal Consultant