Metron - Capacity Management: threshold


Wednesday, 7 September 2016

Memory, what to monitor - Windows Server Capacity Management 101 (10 of 12)

Today we’ll look at memory and what you should be monitoring.

Memory utilization of the whole system - if need be, look at process working set sizes to see who’s the “culprit”. This will show you which process is using the most memory and is a good way to detect memory leaks. A good rule of thumb is to keep at least 10% of memory free, to prevent excessive paging, which massively hurts performance.

Page file usage % - if this is high, it means that you are regularly running out of memory and Windows is having to use the page files.

Memory leaks - when an application dynamically allocates memory, and does not free that memory when it is finished using it, that program has a memory leak. The memory is not being used by the application anymore, but it cannot be used by the system or any other program either. Memory leaks add up over time, and if they are not cleaned up, the system eventually runs out of memory.

How to monitor

Thresholds - when setting a threshold, a good place to start is 80% warning and 90% alarm. Remember, if you are seeing performance issues before hitting the threshold, then the threshold should be lowered. If the threshold is constantly breached, reset the value or look for a memory leak.
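As a minimal sketch of the starting thresholds above, the snippet below classifies a memory utilization reading and measures how often a set of readings breaches a threshold. The function names and the sample readings are hypothetical, chosen only for illustration; how you collect the readings is up to your monitoring tool.

```python
WARNING_PCT = 80.0  # starting warning threshold suggested in the text
ALARM_PCT = 90.0    # starting alarm threshold suggested in the text


def classify_memory(used_pct, warning=WARNING_PCT, alarm=ALARM_PCT):
    """Return 'ok', 'warning' or 'alarm' for a memory utilization percentage."""
    if not 0.0 <= used_pct <= 100.0:
        raise ValueError("used_pct must be between 0 and 100")
    if used_pct >= alarm:
        return "alarm"
    if used_pct >= warning:
        return "warning"
    return "ok"


def breach_fraction(samples, threshold=WARNING_PCT):
    """Fraction of samples at or above the threshold (0.0 if there are none)."""
    if not samples:
        return 0.0
    return sum(1 for s in samples if s >= threshold) / len(samples)


if __name__ == "__main__":
    readings = [70.0, 85.0, 95.0, 60.0]  # hypothetical utilization samples (%)
    for r in readings:
        print(r, classify_memory(r))
    print("fraction at or above warning:", breach_fraction(readings))
```

A breach fraction that stays high over a long period is the "constantly breached" case: either the threshold needs resetting or something such as a memory leak needs investigating.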

Memory Utilization report, example

Above is a good example of a memory leak: memory utilization slowly creeps up, drops when I restart the machine, and then starts to creep up again.
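The creep, restart, creep pattern described above can be checked for in collected data. The sketch below is one possible approach, not the method used to produce the report: it splits a utilization series at large drops (assumed to be restarts) and fits a least-squares slope to each run. The `drop_pct` and `min_slope` values are hypothetical tuning parameters.

```python
def split_at_restarts(values, drop_pct=20.0):
    """Split a series wherever it falls by more than drop_pct percentage points."""
    segments, current = [], []
    for value in values:
        if current and current[-1] - value > drop_pct:
            segments.append(current)
            current = []
        current.append(value)
    if current:
        segments.append(current)
    return segments


def slope_per_sample(values):
    """Least-squares slope of values against their sample index."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def looks_like_leak(values, min_slope=0.5, drop_pct=20.0):
    """True if every run between restarts (with 2+ samples) trends upward."""
    runs = [s for s in split_at_restarts(values, drop_pct) if len(s) >= 2]
    return bool(runs) and all(slope_per_sample(r) >= min_slope for r in runs)


if __name__ == "__main__":
    series = [40, 45, 50, 55, 60, 30, 35, 40, 45]  # hypothetical daily averages (%)
    print(split_at_restarts(series))
    print(looks_like_leak(series))
```

A steady upward slope is only a hint. Legitimate workload growth or cache warm-up can look similar, so the per-process working set sizes are still the place to confirm which process is responsible.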

I'll share some best practice recommendations for monitoring and managing memory on Friday.

Josh Worth

Consultant

Wednesday, 6 July 2016

Modeling Scenario (14 of 17) Capacity Management, Telling the Story

I have talked about bringing your KPIs, resource and business data into a CMIS and about using that data to produce reports in a clear, concise and understandable way.

Let’s now take a look at some analytical modeling examples, based on forecasts which were given to us by the business.

Below is an example of an Oracle box. We have been told by the business that we are going to grow at a steady rate of 10% per month for the next 12 months. We can model to see what the impact of that business growth will be on our Oracle system.

In the top left hand corner is our projected CPU utilization and on the far left of that graph is our baseline. You can see that over a few months we begin to go through our alarms and our thresholds pretty quickly.

Model – oracleq000 10% growth – server change

In the bottom left hand corner we can see where bottlenecks will be reached, shown by the large red bars which indicate CPU queuing.

On the top right graph we can see our projected device utilization for our busiest disk and we can see that within 4 to 5 months it is also breaching our alarms and thresholds.

Collectively these models are telling us that we are going to run into problems with CPU and I/O.
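How quickly 10% monthly growth uses up headroom can be checked with simple compounding. The sketch below is a back-of-the-envelope calculation, not the analytical model behind the graphs: it assumes utilization scales linearly with workload, and the 40% baseline and the 80%/90% thresholds are hypothetical values chosen for illustration.

```python
def project_utilization(baseline_pct, monthly_growth=0.10, months=12):
    """Utilization for month 0..months, compounding growth each month."""
    return [baseline_pct * (1 + monthly_growth) ** m for m in range(months + 1)]


def first_month_at_or_above(projection, threshold_pct):
    """Index of the first month at or above the threshold, or None."""
    for month, value in enumerate(projection):
        if value >= threshold_pct:
            return month
    return None


if __name__ == "__main__":
    projection = project_utilization(40.0)  # hypothetical 40% baseline
    print("warning (80%) reached in month", first_month_at_or_above(projection, 80.0))
    print("alarm (90%) reached in month", first_month_at_or_above(projection, 90.0))
    # Values above 100% simply mean the resource would be saturated.
    print("month 12 demand: %.1f%% of capacity" % projection[12])
```

Growth of 10% per month compounds to roughly 3.1 times the starting workload after 12 months, which is why thresholds are crossed sooner than a straight-line estimate would suggest.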

In the bottom right hand graph is our projected relative service level for this application. In this example we started the baseline off at 1 second; this is key.

By normalizing the baseline at 1 second it is very easy for your audience to see the effect that these changes are likely to have. In this case, once we’ve added the extra workload we can see that we go from 1 second to 1.5 seconds (a 50% increase) and then jump to almost 5 seconds. From 1 to 5 seconds is a huge increase and one that your audience can immediately grasp and understand the impact of.
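To illustrate why a normalized service level can climb so steeply, here is a small sketch using the textbook single-queue (M/M/1) approximation, where response time is proportional to 1 / (1 - utilization). This is a swapped-in simplification, not the modeling technique behind the graph, and the utilization figures are hypothetical.

```python
def relative_response(baseline_util, projected_util):
    """Response time relative to a baseline normalized to 1.

    Uses the M/M/1 approximation R = S / (1 - U), so the ratio is
    (1 - baseline_util) / (1 - projected_util). Utilizations are fractions.
    """
    for u in (baseline_util, projected_util):
        if not 0.0 <= u < 1.0:
            raise ValueError("utilization must be in the range [0, 1)")
    return (1.0 - baseline_util) / (1.0 - projected_util)


if __name__ == "__main__":
    baseline = 0.40  # hypothetical baseline utilization
    for projected in (0.40, 0.60, 0.88):
        ratio = relative_response(baseline, projected)
        print("utilization %.0f%% -> relative response %.2f" % (projected * 100, ratio))
```

Because the denominator shrinks as utilization approaches 100%, each further increment of workload costs more response time than the last, which is the kind of jump a normalized chart makes obvious.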

We would next want to show the model for change in our hardware and I'll be looking at this on Friday.

In the meantime why not join our Community and get access to a wealth of Capacity Management Resources http://www.metron-athene.com/_resources/

Charles Johnson

Principal Consultant

Wednesday, 29 June 2016

Unix Reports (11 of 17) Capacity Management, Telling the Story

As I mentioned on Monday, today and on Friday we'll look at a couple of example reports which have been created for a Linux box.

Below is a simple utilization report for a Linux box running an application. It covers one day and shows a couple of spikes where CPU utilization has breached our threshold.

Looking at this report we can see that these peaks take place during the middle of the day. Is that normal behavior? Do you know enough about the system or application that you are reporting on to make that judgment? Do we need to perform a root cause analysis? If we believe the peaks to be normal then maybe we need to adjust the threshold settings or if we are unsure then we need to carry out further investigation. Has some extra load been added to the system? Has there been a change? Are there any other anomalies that you need to find out about?
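One way to start answering "is that normal?" is to see which hours the breaches fall in and compare that across several days. The sketch below assumes samples are available as (hour, CPU %) pairs; the function name, the 80% threshold and the sample data are all hypothetical.

```python
def breaches_by_hour(samples, threshold_pct=80.0):
    """Count samples at or above the threshold, grouped by hour of day.

    samples is an iterable of (hour, cpu_pct) pairs.
    """
    counts = {}
    for hour, cpu_pct in samples:
        if cpu_pct >= threshold_pct:
            counts[hour] = counts.get(hour, 0) + 1
    return counts


if __name__ == "__main__":
    day = [(9, 45.0), (12, 85.0), (12, 91.0), (13, 82.0), (18, 50.0)]
    for hour, count in sorted(breaches_by_hour(day).items()):
        print("%02d:00 - %d breach(es)" % (hour, count))
```

If the same hours show up day after day, the peaks are probably part of normal behavior and the threshold may need adjusting. A breach at an unusual hour is a better candidate for root cause analysis.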

Remember, when reporting, don’t overcomplicate your reports.

Don’t put too much data on to one chart; it will look messy and be hard to understand.

On Friday I'll show you an example of a report on disk utilization of a Linux server.

In the meantime sign up to our Community and get access to white papers, on-demand webinars and more http://www.metron-athene.com/_resources/index.html

Charles Johnson

Principal Consultant