Not too long ago, ServerChoice underwent a significant upgrade of our monitoring systems. Now that the dust has settled and everything is bedded in, it seemed like a good time for me to write a blog post that explains a bit about it.
What is it?
For the uninitiated, a monitoring system is what we use to keep track of our own and our customers’ infrastructure. We need to know what’s going on at all times and have alerts so we can jump on things before they turn into problems. It’s all about being proactive: the more you know, the better you can be.
I researched a lot of different monitoring systems: what they do, how they work, etc, and I discovered that the one thing you really need is a plan. As the old saying goes, “If you have a plan, you have the world”, and to safeguard our Monitoring World, there are a few things you need to account for:
- What do you want to monitor: servers, networks, components etc?
- What services do you want to monitor: ping, CPU load, memory etc?
- How do they need to be monitored: plugins, SNMP etc?
- Centralised or distributed monitoring?
- What sort of alerting do you need?
- Do you want to capture performance metrics?
Once you have these questions answered you can then begin to look at the myriad of monitoring solutions on the market. There are all sorts out there, from free GNU-based solutions to paid-for proprietary software suites. The key issue I found was that there is no one-size-fits-all solution (oh, if only), so you’ll have to read the specs and get creative.
I've personally looked into using Nagios, Zenoss, Zabbix etc and found that each one has its strengths and weaknesses. I decided to use Nagios (actually a distro based on Nagios) as it's pretty much the industry standard, has tons and tons of plugins, and lots of community support.
When setting up monitoring on a host, it's always a good idea to have a customised baseline for your alerting: you need boundaries to know what’s normal and what’s not, and this will differ between machines. For example: if you want to monitor memory usage on a Windows Server 2012 box running SQL Server, it will generally use lots of memory. This is the nature of the software and so it needs a high baseline, perhaps 80% for warnings and 90% for alerts. For servers running different software, a lower baseline would be more appropriate. Then see how it runs for a couple of days and, if the alerts keep popping up, tweak accordingly. What you don’t want is to be constantly alerted and waste time trying to solve a problem that doesn't really exist.
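To make the idea of per-host baselines a bit more concrete, here is a minimal Python sketch. It is not how our Nagios-based setup is actually configured; the host name `sql-01`, the default thresholds and the `check_memory` function are all made up for illustration. The return values follow the usual Nagios plugin convention of 0 for OK, 1 for WARNING and 2 for CRITICAL.

```python
from dataclasses import dataclass

# Nagios plugin exit-code convention: 0 = OK, 1 = WARNING, 2 = CRITICAL
OK, WARNING, CRITICAL = 0, 1, 2


@dataclass(frozen=True)
class Thresholds:
    warning: float   # memory used, in percent
    critical: float  # memory used, in percent


# Assumption: a lower default baseline for "ordinary" servers.
DEFAULT = Thresholds(warning=70.0, critical=85.0)

# Assumption: hypothetical host name; the SQL box gets the higher
# 80% / 90% baseline described above.
BASELINES = {
    "sql-01": Thresholds(warning=80.0, critical=90.0),
}


def check_memory(host: str, used_percent: float) -> int:
    """Classify a memory reading against the host's own baseline."""
    t = BASELINES.get(host, DEFAULT)
    if used_percent >= t.critical:
        return CRITICAL
    if used_percent >= t.warning:
        return WARNING
    return OK
```

With these example numbers, a reading of 75% is nothing to worry about on `sql-01` but would raise a warning on any host that falls back to the default baseline.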
Another way of dealing with these false positives is to set timeframes on your alerts. If a server peaks at 95% CPU every now and then, give the alert a time limit of 5 minutes. If the CPU is still at 95% after that, then get the monitoring to send out the alert notification. You will find that you need to do this across your entire server estate, which takes time to get right but is well worth the hassle in the long run. The more effort you put in at the beginning, the more you save in the end: when you do get an alert, you know it's a real problem.
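The time-limit idea can be sketched in a few lines of Python. Again, this is only an illustration under my own assumptions: the `should_notify` function and its parameters are hypothetical, and in a Nagios-style system you would normally get this behaviour from the check retry settings rather than from custom code. The sketch assumes the samples arrive in chronological order as (timestamp in seconds, CPU percent) pairs.

```python
from typing import Iterable, Tuple


def should_notify(
    samples: Iterable[Tuple[float, float]],
    threshold: float = 95.0,
    hold_seconds: float = 300.0,
) -> bool:
    """Return True only if CPU has stayed at or above the threshold
    for at least hold_seconds up to the most recent sample."""
    breach_start = None  # timestamp at which the current breach began
    last_ts = None       # timestamp of the most recent sample seen

    for ts, cpu_percent in samples:
        if cpu_percent >= threshold:
            if breach_start is None:
                breach_start = ts
        else:
            # Any reading below the threshold resets the clock.
            breach_start = None
        last_ts = ts

    if breach_start is None or last_ts is None:
        return False
    return last_ts - breach_start >= hold_seconds
```

A short spike that drops back below 95% never triggers a notification, whereas six one-minute samples that all sit at or above 95% across a full five minutes would.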
In at the deep end
My parting advice is not to try to jump in at the deep end. Don’t fall into the trap of getting bogged down with excess monitoring on individual hosts. Instead, start with the basics and work your way up from there – otherwise you’ll have stats coming out your ears and not know what to do with them.