Although Nagios has been useful for identifying some problems with my computer network, it keeps producing false positives.
Those false positives come from NRPE and, to a lesser extent, its Windows equivalent. These are the daemons/services that run on the machines being checked, and they supposedly offer more detailed checks for developing problems.
One day, Nagios stopped talking to my Linux server. NRPE was running on it. The usual commands (ping, SSH, etc.) all worked. I could call NRPE by hand from the command line on my Nagios system, and it returned the expected results. Yet Nagios kept reporting otherwise. I never worked out the problem, because every test I could think of worked. In the end, I decided detailed problem detection on that server just wasn’t worth the hassle (it’s a build server, so it’ll be pretty obvious when it’s borked), so I deleted the checks.
I’ve been told that Mac services which were running had stopped. This is particularly annoying because many Mac services are so brittle they’re liable to fall over if next door’s dog sneezes: I really do want to know when they stop. So, of course, a Mac service that completely borked its host was not mentioned until the host went down. I’ve also had false positives from Windows machines. I’ve deleted the lot.
I told Nagios to report errors only once every couple of hours. I’m a home user; no one depends on my systems working perfectly. Yet, despite this, when an error occurs, I get an alert every three minutes. This is probably me getting the configuration wrong, but given that I’ve replaced 5 with 120 in every field I can find (or 360 in some cases), the fact that I still get a warning every three minutes is not encouraging.
Some of the false positives might be configuration errors. A lot are not; they’re problems with the product itself (such as sending my Linux server to Coventry). So I’ve dropped NRPE and stripped Nagios back to the core, using only the basic checks that come with it. They’ve reported just two problems, both of which were real. If this stabilises, fine. If not, I’ll go and find an alternative product.