GridPP Monitoring News: 2014

Wednesday 21 May 2014

Icinga 2 Beta being released next week

Icinga has announced a feature freeze on the Icinga 2 Beta, with the release scheduled for next Tuesday: see here for more details and for information on how to try it out.

Wednesday 14 May 2014

Status Board and Glasgow

Last year about this time, Panic released Status Board for the iPad. At the time I was really impressed with it, but we weren’t really set up to make use of it. In the interim, we’ve refreshed our monitoring and in particular are now using Graphite. Over the past couple of days I’ve been working with a couple of types of status update:

Naemon/nagios statuses
Graphite graphing

For the Naemon status updates, I’m using a Python script that uses livestatus to grab the status of key probes. Some of these are site-level statuses (in the screen below, ATLAS CREAM and ATLAS SRMv2 are taken from the LCG SAM probe discussed a few weeks ago) and some are taken from looking at the cluster as a whole. In those cases (see, for example, LOADAVG) the board alerts if at least one probe goes critical - I can then go and investigate further. At the moment, I’ve also set it so that if an alarm is acknowledged, the board switches back to OK for that probe - for now I’m more concerned with things I have to worry about than with something that I understand, even though it might not be fully resolved.
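A minimal sketch of the kind of script described above, not the one actually running at Glasgow. The socket path, the column choice and the function names are assumptions for illustration; only the behaviour (critical if at least one probe is critical, acknowledged alarms count as OK) comes from the description.

```python
import json
import socket

# Assumption: location of the livestatus UNIX socket; this depends on the install.
LIVESTATUS_SOCKET = "/var/cache/naemon/live"


def probe_query(description):
    """Build a livestatus query for every service with this description."""
    return (
        "GET services\n"
        "Columns: host_name state acknowledged\n"
        "Filter: description = %s\n"
        "OutputFormat: json\n\n" % description
    )


def query_livestatus(query, socket_path=LIVESTATUS_SOCKET):
    """Send one query to livestatus and return the decoded JSON rows."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.connect(socket_path)
        sock.sendall(query.encode("utf-8"))
        sock.shutdown(socket.SHUT_WR)  # signal that the query is complete
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    finally:
        sock.close()
    return json.loads(b"".join(chunks).decode("utf-8"))


def board_status(rows):
    """Return CRITICAL if any unacknowledged probe is critical (state 2), else OK."""
    for _host, state, acknowledged in rows:
        if state == 2 and not acknowledged:
            return "CRITICAL"
    return "OK"
```

On a monitoring host this would be used as `board_status(query_livestatus(probe_query("LOADAVG")))`; the aggregation step can be tried without a socket by passing rows in by hand.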

The graphs take the raw JSON data from Graphite and, for the moment, do the simple thing of using the CSV formatting that Status Board uses by default. The next step is to tidy this up by using the custom formatting available - but even as it stands I think it works pretty well.
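As a rough illustration of that conversion, here is a sketch that turns Graphite's JSON render output (a list of series, each with a "target" and a list of [value, timestamp] datapoints) into CSV. The CSV layout is an assumption about what Status Board expects (a header row with a title followed by the series names, then one row per time label), and the function name is made up for this example.

```python
import csv
import io
from datetime import datetime, timezone


def graphite_to_csv(series_list, title="Graph"):
    """Convert Graphite JSON series into one CSV table, one row per timestamp."""
    targets = [series["target"] for series in series_list]

    # Group values by timestamp so that all series share a row per time step.
    by_time = {}
    for index, series in enumerate(series_list):
        for value, timestamp in series["datapoints"]:
            row = by_time.setdefault(timestamp, [""] * len(series_list))
            if value is not None:  # Graphite reports missing points as null
                row[index] = value

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([title] + targets)
    for timestamp in sorted(by_time):
        label = datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime("%H:%M")
        writer.writerow([label] + by_time[timestamp])
    return out.getvalue()
```

Missing datapoints are left as empty cells rather than zeros, so a gap in the data does not show up as a drop in the graph.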

There’s space left at one side for a to do list or something else - a project list, for example - once I work out the most useful way to work with that.

[Screenshot: Status Board (small)]

New version of Grafana

After Monitorama, a new version of Grafana is now available, version 1.5.4. Check here for details and to download.

Monitorama videos

I meant to watch the live stream from Monitorama a couple of weeks ago, but the time difference made that a bit more difficult. Fortunately, however, the individual videos are now becoming available. Check out the Monitorama Vimeo channel for more details: look for the “Monitorama PDX 2014” tag.

Wednesday 7 May 2014

Icinga turns 5

Icinga, the very popular fork of Nagios started in 2009, turned 5 yesterday. You can read more about it here - you can also find links from there to progress on Icinga 2 and Icinga Web 2.

Friday 25 April 2014

Monitorama

In a few days’ time the next Monitorama monitoring conference will take place in Portland - it’s running from May 5th-7th. It looks like there’s going to be a great set of talks and workshops - they’re planning to have these streamed at the time and available afterwards.

Monitoring Plugins

A quick check on the status of the monitoring-plugins and nagios-plugins packages shows that development is continuing on both - monitoring-plugins is currently in review to be added to EPEL.

Thursday 17 April 2014

LCGSAM Nagios Plugin

The guys at PIC have been working on a new Nagios plugin to read the LCG SAM data for your grid site. The Nagios Plugin LCG SAM uses a JSON feed to get up-to-date status information for your site: you set your VO (-v), profile (-p) and site (-s), and optionally the test flavour (-f). It uses jq, a command-line JSON processor, which looks very interesting in its own right.

At Glasgow we’ve been using this test for the last few weeks, splitting out the ATLAS probes into a CREAM-CE flavour and an SRMv2 flavour. It has been reliable and useful in showing our status in our nagios/naemon dashboard.
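To show the general shape of a check like this, here is a Python sketch of the command-line options and of the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). It is not the PIC plugin, which is written in bash around jq; the option names follow the flags mentioned above, while the function names and the status strings passed in are assumptions for illustration.

```python
import argparse

# Standard Nagios plugin exit codes.
NAGIOS_EXIT_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


def parse_args(argv):
    """Parse the VO, profile, site and optional flavour options."""
    parser = argparse.ArgumentParser(description="Sketch of an LCG SAM style check")
    parser.add_argument("-v", dest="vo", required=True, help="virtual organisation")
    parser.add_argument("-p", dest="profile", required=True, help="SAM profile")
    parser.add_argument("-s", dest="site", required=True, help="grid site name")
    parser.add_argument("-f", dest="flavour", default=None, help="optional test flavour")
    return parser.parse_args(argv)


def nagios_result(status, detail):
    """Map a status string to (exit code, output line); unrecognised means UNKNOWN."""
    label = status if status in NAGIOS_EXIT_CODES else "UNKNOWN"
    return NAGIOS_EXIT_CODES[label], "%s - %s" % (label, detail)
```

A real check would fetch the status from the JSON feed, print the output line and exit with the code, so that Nagios or Naemon picks up the state.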

All OK

New features in Graphite 0.10.0

Graphite has a new version under development, 0.10.0 - we’re currently running 0.9.12 from the EPEL repo. The full change list for the new version can be found here but I wanted to note a couple of the features:

Metrics can be reordered via the composer.
manage.py is no longer available. The alternative is to use the django-admin.py command provided by Django.

No word on the release date for this update has been given yet, but we’ll be testing it when it comes out.

Git and Naemon

We've started using git a lot more at Glasgow, and I've been working on using it to manage our new documentation. We've set up a number of remote git repositories to enable this. This has been working well for website development recently, and so I turned to the possibilities of using it elsewhere.

While developing our monitoring boxes, I wanted to make sure that a new server was configured as cleanly as possible (we're using puppet for new server configuration). I was looking for a way to efficiently keep nagios/naemon configs up to date across different machines, and it occurred to me that using a remote git repository was an ideal solution for this. After creating a new blank repository and copying in our current configuration, I could push and pull configs between machines easily - which is entirely the point of using git here, but it's nice to see it work. I'm now working to incorporate this into the build step so that a new machine pulls its config automatically.
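A sketch of what the automatic pull could look like as a small Python wrapper, under some assumptions: the config directory is already a git clone, the remote and branch names are the git defaults, and the paths are typical package locations rather than our actual layout. The config verification step with `naemon -v` is an addition for safety, not something described above.

```python
import subprocess


def sync_commands(config_dir="/etc/naemon", remote="origin", branch="master",
                  main_config="/etc/naemon/naemon.cfg"):
    """Return the commands that update the config checkout and then verify it."""
    return [
        # Fast-forward only, so local edits on the box are never silently merged.
        ["git", "-C", config_dir, "pull", "--ff-only", remote, branch],
        # Ask Naemon to verify the configuration before any reload.
        ["naemon", "-v", main_config],
    ]


def sync(commands):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.check_call(command)
```

Calling `sync(sync_commands())` from the build step would pull the latest config and fail loudly if either the pull or the verification goes wrong.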

We're now looking at further uses of git across the cluster (we have a number of fans at the site!)

Naemon

As I mentioned in the last post, we've started using Naemon at Glasgow. Naemon is a fork of Nagios 4.0, largely led by Andreas Ericsson, who worked on Nagios 4.0 - for more information on the origins of the package, see the project page for Naemon here. It comes packaged with Thruk as its frontend. While a young project (the first stable release, 0.8.0, was in February), it has already been working well for us and looks like it has a promising future.

Naemon and Thruk

At the last GridPP meeting in Pitlochry, I gave a talk [Indico link] where I mentioned the new Nagios 4.0 fork Naemon (more on that later). One of the screenshots I showed was of the great web frontend which comes with Naemon, called Thruk. I wanted to highlight a particular feature included in Thruk called Mine Map (screenshot below) which maps out every service and host in a grid showing the probe status in each case - everything here looks nice and green, which is pleasing.

The nice thing about this, of course, is that you can quickly pick out any problem areas (we're using the Exfoliation theme, by the way). However, this gets even better when you use the Display filters at the top of the screen. One of the problems with spotting issues when most of the tests are green is picking out small patches of criticals, particularly if you have to scroll through a large estate to find them.

Using the Display filters, however, you can choose to only show particular Services States - Critical, Warning, Pending or Unknown for example - and only show issues which have not been acknowledged. If I apply those settings now the display changes to something like the following:

Now we can see that we have two unacknowledged issues - a CVMFS problem on node296 and a problem with the services on svr008 (one of them is about to be rebuilt and one of them has just been rebuilt and so I'm keeping an eye on it).

So now I have a view of my Naemon/Nagios tests, with everything laid out in front of me and only the most urgent issues highlighted. You can also add extra filters, including grepping for particular hostnames, services or groups.
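The same kind of filter can be expressed directly as a livestatus query, which is handy for scripts. This is only a sketch of an equivalent query, not what Thruk does internally; the function name and the default choices are assumptions, and pending services (which livestatus marks through has_been_checked rather than state) are left out for brevity.

```python
# Livestatus service state codes for problem states.
STATE_CODES = {"warning": 1, "critical": 2, "unknown": 3}


def problem_query(states=("critical", "warning", "unknown"),
                  unacknowledged_only=True, host_pattern=None):
    """Build a livestatus query for services in the given states."""
    lines = ["GET services", "Columns: host_name description state acknowledged"]
    for name in states:
        lines.append("Filter: state = %d" % STATE_CODES[name])
    if len(states) > 1:
        # Combine the state filters above with a logical OR.
        lines.append("Or: %d" % len(states))
    if unacknowledged_only:
        lines.append("Filter: acknowledged = 0")
    if host_pattern:
        # "~" is a regular expression match on the host name.
        lines.append("Filter: host_name ~ %s" % host_pattern)
    lines.append("OutputFormat: json")
    return "\n".join(lines) + "\n\n"
```

For example, `problem_query(("critical",), host_pattern="^node")` asks only for unacknowledged criticals on hosts whose names start with "node".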
