Monitoring shore.co.il


Recently I had some time to work on a project that had been on my to-do list for a long time: monitoring the services in shore.co.il. The project is now done and is available in my GitLab instance.

Requirements

When I write monitoring, I mean periodic checks on services and alerts if they fail. I had a specific set of requirements in mind for this project. I wanted the monitoring to be reliable, meaning that if anything, or even everything, in my infrastructure failed, I would still get alerts. This was critical for me since I run a lot of my infrastructure at home, and a prolonged internet or power outage would bring down many services. Cheap and easy would also be nice.

Architecture

I decided on using Lambda functions along with SMS notifications from SNS on AWS. Lambda functions can be reliably triggered on a schedule (every x minutes) using CloudWatch Events, and failures can be published to an SNS topic with a subscription that sends SMS messages to my cellphone. So far it's very reliable, with no dependency on anything in my infrastructure. For added reliability, I added CloudWatch alarms in case a function has not been invoked recently or an invocation failed. Those alarms also send me an SMS message. SMS messages cost a little (hopefully there will be few of those), I don't have enough Lambda invocations or runtime to go over the free tier, and the cost of storing the code in S3 is small as well. For me, it was easier, cheaper and more reliable than setting up Nagios, Sensu or similar.
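
To make the flow concrete, here is a minimal sketch of what such a Lambda handler could look like. It is not the code from the project. The names run_check, publish_failure and the TOPIC_ARN environment variable are made up for illustration, and returning a result instead of raising on a failed check is my assumption about how to keep check failures apart from invocation failures.

```python
import os


def run_check():
    """Hypothetical placeholder; a real check probes a service and raises on failure."""
    raise NotImplementedError("replace with a real check")


def publish_failure(message, topic_arn, client=None):
    """Publish a failure message to an SNS topic (which fans out to SMS)."""
    if client is None:
        # boto3 ships with the AWS Lambda Python runtime; importing it lazily
        # lets this module be loaded elsewhere without it.
        import boto3
        client = boto3.client("sns")
    client.publish(TopicArn=topic_arn, Subject="Check failed", Message=message)


def handler(event, context, check=run_check, publish=publish_failure):
    """Lambda entry point: run one check and report a failure through SNS.

    TOPIC_ARN is an assumed environment variable name, not taken from the project.
    """
    topic_arn = os.environ.get("TOPIC_ARN", "")
    try:
        check()
    except Exception as exc:
        # The check failed: send the alert and finish normally, so the
        # invocation itself still counts as successful.
        publish(f"{check.__name__} failed: {exc!r}", topic_arn)
        return {"status": "failed", "error": repr(exc)}
    return {"status": "ok"}
```

In this sketch, if publishing to SNS itself raises, the exception propagates and the invocation fails, which is the case an alarm on invocation errors would catch.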

Solution

I wrote a few Python functions to test the different services I run (DNS, SMTP, IMAP, SSH, different web services). To deploy them I wrote a Terraform module that does everything, from creating the SNS topic to uploading the Python code and hooking up the Lambda functions. Everything is run inside a GitLab CI pipeline and uses the GitLab remote Terraform state (I recently had reason to try it out and I was impressed).
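
As an illustration of the kind of check involved, here is a rough sketch (again, not the actual project code) of a function that connects to a TCP service and verifies its greeting. The function name check_banner and its parameters are hypothetical. It relies on the fact that SMTP servers greet with "220", SSH servers with "SSH-2.0" and IMAP servers with "* OK".

```python
import socket


def check_banner(host, port, expected_prefix, timeout=10.0):
    """Connect to host:port and verify that the greeting starts with expected_prefix.

    Returns the bytes received; raises on connection errors, timeouts
    or an unexpected greeting.
    """
    data = b""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # Keep reading until there is enough data to compare, or the peer closes.
        while len(data) < len(expected_prefix):
            chunk = sock.recv(1024)
            if not chunk:
                break
            data += chunk
    if not data.startswith(expected_prefix):
        raise RuntimeError(
            f"{host}:{port} sent {data[:40]!r}, expected prefix {expected_prefix!r}"
        )
    return data
```

A check for a mail server could then be as short as check_banner("smtp.example.com", 25, b"220"), where the host name is a placeholder. Checks for services where the client speaks first, such as DNS or HTTPS, need a different approach.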

Conclusions

I don't think I would set up this specific solution for a company. A company would most likely have an on-call schedule, and a SaaS product might be easier and better in some aspects (like running checks from multiple locations). That said, the project can be adapted to use a service like PagerDuty for an on-call schedule, and it can be deployed to multiple regions to run the checks from each of them. Lastly, Nagios and Sensu have libraries of ready-made checks in Ruby or Perl, so you don't have to write them yourself.

For my small infrastructure and considerations, though, it works well. The project has been live for more than a week now and has been reliable. The AWS Cost Explorer predicts that the cost for this month will be a few dollars. I call it a success.