Collecting metrics

A few startups I worked at had a similar story. When they got started, they didn't have any metric collection (maybe some system metrics from their cloud provider, but nothing more). After a few debugging sessions where metrics would have helped, they decided to start collecting metrics from the application. Since they were a small team with neither the experience nor the manpower to set up the needed infrastructure, they decided to use a SaaS product (NewRelic and Datadog are both good choices here). Then, as the company grew and the number of users, processes, components and instances grew, so did the bill from that SaaS. This is usually when they decide they need a DevOps person on the team (not just because of the metrics issue, but because the company is maturing: the customer base grows, uptime requirements increase, scaling becomes an issue, etc.). What follows is my advice on such undertakings.

First of all, use StatsD. Not specifically the StatsD daemon, but the protocol. It's mature, flexible enough for most tasks, has client libraries for pretty much every language and is supported by most metrics collection software. You can even use netcat to push metrics from shell scripts.
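To make the protocol concrete, here is a minimal sketch in Python using only the standard library. A metric is a single line of plain text in the form name:value|type, sent as one UDP datagram. The helper names (format_metric, send_metric), the metric names and the 127.0.0.1:8125 default are assumptions for illustration, not the API of any particular client library.

```python
import socket


def format_metric(name, value, metric_type):
    """Build one StatsD line, e.g. b'api.requests:1|c'.

    Common types: 'c' (counter), 'g' (gauge), 'ms' (timer).
    """
    return f"{name}:{value}|{metric_type}".encode("ascii")


def send_metric(name, value, metric_type, host="127.0.0.1", port=8125):
    """Send a single metric as one UDP datagram and return the payload."""
    payload = format_metric(name, value, metric_type)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
    return payload


if __name__ == "__main__":
    # Hypothetical metric names, assuming a collector on localhost:8125.
    send_metric("api.requests", 1, "c")       # count one request
    send_metric("queue.depth", 42, "g")       # current queue depth
    send_metric("db.query_time", 12.5, "ms")  # a 12.5 ms timing
```

That is the whole wire format for the basic metric types, which is why a shell script with netcat can do the same thing.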

You don't have to set up storage for the metrics you collect: CloudWatch, Librato or similar services can be used and are cheaper than the more integrated SaaS offerings. If you already have Elasticsearch for log collection, you can use that for storing metrics as well (great for a few dozen instances). InfluxDB and Gnocchi support the StatsD protocol. As you can see, it's a flexible solution.

I think that most people associate the StatsD protocol with the Graphite protocol and software, but there is a key difference: StatsD uses UDP. It's faster and there's less overhead. You can default to sending the metrics to localhost, and if nothing collects them that's still fine (great for local development and CI jobs where you don't want to run metrics collection software or where it doesn't make sense). I know that some would say that TCP is more reliable than UDP and that you can lose metrics using it. To that I would say that the lower overhead of UDP and the lack of a connection are actually advantages. There are no connection-closed errors, and when the system is under heavy load a single packet is more likely to reach its destination than a new TCP connection is to be opened successfully. You can read this GitHub blog post to see how they dealt with their metric loss.
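The "send to localhost and don't care whether anything is listening" behaviour can be sketched like this. FireAndForgetStatsd is a hypothetical class written for this post, not a real library; it assumes an unconnected, non-blocking UDP socket and swallows any send error so that metrics can never take the application down.

```python
import socket


class FireAndForgetStatsd:
    """Hypothetical minimal client: one unconnected UDP socket, errors swallowed."""

    def __init__(self, host="127.0.0.1", port=8125):
        self._addr = (host, port)
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        # Never block the application while sending metrics.
        self._sock.setblocking(False)

    def incr(self, name, count=1):
        """Increment a counter; return True if the datagram was handed to the OS."""
        try:
            self._sock.sendto(f"{name}:{count}|c".encode("ascii"), self._addr)
            return True
        except OSError:
            # A full buffer or a network error only costs us one data point.
            return False

    def close(self):
        self._sock.close()


if __name__ == "__main__":
    client = FireAndForgetStatsd()  # no collector needs to be running
    for _ in range(3):
        client.incr("ci.test_runs")  # hypothetical metric name
    client.close()
```

There is no connect step and no retry logic, so the same code runs unchanged on a laptop, in CI and in production. The trade-off is the one mentioned above: a dropped datagram is silently lost.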

I know that Prometheus is the new hotness and very popular when running Kubernetes, and it's not a bad choice. But there are a few things you need to keep in mind when considering it. It works best when you have service discovery available for it (if you're already using Kubernetes or running in a cloud provider that's not an issue, but that's not everybody). Also, for short-lived tasks (like cron jobs or processes that pull from a job queue), for processes that can't open a listening socket, or if you have many of the same process running on the same host, you need to set up a push gateway. The process pushes metrics to the gateway and Prometheus collects them afterwards (sounds an awful lot like the StatsD daemon, doesn't it?). Prometheus has a StatsD exporter, so you can still use StatsD along with Prometheus.

If you want some ad-hoc or more lightweight metrics collection, statsd-vis is a good solution. It holds the data in memory for a configurable time and has a built-in web UI for viewing the graphs. I have a very small Docker image for that.

Lastly, for Python applications I recommend the Markus library. It's easy to use and reliable, and you can add more metrics as needed. Two thumbs up.