For a monitoring service, uptime and reliability are, of course, critical features: customers are placing trust in the service to detect problems and deliver timely and accurate alerts. While I cannot guarantee that Healthchecks.io will absolutely never let you down, I can offer transparency on how it is currently being hosted and operated.
I will use bullet point lists liberally in this blog post (otherwise, it could turn into “The Opinionated and Frugal SRE Book”). Let’s go:
- Main values: Simple is good. Efficient is good. Less is more.
- The core infrastructure runs on Hetzner bare metal machines. Hetzner offers amazing value for money and is a big part of the reason why Healthchecks.io can offer its current pricing.
- No containers, no auto-scaling, no “serverless”. Plain old servers, each dedicated to a single role: “load balancer”, “application server” and “database server”.
- The machines are closer to “pets” than “cattle”: I have provisioning scripts to set up new ones relatively quickly, but in practice the working set of machines changes rarely. For example, the primary database server currently has an uptime of 375 days.
Load balancers:
- Hardware: Two Intel 9900K machines, costing €69/month each.
- Software: HAProxy.
- Fault tolerance: A watchdog script on an external VM detects unhealthy load balancers and removes their DNS records.
- Supports HTTP/2. Supports IPv6. Uses ECDSA certificates with modern clients and falls back to RSA certificates with old clients.
- Used Loader.io to test performance-related configuration changes.
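The watchdog script itself is not shown in this post. As a rough sketch of the idea only, the decision logic could look something like the following. Everything here is an assumption for illustration: the names `http_probe` and `plan_removals`, the three-strikes threshold, and the rule that nothing is removed when every balancer looks down (in that case the watchdog's own network is the more likely culprit). The actual DNS API call is left out because the provider is not named.

```python
import http.client
import urllib.request
from typing import Callable, Dict, List


def http_probe(host: str, timeout: float = 5.0) -> bool:
    """Return True if `host` answers a plain HTTP GET with status 200.

    Hypothetical probe: the URL and the success criterion are assumptions.
    """
    try:
        with urllib.request.urlopen(f"http://{host}/", timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection errors, timeouts and HTTP error statuses all count as unhealthy.
        return False


def plan_removals(
    records: List[str],
    is_healthy: Callable[[str], bool],
    failures: Dict[str, int],
    threshold: int = 3,
) -> List[str]:
    """Return the addresses whose DNS records should be removed.

    `failures` keeps consecutive failure counts between runs, so a single
    dropped probe does not pull a balancer out of DNS.
    """
    for ip in records:
        failures[ip] = 0 if is_healthy(ip) else failures.get(ip, 0) + 1

    unhealthy = [ip for ip in records if failures[ip] >= threshold]

    # Safety valve (an assumption, not a documented behavior): never remove
    # every record at once.
    if len(unhealthy) == len(records):
        return []
    return unhealthy
```

A caller would run `plan_removals` on a timer, pass `http_probe` (or a stricter check) as `is_healthy`, and hand the returned addresses to whatever DNS API is in use.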
Before switching to HAProxy, I used Cloudflare Load Balancer. Compared to Cloudflare, the self-hosted setup differs as follows:
- The monthly running cost ends up being similar, if slightly lower.
- It took more work to research, set up, and maintain the dedicated load balancer servers.
- HTTP request latencies seen from clients are much more consistent. Cloudflare was great on some days, and not so great on others.
- More control over the load balancer configuration, and more options to diagnose problems.
Application servers:
- Hardware: Three Intel 6700/7700 machines, costing €39/month each.
- Software: Nginx -> uWSGI -> Django.
- Fault tolerance: load balancers detect unhealthy or “in maintenance mode” machines and remove them from rotation.
- For daemon processes that need to be constantly running (“sendalerts”, “sendreports”) I use systemd services.
- Communication with load balancers is over a private network (Hetzner’s vSwitch).
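One common way to implement an “in maintenance mode” signal is a health endpoint that the load balancer polls. HAProxy's HTTP health checks treat 2xx and 3xx responses as healthy, so returning 503 is enough to take a machine out of rotation. The sketch below is a standalone WSGI app, not the actual Healthchecks.io code; the flag file path and the name `health_app` are hypothetical.

```python
import os

# Hypothetical flag file: create it to drain the machine, delete it to rejoin.
MAINTENANCE_FLAG = "/etc/maintenance-mode"


def health_app(environ, start_response, flag_path=MAINTENANCE_FLAG):
    """Minimal WSGI health endpoint: 503 while the flag file exists, 200 otherwise."""
    if os.path.exists(flag_path):
        status, body = "503 Service Unavailable", b"maintenance\n"
    else:
        status, body = "200 OK", b"ok\n"

    headers = [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ]
    start_response(status, headers)
    return [body]
```

With this approach, running `touch` on the flag file before a deploy lets in-flight requests finish while the load balancer stops sending new ones.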
Side note: when I started out, I took the private network idea to the extreme: I asked Hetzner support to install two NICs in each machine, and to connect the secondary NICs using a dedicated switch. A custom order like that takes more time (days instead of minutes) and costs more, but is possible. I later moved away from that setup, for a couple of reasons:
- All servers had to be in the same rack, so I could not add more machines if the rack was full.
- Also, with this setup, one critical switch failure could take out all of the servers.
Database servers:
- Hardware: Two Intel Xeon E3-1275 machines, ECC RAM, NVMe storage, costing €59/month each.
- Software: PostgreSQL 10.
- Primary/hot standby setup.
- Fault tolerance: I am not brave enough to completely automate database failover – I have a tested procedure to perform the failover, but the decision has to be made manually.
- Application servers specify multiple hosts in the database connection strings. When the primary changes, the applications can fail over to the new primary without any additional configuration changes.
- Backups: Full daily backups encrypted with GPG and uploaded to S3. I keep two months' worth of backups and delete older ones.
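The multi-host connection string relies on a libpq feature available since PostgreSQL 10: `host` accepts a comma-separated list, and `target_session_attrs=read-write` makes the client skip any server that is not accepting writes (such as a hot standby). As a sketch of how that might look in Django settings, assuming the psycopg2 driver passes these values through to libpq; the host names, the helper name `database_settings`, and the use of `target_session_attrs` are illustrative assumptions, not the site's actual configuration:

```python
from typing import Dict, List


def database_settings(hosts: List[str], name: str, user: str, password: str) -> Dict:
    """Build a Django DATABASES dict that lists every database host.

    libpq tries the hosts in order and, with target_session_attrs=read-write,
    only settles on one that accepts writes, i.e. the current primary.
    """
    return {
        "default": {
            "ENGINE": "django.db.backends.postgresql",
            "NAME": name,
            "USER": user,
            "PASSWORD": password,
            # Comma-separated list of hosts, handed to libpq as-is.
            "HOST": ",".join(hosts),
            # A single port applies to all hosts.
            "PORT": "5432",
            "OPTIONS": {"target_session_attrs": "read-write"},
        }
    }


# Hypothetical host names for illustration.
DATABASES = database_settings(
    ["db1.example.org", "db2.example.org"], "hc", "hc", "change-me"
)
```

After a manual failover promotes the standby, new connections land on it without touching the application configuration, because the old primary either refuses connections or no longer accepts writes.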
- OpsDash agent and hosted dashboard for a “big picture” view of the servers, and alerting.
- Netdata agent on each machine for investigating specific issues.
- Four VMs in different locations sending regular pings, and logging everything, including TCP packet captures.
- Logs from the VMs aggregated to Papertrail. Papertrail sends alerts on specific log events.
- An always-on laptop dedicated to showing live Papertrail logs.
- Yubikey for signing commits, logging into servers and decrypting database backups. For initial setup, I used DrDuh’s guide.
- A small laptop with development and deployment environments, and another Yubikey. Travels with me when I leave home.
- Fabric scripts for server provisioning, updates and maintenance.
- Operated by one person. I’m not in front of my PC 100% of the time, so incidents can take time to fix.
- For serving the ping endpoints (hc-ping.com, hchk.io) efficiently, I wrote a small Go application, which I have not open sourced.
Last but not least, I am eating my own dog food and am monitoring many of the periodic and maintenance tasks using Healthchecks.io itself. For example, if the load balancer watchdog silently fails, I might not notice until much later when it fails to do its duty. With Healthchecks.io monitoring, if the watchdog runs into issues (the script crashes, VM loses network connectivity, anything) I receive an email and an SMS alert within minutes.
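As an illustration of that pattern, a periodic task can report to its check by requesting the check's ping URL on success and the same URL with `/fail` appended on failure. If the task never runs at all (crashed host, lost network), no ping arrives and the check goes down on its own. The sketch below is a generic wrapper, not the actual watchdog code; the UUID is a placeholder and the names `ping` and `run_monitored` are hypothetical.

```python
import urllib.request
from typing import Callable

# Placeholder: substitute the ping URL of a real check.
PING_URL = "https://hc-ping.com/your-uuid-here"


def ping(url: str, timeout: float = 10.0) -> bool:
    """Send a ping; never let a monitoring hiccup break the job itself."""
    try:
        urllib.request.urlopen(url, timeout=timeout).close()
        return True
    except OSError:
        return False


def run_monitored(job: Callable[[], None], ping_url: str,
                  send: Callable[[str], object] = ping) -> None:
    """Run `job`, then signal success or failure to the check at `ping_url`."""
    try:
        job()
    except Exception:
        # Explicit failure signal, so the alert does not wait for the grace period.
        send(ping_url + "/fail")
        raise
    send(ping_url)
```

Calling `run_monitored(check_load_balancers, PING_URL)` from cron, where `check_load_balancers` stands for whatever the task does, is enough to cover both loud failures (exceptions) and silent ones (the task stops running).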
Thanks for reading,
– Pēteris, Healthchecks.io