Healthchecks.io Hosting, Questions and Answers

The article Healthchecks.io Hosting Setup, 2022 Edition was recently on Hacker News. I was answering questions in the comments section. Here’s a recap of some of the questions and my answers. I’ve edited some of the questions for clarity.

Q: Can you share more details about what the 4 HAProxy servers are doing?

  • The traffic from the monitored systems comes in spikes. Looking at netdata graphs, the baseline is currently 600 requests/s, but there is a 2000 requests/s spike every minute, and a 4000 requests/s spike every 10 minutes.
  • I want to maintain redundancy and capacity even when a load balancer is removed from DNS rotation (due to network problems, or for an upgrade).
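
As a rough illustration of the capacity argument, here is a small sketch that divides the traffic levels quoted above across the load balancers. It assumes DNS rotation spreads requests evenly, which is a simplification, and the function name is made up for this example.

```python
def per_server_rate(total_rps: float, servers_in_rotation: int) -> float:
    """Requests/s each load balancer sees, assuming an even split (a simplification)."""
    if servers_in_rotation < 1:
        raise ValueError("at least one server must be in rotation")
    return total_rps / servers_in_rotation


# Traffic levels quoted in the answer above, in requests per second.
traffic_levels = [("baseline", 600), ("every minute", 2000), ("every 10 minutes", 4000)]

for label, rps in traffic_levels:
    all_four = per_server_rate(rps, 4)  # all load balancers in rotation
    one_out = per_server_rate(rps, 3)   # one removed from DNS rotation
    print(f"{label}: {all_four:.0f} req/s each with 4 servers, {one_out:.0f} with 3")
```

With one server out of rotation, each remaining server takes a third of the peak instead of a quarter, and there is still redundancy left if a second one has problems.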

There are spare resources on the servers, especially RAM, and I could pack more things on fewer hosts. But, with Hetzner prices, why bother? 🙂

Q: Why Braintree, not Stripe?

When I started, Stripe was not yet available in my country, Latvia (it now is).

Personally I’ve had good experience with Braintree. Particularly their support has been impressively good – they take time to respond, but you can tell the support agents have deep knowledge of their system, they have access to tools to troubleshoot problems, and they don’t hesitate to escalate to engineering.

Q: I’d like to hear more on your usage of SSLMate and sops.

SSLMate is a certificate reseller with a convenient (for me) interface – a CLI program. It’s no fun copy-pasting certificates from email attachments.

I’m using both RSA and ECDSA certificates (RSA for compatibility with old clients, ECDSA for efficiency). I’m not sure, but it looks like ECDSA is not yet generally available from Let’s Encrypt.

On sops: the secrets (passwords, API keys, access tokens) are sitting in an encrypted file (“vault”). When a Fabric task needs secrets to fill in a configuration file template, it calls sops to decrypt the vault. My Yubikey starts flashing, I tap the key, the Fabric task receives the secrets and can continue.
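
A minimal sketch of that flow, not the actual Fabric task: it assumes the vault is a JSON file and that the templates use `$NAME` placeholders. The function names, the file name and the `DB_PASSWORD` key are all hypothetical. The only external call is `sops -d`, which prints the decrypted file to stdout.

```python
import json
import subprocess
from string import Template
from typing import Callable, Dict


def sops_decrypt(path: str) -> str:
    """Run `sops -d` on the vault and return the decrypted text.

    This is the step where a hardware key may ask for a touch.
    """
    result = subprocess.run(
        ["sops", "-d", path], check=True, capture_output=True, text=True
    )
    return result.stdout


def load_secrets(path: str, decrypt: Callable[[str], str] = sops_decrypt) -> Dict[str, str]:
    """Decrypt the vault and parse it. Assumes the vault is JSON."""
    return json.loads(decrypt(path))


def render_config(template_text: str, secrets: Dict[str, str]) -> str:
    """Fill $NAME placeholders; raises KeyError if a secret is missing."""
    return Template(template_text).substitute(secrets)


if __name__ == "__main__":
    # Stand-in for sops so the example runs without a real vault.
    fake_decrypt = lambda path: '{"DB_PASSWORD": "example"}'
    secrets = load_secrets("vault.json", decrypt=fake_decrypt)
    print(render_config("password = $DB_PASSWORD", secrets))
```

Passing the decrypt step in as a parameter keeps the template-filling logic testable without touching the real vault or the Yubikey.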

Q: I would love to hear more detail on how WireGuard is set up.

I use vanilla WireGuard (the wg command and the wg-quick service). I set up new hosts and update peer configuration using Fabric tasks. It may sound messy, but it works fine in practice. For example, to set up WireGuard on a new host:

  • On the new host, I run a Fabric task which generates a key pair and spits out the public key. The private key never leaves the server.
  • I paste the public key in a peer configuration template.
  • On every host that must be able to contact the new host, I run another Fabric task which updates the peer configuration from the template (wg syncconf).
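
The peer configuration template could look something like the sketch below. This is an illustration under assumptions, not the real template: the `Peer` fields, host name, key and addresses are placeholders. Only the `[Peer]` section keys (`PublicKey`, `AllowedIPs`, `Endpoint`) are standard WireGuard configuration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Peer:
    name: str         # hypothetical label, written out as a comment only
    public_key: str   # the public key printed by the key generation task
    allowed_ips: str  # e.g. the peer's address inside the tunnel
    endpoint: str     # public host:port of the peer


def render_peers(peers: List[Peer]) -> str:
    """Render one [Peer] section per host, in WireGuard config syntax."""
    sections = []
    for peer in peers:
        sections.append(
            f"# {peer.name}\n"
            "[Peer]\n"
            f"PublicKey = {peer.public_key}\n"
            f"AllowedIPs = {peer.allowed_ips}\n"
            f"Endpoint = {peer.endpoint}\n"
        )
    return "\n".join(sections)


if __name__ == "__main__":
    # Placeholder values; a real public key is a base64 string.
    new_host = Peer("new-host", "PASTE_PUBLIC_KEY_HERE", "10.0.0.5/32", "192.0.2.10:51820")
    print(render_peers([new_host]))
```

The rendered text would be combined with each host's own interface settings, written to a file, and applied with wg syncconf, which updates the running interface to match the file.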

One thing to watch out for is any service that binds to the WireGuard network interface. I had to make sure that, on reboot, such services start after wg-quick.

Q: I am curious how sites like this handle scheduled tasks that have to run at high frequencies? Cron on one machine? Celery beat?

Healthchecks runs a loop of

10 send any due notifications
20 SLEEP 2
30 GOTO 10

The actual loop is of course a little more complicated, and is being run concurrently on several machines.
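
For readers who prefer Python to BASIC, here is a sketch of the same idea. It is not the real Healthchecks code: `send_due` is a hypothetical callable that sends whatever is due and returns how many notifications it sent, and `max_iterations` exists only so the example can terminate.

```python
import time
from typing import Callable, Optional


def run_loop(
    send_due: Callable[[], int],
    sleep: Callable[[float], None] = time.sleep,
    interval: float = 2.0,
    max_iterations: Optional[int] = None,
) -> int:
    """Send due notifications, sleep, repeat. Returns the total number sent."""
    sent_total = 0
    iterations = 0
    # max_iterations=None means "run forever", like GOTO 10.
    while max_iterations is None or iterations < max_iterations:
        sent_total += send_due()  # 10 send any due notifications
        sleep(interval)           # 20 SLEEP 2
        iterations += 1           # 30 GOTO 10
    return sent_total


if __name__ == "__main__":
    # Fake sender that pretends one notification was due on each pass.
    total = run_loop(lambda: 1, interval=0.01, max_iterations=3)
    print(f"sent {total} notifications")
```

When several machines run a loop like this at once, each due notification has to be claimed by exactly one of them. One common way to do that, which is an assumption here and not something stated above, is a conditional database update that only one worker can win.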

Q: How did you go about implementing the integrations (email, Signal, Discord…)?

I started with just the email integration, and added other integration types over time, one by one. A few were contributed as GitHub PRs.

The Signal one took by far the most effort to get going. But, for ideological reasons, I really wanted to have it 🙂 Unlike most other services, Signal doesn’t have a public HTTP API for sending messages. Instead, you have to run your own local Signal client and send messages through it. Healthchecks is using signal-cli.
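
To give a feel for what "send messages through it" can look like, here is a sketch that builds a request for signal-cli's JSON-RPC mode. Treat the details as assumptions: the text above does not say how Healthchecks talks to signal-cli, the function name and phone number are placeholders, and delivering the request to the running signal-cli process is left out.

```python
import json


def build_send_request(recipient: str, message: str, request_id: int = 1) -> str:
    """Build a JSON-RPC 2.0 "send" request as one line of JSON.

    Assumes signal-cli is running in its JSON-RPC mode; the transport
    (stdin or a socket) is not shown here.
    """
    payload = {
        "jsonrpc": "2.0",
        "method": "send",
        "params": {"recipient": [recipient], "message": message},
        "id": request_id,
    }
    return json.dumps(payload)


if __name__ == "__main__":
    # Placeholder recipient number and message text.
    print(build_send_request("+10000000000", "The check 'backup' is DOWN"))
```

The `id` field lets the caller match signal-cli's reply to the request, which matters when the outcome of a send (delivered, rate limited, unknown recipient) needs to be reported back to the user.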

Q: What volume of data are you storing in PostgreSQL? Any reason not to use a hosted PostgreSQL provider?

Around 200 write tx/s as a baseline. Spikes to 2000 write tx/s at the start of every minute, and 4000 write tx/s every 10 minutes.

Not using a hosted PostgreSQL provider for several reasons:

  • Cost
  • Schrems II
  • From what I remember, both Google Cloud SQL and AWS RDS used to have mandatory maintenance windows. The fail-over was not instant, so there was some unavoidable downtime every month. This was a while ago – maybe it is different now.

Q: Is the decision not to use Patroni for HA PostgreSQL in this case, so that you don’t add more complexity?

Yes. Plus, from reading database outage postmortems, I was not comfortable making the “do we fail-over now?” decision automatic. Think about brownouts, where the primary is still up but slow, or is experiencing intermittent packet loss.

I’ve automated the mechanics of the fail-over, but it still must be initiated manually.

Q: I’m getting the impression, the bus factor at Healthchecks.io seems to be 1. If I’d run a one man show type of business, I’d love to have some kind of plan B in case I’d be incapacitated for more than half a day.

Yes, the bus factor is 1, and it’s bugging me too. I think any realistic plan B involves expanding the team.

Q: How much does it all cost?

I don’t have a precise number, but somewhere in the €800/mo region.

Q: How do you think open-sourcing the self-hosted version of your product impacted your sales? Positively, negatively?

I can’t say definitively, but my gut feeling is that the effect was positive.

What if another operator takes the source code, and starts a competing commercial service? I’ve seen very few (I think 1 or 2) instances of somebody attempting a commercial product based on Healthchecks open-source code. I think that’s because it’s just a lot of work to run the service professionally, and then even more work to find users and get people to pay for it.

What if a potential customer decides to self-host instead? I do see a good number of enthusiasts and companies self-hosting their private Healthchecks instance. I’m fine with that. For one thing, the self-hosting users are all potential future clients of the hosted service. They are already familiar and happy with the product; I just need to sell the “as a service” part.