Pēteris Caune

Incident Report – 27 October 2019

On 27 October around 18:30 UTC one of the load balancers serving healthchecks.io, hc-ping.com and hchk.io started experiencing network issues. A monitoring script correctly detected this and removed the affected load balancer from DNS rotation, but some requests still got lost while the DNS changes propagated.

After receiving a monitoring alert I jumped in to investigate. The load balancers and backend servers are set up to communicate over a virtual private network (Hetzner’s vSwitch feature). The problematic load balancer was intermittently unable to reach the backend servers over their private IP addresses. Haproxy was reporting backends sporadically flipping between healthy / unhealthy states.

I opened a support request with Hetzner at 19:03 UTC, and in the meantime tried various things to fix the issue myself. Hetzner support confirmed they are dealing with an issue and at 22:21 UTC they reported the issue was resolved. The load balancer could reach all its peers over their vSwitch IP addresses again. I monitored connectivity for one more hour and then added the load balancer back to DNS rotation.

At the time of writing this, I don’t yet know what the exact root cause was. The vSwitch configuration had worked with absolutely no problems for many months, and this incident came seemingly out of the blue. I requested more details from Hetzner support and so far they’ve said they are still analyzing this incident, but the vSwitch feature itself is considered “very stable”. I have not made up my mind whether to replace the vSwitch networking with something else. I don’t want to make hasty decisions, and will first wait for more information from Hetzner.

In summary, the good:

  • A monitoring script correctly detected a malfunctioning load balancer and removed it from DNS rotation.
  • Luckily I was able to investigate immediately (the incident happened on Sunday at 8:30PM local time).
  • Quick communication from Hetzner support: 7 updates from them in 3 hours.
  • And, what really matters in the end: they got the issue fixed.
  • The other load balancer remained fully functional the whole time. Some of the uptime monitoring services didn’t even notice an outage.

And the bad:

  • Some pings from client systems did get lost. This likely resulted in a number of false “X is down” alerts after the outage.

I apologize to all Healthchecks.io users for any inconvenience caused. For a monitoring service, any downtime is unacceptable, and I will continue to look for any opportunities to make the service more robust.

– Pēteris Caune,
Healthchecks.io

A Look at Healthchecks.io Hosting Setup, Summer 2019

For a monitoring service, uptime and reliability are of course critical features: customers are placing trust in the service to detect problems and deliver timely and accurate alerts. While I cannot guarantee that Healthchecks.io will absolutely never let you down, I can offer transparency on how it is currently being hosted and operated.

I will use bullet point lists liberally in this blog post (otherwise, it could turn into “The Opinionated and Frugal SRE Book”). Let’s go:

  • Main values: Simple is good. Efficient is good. Less is more.
  • The core infrastructure runs on Hetzner bare metal machines. Hetzner offers amazing value for money and is a big part of the reason why Healthchecks.io can offer its current pricing.
  • No containers, no auto-scaling, no “serverless”. Plain old servers, each dedicated to a single role: “load balancer”, “application server” and “database server”.
  • The machines are closer to “pets” than “cattle”: I have provisioning scripts to set up new ones relatively quickly, but in practice the working set of machines changes rarely. For example, the primary database server currently has an uptime of 375 days.

Load balancers

  • Hardware: Two Intel 9900K machines, costing €69/month each.
  • Software: HAProxy.
  • Fault tolerance: A watchdog script on an external VM detects unhealthy load balancers and removes their DNS records.
  • Supports HTTP/2. Supports IPv6. Uses ECDSA certificates with modern clients and falls back to RSA certificates with old clients.
  • Used Loader.io to test performance-related configuration changes.
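
A minimal sketch of the decision step such a watchdog might make. This is not the actual script, which the post does not show: the `probe` callable, the function name and the rule of never emptying the DNS record set are all assumptions made for illustration.

```python
from typing import Callable, Iterable, List


def healthy_records(ips: Iterable[str], probe: Callable[[str], bool]) -> List[str]:
    """Return the load balancer IPs that should stay in DNS rotation.

    `probe` is a hypothetical callable that returns True when a load
    balancer answers a test request. If every probe fails, all records
    are kept: the fault may be in the watchdog's own network, and an
    empty DNS record set would take the whole service down.
    """
    ips = list(ips)
    healthy = [ip for ip in ips if probe(ip)]
    return healthy if healthy else ips
```

The caller would then compare the returned list with the records currently published and update DNS only when they differ.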

Before going with HAProxy, I used Cloudflare Load Balancer. Compared to Cloudflare:

  • The monthly running cost ends up being slightly lower but similar.
  • It took more work to research, set up, and maintain the dedicated load balancer servers.
  • HTTP request latencies seen from clients are much more consistent. Cloudflare was great on some days, and not so great on others.
  • More control over the load balancer configuration, and more options to diagnose problems.

Application Servers

  • Hardware: Three Intel 6700/7700 machines, costing €39/month each.
  • Software: Nginx -> uWSGI -> Django.
  • Fault tolerance: load balancers detect unhealthy or “in maintenance mode” machines and remove them from rotation.
  • For daemon processes that need to be constantly running (“sendalerts”, “sendreports”) I use systemd services.
  • Communication with load balancers is over private network (Hetzner’s vSwitch).

Side note: when I started out, I took the private network idea to the extreme: I asked Hetzner support to install two NICs in each machine, and to connect the secondary NICs using a dedicated switch. A custom order like that takes more time (days instead of minutes) and costs more, but is possible. I later moved away from that setup, for a couple of reasons:

  • All servers had to be in the same rack. Cannot add more machines if the rack is full.
  • Also, with this setup, one critical switch failure could take out all of the servers.

Database

  • Hardware: Two Intel Xeon E3-1275 machines, ECC RAM, NVMe storage, costing €59/month each.
  • Software: PostgreSQL 10.
  • Primary/hot standby setup.
  • Fault tolerance: I am not brave enough to completely automate database failover – I have a tested procedure to perform the failover, but the decision has to be made manually.
  • Application servers specify multiple hosts in the database connection strings. When primary changes, the applications can fail over to the new primary without any additional configuration changes.
  • Backups: Full daily backups encrypted with GPG and uploaded to S3. I keep 2 months worth of backups and delete older ones.
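
The multi-host connection string mentioned above can be sketched as follows. The host names, database name and user are placeholders, and the use of `target_session_attrs=read-write` is an assumption about how the primary is selected; the post does not list the actual parameters.

```python
def build_dsn(hosts, dbname, user, port=5432):
    """Build a libpq connection string that lists several hosts.

    libpq (PostgreSQL 10 and later) tries the hosts in order.
    target_session_attrs=read-write makes it skip a server that only
    accepts read-only sessions, such as a hot standby.
    """
    return " ".join([
        "host=" + ",".join(hosts),
        f"port={port}",  # a single port applies to every listed host
        f"dbname={dbname}",
        f"user={user}",
        "target_session_attrs=read-write",
    ])


# Hypothetical host names, database name and user.
print(build_dsn(["db1.example.org", "db2.example.org"], "hc", "hc"))
```

With a string like this, promoting the standby needs no change on the application servers: new connections land on whichever host accepts read-write sessions.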

Monitoring

  • OpsDash agent and hosted dashboard for a “big picture” view of the servers, and alerting.
  • Netdata agent on each machine for investigating specific issues.
  • Four VMs in different locations sending regular pings, and logging everything, including TCP packet captures.
  • Logs from the VMs aggregated to Papertrail. Papertrail sends alerts on specific log events.
  • An always-on laptop dedicated to showing live Papertrail logs.
My workplace, Papertrail logs in the top right screen

Ops

  • Yubikey for signing commits, logging into servers and decrypting database backups. For initial setup, I used DrDuh’s guide.
  • A small laptop with development and deployment environments, and another Yubikey. Travels with me when I leave home.
  • Fabric scripts for server provisioning, updates and maintenance.
  • Operated by one person. I’m not in front of my PC 100% of the time, so incidents can take time to fix.

Custom Bits

  • I maintain a private branch of the Healthchecks project with non-public customizations: branding, customized footer template, the “About”, “FAQ”, “Privacy Policy” pages and similar.
  • For serving the ping endpoints (hc-ping.com, hchk.io) efficiently, I wrote a small Go application, which I have not open sourced.

Last but not least, I am eating my own dog food and am monitoring many of the periodic and maintenance tasks using Healthchecks.io itself. For example, if the load balancer watchdog silently fails, I might not notice until much later when it fails to do its duty. With Healthchecks.io monitoring, if the watchdog runs into issues (the script crashes, VM loses network connectivity, anything) I receive an email and an SMS alert within minutes.
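
A rough sketch of how a periodic script can report to Healthchecks.io, assuming a ping URL of the form `https://hc-ping.com/<uuid>`. The function name and the injectable `opener` parameter are illustrative and not taken from the actual watchdog.

```python
import urllib.request


def run_and_ping(job, ping_url, opener=urllib.request.urlopen):
    """Run `job`, then report success by requesting `ping_url`.

    If `job` raises, no ping is sent, so the check goes down once the
    expected ping is overdue. Returns True if the ping request itself
    went through.
    """
    job()
    try:
        opener(ping_url, timeout=10).close()
        return True
    except OSError:
        # Network trouble on the VM (URLError is an OSError subclass).
        # The missing ping is what triggers the alert, so nothing
        # else needs to happen here.
        return False
```

The point of the design is that every failure mode (crash, hang, lost network) looks the same from the outside: the ping does not arrive.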

Thanks for reading,
– Pēteris, Healthchecks.io

Healthchecks @ PyCon Lithuania ’19

Foreword: In May this year, I had the honour to speak at PyCon Lithuania about Healthchecks. Having practically no public speaking experience, I prepared carefully. As a part of the preparation, I had the whole speech written out. This saved me from lots of awkwardness during the talk, but also makes it easy to share the final version here in a readable form. Below are the slides and the spoken text from my talk “Building an Open-source Django Side-project Into a Business” at PyCon Lithuania 2019. Enjoy!

Hello, my name is Pēteris Caune, I am happy to be here at PyconLT and this talk is about my side-project – written in Python – it’s called Healthchecks.

I will talk about my motivation for building Healthchecks, and about some of the notable events and challenges, about the project’s current state, how things have worked out so far, and future plans.

Before we get into it, here’s the current status of the project. Healthchecks is almost 4 years old. It has a bunch of paying customers and it comfortably covers its running costs. But – it is not yet paying for my time. Basically, currently Healthchecks is a side project, and the medium term plan is for it to become a lifestyle business. Keep this in mind as a context – with this project, I’m not focusing on profitability and am fine with it growing slowly. For instance, I’m not using aggressive marketing and aggressive pricing. If in your project you do need to be as profitable as possible, as soon as possible, you will want to do a few things differently than me.

So, with that in mind, let’s go!

Healthchecks is a Django app for monitoring cron jobs on servers. It uses the Dead Man’s Switch principle: you set up your cron job or background service to make small, simple HTTP requests to the Healthchecks server.

Each HTTP request is like a message saying “I’m still alive!”, it is a sort of a heartbeat.

The Healthchecks server looks for these “heartbeat” messages.

And it sends you alerts when they don’t arrive at the expected time. That’s the basic idea. And it’s a simple idea and I was not the first one to think of it.
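
The core of that idea fits in a few lines. This is a deliberately simplified sketch with made-up names, not code from the Healthchecks project:

```python
from datetime import datetime, timedelta


def is_overdue(last_ping: datetime, period: timedelta, now: datetime) -> bool:
    """Return True when the next heartbeat should already have arrived."""
    return now > last_ping + period


# A job expected to report in once a day, last heard from on 1 May at 03:00.
last = datetime(2019, 5, 1, 3, 0)
print(is_overdue(last, timedelta(days=1), datetime(2019, 5, 2, 2, 0)))  # still within the period
print(is_overdue(last, timedelta(days=1), datetime(2019, 5, 2, 4, 0)))  # one hour past the period
```

The server runs a comparison like this for every check and sends an alert when the answer flips to overdue.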

About four years ago, in 2015 I was looking for a tool like this. And I found two services, that already existed, DeadMansSnitch and, recently launched, Cronitor.

They looked good, but they seemed pricey, and I could not justify paying for them for the types of things I wanted to monitor at the time.

So I was bouncing around this idea for a while – “maybe I just build this myself…?” In June 2015 I started to hack on it and made the first git commits. It seemed like a fun thing to work on. I liked how conceptually simple the basic idea was.

I set a slightly ambitious goal to build a service that works as well or better than the existing services, and is offered for free or for significantly cheaper. This was the challenge: I was not interested in just cloning Dead Mans Snitch or Cronitor, and having similar pricing, and just competing with them.

Another aspect of motivation for this was: In my day job as a full stack developer I do have a certain amount of influence on the various decisions. But, ultimately, I’m working on somebody else’s product. And I must be careful not to get emotionally attached to my work and its fate.

Now, with Healthchecks, I would have full control over what this thing becomes. And I would be in charge of everything: product design, UI design, marketing activities or lack thereof, customer support – talking with customers directly, taxes and legal stuff. These are the types of things that I don’t normally do as a developer.

So this would require me to get out of the comfort zone, face new challenges and learn new skills, which is good. By the way, me talking here at PyconLT is also a form of such a challenge: public speaking! I’m an introverted person – so, ya – this is a challenge!

Another line of thought was – even if the project doesn’t go anywhere, it would look good on my CV, and be useful that way.

So, that’s the motivation.

Healthchecks uses Django framework. I chose Django because I was already familiar with it. I knew I would be immediately productive. I started with a simple setup: a Django web application and a Postgres database and almost nothing more. My plan was to try to keep it as simple as possible and see how far I could get without complicating it with additional components. I set out to get the basic functionality working, then work on polishing the UI and, basically, see where it goes.

So, June 2015, I made the first commit to a public Github repository. I decided to make this open source to differentiate from existing competitors. Also, developing this in the open would be a little bit like a reality show – everyone can see how my work is progressing, and how crappy or not crappy my code is. Sounds fun, right?

A month later I deployed the code to a $5/mo DigitalOcean droplet, I bought the healthchecks.io domain and an SSL certificate. I made a “Show Hacker News” post (which did not go anywhere). And, like that, the project was live!

Later in 2015, I made a blog article about some technical aspects of the project, and submitted it to Hacker News, of course, and this one did get on the front page and brought in a good amount of visitors. Both the webapp and the database were still on a single DigitalOcean droplet, but I had moved it up to a $20/mo in anticipation of traffic.

Six months in, I started to see server’s CPU usage climbing. CPU was mostly being spent handling the pings – the incoming HTTP requests sent from the cron jobs. These are simple requests but the volume was steadily going up. I used this as an excuse to learn a little bit of go-the-language and I wrote the ping handler in Go. It had a significantly smaller per-request overhead compared to Django.

My naive go code was responsible for several outages later down the road. It’s not the fault of the language of course, it’s just me not thinking about various failure scenarios, and performance degradation scenarios, and me not properly testing them.

Around the same time I was also setting up paid plans. At the launch there was only an unlimited free plan, but I needed to generate revenue, in order to at least pay for the servers. Otherwise, the project would not be sustainable long term.

First of all, I was and still am working as a contractor. Since I need to pay my taxes, I already had a registered company before diving into Healthchecks. This was fortunate because it was one less barrier for me to start the project. Taxes for the company are handled by outsourced accountants, which has been very helpful. I’m fine with paying taxes, but all the paperwork and bureaucracy I do not enjoy at all, so I try to avoid all of that as much as possible.

So, for payments, I looked at Stripe but it was not available in Latvia at the time.

I then looked at Braintree which seemed OK. Setting up an account with them was easy enough. There was a fair bit of development work on the Healthchecks side, to integrate with their API and to build out the functionality for entering a payment method, entering billing details, selecting a plan, generating invoices.

Initially, the price for the paid plan was $5/mo. The free plan was still practically unlimited, meaning there was little incentive to upgrade.

Nine months in Healthchecks got its first paying customer! $5 MRR.

One year after starting the project, I moved from a single server setup to the web server and the database hosted on separate DigitalOcean droplets.

The web server used a floating IP. When I needed to deploy new code, I would create a new droplet, deploy the new code to it, do some smoke testing and if everything looked OK – switch the floating IP over to the new droplet. And I kept the old droplet around for a while so I could switch back in case of problems.

For deployment I used – and still use – Fabric with fabtools. I looked into Ansible as well, and used it for a while, but ultimately went back to Fabric because it’s less complex, less magic, is easier to reason about, and the deployments ran much faster.

I’m still using Fabric version 1.something, which uses Python 2.7. Not ideal. So I will need to deal with this in the not too distant future, but it works fine for now.

August 2016, Healthchecks had an outage of about 24 hours. I was away on a road trip to Estonia, I was not checking my phone, and was completely unaware of the outage. When I returned home, my inbox was full of emails, twitter was full of notifications and there was panic going on in github issues.

After this incident I did a few things, one was to get a dedicated second hand laptop with a working battery, and set up a development environment on it. It has a full disk encryption. It has a yubikey nano plugged in, for signing git commits and for SSH-ing into servers. It now comes with me wherever I go so I can fix issues when I’m away from home.

In 2017 I moved the project’s hosting multiple times, as I was trying to improve the reliability, quality and fault tolerance of the service.

In April, I moved the Postgres database to compose.io, which is a DBaaS provider. The idea was that Compose.io would take care of managing database replication and automatic failover. It sounded good on paper and everything looked good in my preliminary testing, but, once I switched the production traffic to Compose, I was having capacity issues. In Compose you can scale up your database capacity – and your monthly bill – by simply moving sliders in their UI. I had to scale up to a point where it would be too expensive for me. So… Same month, I moved back to my previous setup.

By then, I had a clear idea of what I’m looking for in a hosting provider. One of the crucial things was a load balancer that could handle traffic spikes and do lots of TLS handshakes per second. If a load balancer can nominally do, say, 200 new HTTPS connections per second, but the site sees 1000 during traffic spikes then that’s no good. And the nature of cron jobs with common schedules is that traffic does come in waves.

Google’s Cloud Load Balancer looked like a good option – it is, you know – Google scale! In May 2017 I moved the service to Google Cloud Platform. They had also recently launched managed Cloud Database service, which was nice, and I made use of that as well.

Google’s load balancer was handling any amount of TLS handshakes fine, but I was seeing occasional failed requests in the load balancer’s logs. And I spent a good amount of time searching for solutions and troubleshooting – I was trying everything I could think of on my end, things like tweaking nginx parameters, I was also tweaking network-related kernel parameters. I opened issues with Google Cloud Platform’s customer support – they were very polite and willing to help, but didn’t seem to have the expertise or the access to engineers with the expertise.

They did suggest a few trivial things I had already tried. They found a relevant reddit post and sent me a link to it. Funny thing is, that post was written by me, I was documenting my issue and asking for advice there.

In short, I was unable to fix the issue with the failed requests, and I was looking for other options.

In October I moved to Hetzner for hosting, and to Cloudflare for load balancing. This is how the service is running today, I’m still using Hetzner and Cloudflare. By the way, Hetzner is a German hosting provider that has really good prices for bare metal servers.

OK, so I’ve been rehearsing this talk in the past week and this was the phrasing that was suggested by my brother: “I had an interesting problem – people wanted to pay me”.

So yeah, I needed to set up American Express payment processing in Braintree. The setup involved filling a few scary looking forms, printing and signing a contract with American Express, then scanning it and sending it back to them. In hindsight, it wasn’t too hard. But the feeling at the time was – “oh this is getting serious!”

In March 2018 I increased the paid plan’s price from $5 to $20 and tightened the free plan’s limits. I decided to do this after watching and reading a bunch of Patrick McKenzie’s talks, and his podcasts and blog posts that all had one main theme: “charge more!”.

I didn’t feel this was betraying the project’s original mission. After the change, Healthchecks still had the free plan with generous limits. The free plan would be aimed at individual users and maybe small teams with no budget. But for the paid plan, which was aimed at companies, $5 or $20 should not make a difference I thought.

And, after the pricing change I didn’t get any negative feedback, which was nice, I was still seeing new signups, and the monthly recurring revenue graph started to look a bit more promising.

Last year around this time everyone was busy implementing the changes needed for GDPR compliance. And so was I. Luckily for me, the service was already in mostly good shape technically. It was not using any analytics services, so no cookie worries, it was not collecting unneeded personal data. Well, it does need to collect email addresses for sending notifications, and the users can specify mobile numbers for receiving text notifications, and, for the paid users, if they need proper invoices, then they have to enter their billing information of course. But that’s about it. I also needed to update the Privacy Notice document, like most everyone else.

In parallel to these events – chasing down network reliability issues and changing hosting providers, I’ve of course also been working on adding new features and improving the existing features based on user feedback.

Now we are in May 2019, and here are some quick stats about the project. Healthchecks has over 6500 active user accounts, it is processing about 10M pings per day. That works out to a little over 100 pings per second. So – not too crazy but keep in mind the traffic is spiky.

Healthchecks.io currently has about 120 paying customers, and the monthly revenue is $1600. A chunk of that goes back into running costs, and in taxes, overheads and what not.
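
The per-second figure quoted above follows directly from the daily volume. A quick check, taking "about 10M" as exactly 10,000,000:

```python
PINGS_PER_DAY = 10_000_000
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

average_rate = PINGS_PER_DAY / SECONDS_PER_DAY
print(f"{average_rate:.1f} pings per second on average")
```

This is only the average. Because many cron jobs share common schedules, the peak rate is what the load balancers have to be sized for.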

Healthchecks is still running on Hetzner’s bare metal servers. The Postgres database has a primary and a hot standby, and the failover is manual. I can trigger the failover with a single command, but, yes, that command is manual. I simply don’t trust myself to anticipate all the corner cases for doing this automatically. I’ve seen dedicated teams of people smarter than me mess this up, so I’m accepting that, for the time being, the failover is manual.

At a high level, the app is still simple like it was in the beginning: a few load balanced web servers running the Django app and a Postgres database (and the ping handler written in go). There are no queues, no key value stores, no fancy distributed stuff – because as long as I can get away without using them, I want to keep things simple. That’s the theme here: simple and cheap.

The Django app is still open source, and I know it is being self-hosted by more than a few people. I think open sourcing it was the right decision. I’m getting code contributions and bugfixes from time to time – mostly minor stuff but still very much appreciated.

So here’s something to think about: does the self-hosting option hurt my sales? I cannot say for sure but I estimate that, if yes, then – not by much. I think the self-hosting crowd falls into two groups: homelab enthusiasts who want to run stuff themselves for the fun of it, and companies who want to self-host because of custom needs or policy and compliance reasons. Neither group would be likely to be on a paid plan on the hosted version. So no big loss there.

Another question would be: if I look back at my “mission statement”, did I succeed? The service has lots of happy customers. Pretty much every support request starts with a “thank you for the great service”. I have also been sneakily using Slack App Directory to compare the popularity of Healthchecks and its competitors. And, at least according to this metric, Healthchecks is the most used one. So that’s good. But the big caveat is, Healthchecks is not yet paying for my time. I cannot yet afford to work on it full time. It is growing steadily though. And, luckily, I am in a fortunate position where I can afford to let it grow steadily and slowly.

Also, it has taught me a lot. It is a great addition to my CV.

My future plans are to keep making continuous improvements to the codebase based on user feedback. Continue the work on reliability and robustness improvements. Reach a point where it pays for my time and is not a hobby project any more, whenever that happens. After that, reach a point where it pays for a second person, so it does not rely on me alone.

Working on Healthchecks.io has been a great experience. I still enjoy working on it and I look forward to doing more of it. This is my talk, thanks for listening!

How I monitor the Ingress Sojourner Medal using Healthchecks.io

The primary intended use, in the case of Healthchecks.io, is to monitor the regularly running tasks on servers, such as cron jobs. However, the “alert me if X doesn’t happen on time” functionality can be useful in many other contexts too. One of the quirkier ways I’ve been personally using Healthchecks.io is to help my progress towards the Sojourner medal in Ingress.

Ingress is a location-based, augmented-reality game. It is a predecessor of Pokemon GO, and is made by the same company. In Ingress, players visit physical points of interest (“portals”) and perform various in-game actions on them. One of the gamification features in Ingress is Medals. Medals are awarded to players as they reach set milestones. For example, the “Trekker” medal is awarded for walking a long distance while playing Ingress:

Bronze: 10 km, Silver: 100 km, Gold: 300 km, Platinum: 1000 km, Black: 2500 km

Another medal in Ingress is Sojourner. For the Sojourner medal, the player must visit and “hack” portals in consecutive 24 hour periods:

Bronze: 15-day hacking streak, Silver: 30-day hacking streak, Gold: 60-day hacking streak, Platinum: 180-day hacking streak, Black: 360-day hacking streak

The black Sojourner medal is notoriously hard to obtain. If the player misses their 24 hour window, the current hacking streak resets to zero, and they have to start from scratch. The player must keep track of “When was my last hack? How much time do I have left?” day in day out. My own highest hacking streak is 233 days. Everything was going well until one day I forgot to go outside and do my “daily hack” on time. And just like that, I was back to square one!

After a period of “I’m never doing this again!”, I started another attempt. This time I was also using Healthchecks.io to keep track of how much time I have left each day and to remind me when the time is about to run out. Here’s how this works: I’ve set up a check on Healthchecks.io with a period of 20 hours:

For notifications, I’m using Pushover with the “emergency” priority: if the “Daily Hack” check doesn’t get pinged at least every 20 hours, my phone will get a notification. The “emergency” priority causes the notification to be repeated every 5 minutes until it is acknowledged. The emergency notifications also ignore the “Do Not Disturb” mode on the phone.

I will not be reverse engineering the Ingress mobile app, so I also need a convenient way to manually send HTTP requests to the “Daily Hack” check when I’m returning home from playing. I’ve found the HTTP Shortcuts Android app and its home screen widget useful for that.

Tap on “Daily Hack” and the HTTP Shortcuts app fires off an HTTP request to Healthchecks.io

I also bought an AWS IoT button and stuck it on the inside of my apartment’s front door. When passing the door, I can press the button, which triggers a Lambda function, which then pings Healthchecks.io. Here is a quick demonstration. In practice, I am using the phone’s home screen widget most of the time, but this was a good excuse to play around with the AWS IoT button.

And that is all! Whenever 20 hours pass without a ping to Healthchecks.io, my phone then sounds notifications every 5 minutes. At this point, I still have 4 hours left to go out and hack a portal. I don’t silence the notifications until the hack is done–just to be safe and stay “motivated” by the annoying notifications. This setup has been working well for me so far. My current hacking streak is 201 days and counting. Fingers crossed!
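
The timing can be sketched as below. It assumes the ping is sent at the moment of the hack; since the ping is actually sent on the way home, the real margin is somewhat shorter. The names are illustrative.

```python
from datetime import datetime, timedelta

CHECK_PERIOD = timedelta(hours=20)  # period of the "Daily Hack" check
HACK_WINDOW = timedelta(hours=24)   # window for keeping the streak alive


def deadlines(last_hack: datetime):
    """Return (first_alert, streak_lost) times for a given last hack."""
    return last_hack + CHECK_PERIOD, last_hack + HACK_WINDOW


alert_at, lost_at = deadlines(datetime(2019, 1, 1, 18, 0))
print("notifications start:", alert_at)
print("time left after that:", lost_at - alert_at)
```

A shorter check period would give more warning but would also fire on days when the hack is simply planned for the evening.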

Have you also found an unconventional or creative use for Healthchecks.io? Please let me know via Twitter or email!

Pēteris,
Healthchecks.io

Introducing Projects on Healthchecks.io

If your Healthchecks account is growing and is getting a little hard to manage, Healthchecks.io has a new feature for you: Projects. Use Projects to organize your monitoring dashboards, to have finer-grained team and API access controls, and to simplify your check-integration mapping.

You start with all your existing checks in your default project. You can create more projects and move your existing checks between them. There is no limit to the number of projects you can have. However, your account-level check limit is shared by the projects you own. For example, if you are on the Hobbyist plan (free, 20 check limit) and have three projects, the total number of checks across all three projects must not exceed 20.
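
The shared limit amounts to a sum over all projects owned by the account. A small sketch with hypothetical project names and counts, not the actual Healthchecks code:

```python
def can_add_check(checks_per_project, account_limit):
    """Return True if one more check fits under the account-level limit."""
    return sum(checks_per_project.values()) < account_limit


# Hobbyist plan: 20 checks shared by three projects.
projects = {"default": 12, "staging": 5, "home": 3}
print(can_add_check(projects, 20))  # the 20 checks are already used up
```
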

Overview of Your Projects

Click on the Healthchecks.io logo in top navigation to see an overview of your projects. You can also use the drop-down menu in the top navigation to switch between projects and to access the “Project Settings” pages. The red badges show the number of checks currently down:

Project Settings

API keys and team memberships are now project-scoped. In the “Project Settings” page you can generate the API keys and invite people to the project’s team. The “Project Settings” page is currently only available to the project owner.

Transferring Checks Between Projects

You can move checks between the projects you have access to. For example, you can create a check in your personal account, but later decide to move it to your company account. The check’s URL and ping history is preserved, but the enabled integrations change: the check loses its current integrations, and gets assigned all integrations of the new project.
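
The transfer rule can be illustrated with a toy data model. The classes and field names below are invented for this example and do not reflect the actual Django models:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Project:
    name: str
    integrations: List[str] = field(default_factory=list)


@dataclass
class Check:
    url: str
    ping_history: List[str]
    project: Project
    integrations: List[str] = field(default_factory=list)


def transfer(check: Check, target: Project) -> None:
    """Move a check: URL and ping history stay, integrations are replaced."""
    check.project = target
    check.integrations = list(target.integrations)
```

After `transfer(check, company_project)`, the check alerts through every integration of the target project and none of its previous ones.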

This is the initial release of Projects. I will appreciate any and all feedback as I keep iterating on the interface and the features.

Happy monitoring,
Pēteris,
Healthchecks.io

Investigating Gmail’s “This message seems dangerous”

I’ve been receiving multiple user reports that Gmail shows a red “This message seems dangerous” banner above some of the emails sent by Healthchecks.io. I’ve even seen some myself:

Gmail’s “This message seems dangerous” banner in action

The banner goes away after pressing “Looks safe”. And then, some time and some emails later, it is back.

It’s hard to know what exactly is causing the “This message seems dangerous” banner. Google of course won’t disclose the exact conditions for triggering it. The best I can do is try every fix I can think of.

Here’s what I’ve done so far.

SPF and DMARC records

For sending emails, Healthchecks.io uses Amazon SES. It’s super easy to set up DKIM records with AWS SES so that was already done from the beginning.

I’ve now also:

  • Configured a custom MAIL FROM domain (“mail.healthchecks.io”)
  • Added SPF and DMARC DNS records, and tested them with multiple online tools
Gmail seems to be happy with SPF, DKIM and DMARC

Update 20 November 2018: In DMARC reports, I’m noticing that a significant number of emails are failing both SPF and DKIM:

A section of DMARC weekly digest from Postmark

Apparently, last week, Google has processed more than 6000 emails that fail both SPF and DKIM checks. I’m thinking of two possibilities:

  • an email forwarder is changing email contents (adding a tracking pixel, adding a “Scanned by Antivirus XYZ” note or similar). With email contents changed, DKIM signatures are no longer valid.
  • somebody is spoofing emails from healthchecks.io addresses

Either way, I want to see if these messages are the culprit, so I’m changing the DMARC policy from “none” to “reject”. This instructs Gmail to ignore and throw away email messages that fail both SPF and DKIM checks. Let’s see what happens!
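For reference, a DMARC policy is published as a TXT record at _dmarc.<domain>, written as semicolon-separated tag=value pairs. Here is a minimal sketch of reading the policy out of such a record. The record strings and the example.org address are illustrative assumptions, not the actual healthchecks.io DNS records:

```python
def parse_dmarc(record: str) -> dict:
    """Split a DMARC TXT record into a {tag: value} dict."""
    tags = {}
    for part in record.split(";"):
        part = part.strip()
        if not part:
            continue
        key, _, value = part.partition("=")
        tags[key.strip().lower()] = value.strip()
    return tags


# Illustrative records (hypothetical, not copied from real DNS)
before = parse_dmarc("v=DMARC1; p=none; rua=mailto:dmarc@example.org")
after = parse_dmarc("v=DMARC1; p=reject; rua=mailto:dmarc@example.org")

print(before["p"], "->", after["p"])
```

The "p" tag is the part that changes when moving from a monitoring-only policy to one that asks receivers to reject failing messages.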

List-Unsubscribe Header

Healthchecks.io notifications and monthly reports have always had an “Unsubscribe” link in the footer. I’ve now also added a “List-Unsubscribe” message header. Gmail seems to know how to use it:

Gmail shows an additional “Unsubscribe” link next to sender’s address. Clicking it brings up a neat confirmation dialog.

Maybe Gmail also looks for it as a spam/not-spam signal. As I said — I’m trying everything.
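As a rough illustration of the header described above, here is how a “List-Unsubscribe” header could be set with Python’s standard email library. The sender address, subject and unsubscribe URL are made-up placeholders, and this is a sketch rather than the actual Healthchecks.io code:

```python
from email.message import EmailMessage


def build_report_email(to_addr: str, unsubscribe_url: str) -> EmailMessage:
    msg = EmailMessage()
    msg["From"] = "reports@example.org"  # hypothetical sender
    msg["To"] = to_addr
    msg["Subject"] = "Monthly Report"
    # The header value is a URL wrapped in angle brackets
    msg["List-Unsubscribe"] = f"<{unsubscribe_url}>"
    msg.set_content("Hello,\nThis is a monthly report.\n")
    return msg


msg = build_report_email(
    "alice@example.org", "https://example.org/unsubscribe/abc123"
)
print(msg["List-Unsubscribe"])
```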

More Careful Handling of Monthly Reports

I’m looking into reducing the bounce and complaint rates of the monthly reports. The rates as currently reported by AWS SES:

They don’t look too bad, but I’m trying to lower them some more with these two changes:

  • When a monthly report bounces or receives a complaint, automatically and mercilessly disable monthly reports for that address. This was already being done for “XYZ is down” email notifications.
  • If none of a user’s checks have received any pings in the last 6 months, then that’s an inactive account: don’t send monthly reports for that user.
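The two rules above can be sketched as a small decision function. The Recipient class, its field names, and the 180-day approximation of “6 months” are all assumptions made for this example, not the real data model:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

INACTIVITY_CUTOFF = timedelta(days=180)  # "6 months", approximated


@dataclass
class Recipient:
    reports_enabled: bool
    # Most recent ping across all of the user's checks, None if never pinged
    last_ping: Optional[datetime]


def on_bounce_or_complaint(recipient: Recipient) -> None:
    """Rule 1: a bounce or complaint switches monthly reports off."""
    recipient.reports_enabled = False


def should_send_report(recipient: Recipient, now: datetime) -> bool:
    """Rule 2: skip accounts with no pings inside the cutoff window."""
    if not recipient.reports_enabled:
        return False
    if recipient.last_ping is None:
        return False
    return now - recipient.last_ping <= INACTIVITY_CUTOFF
```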

Reduce the Number of Links in Emails

Each Healthchecks.io alert email contains a summary of all of the checks in the user’s account. For each check, it shows its current status, the date of the last received ping, and a link to the check’s “Details” page on the website.

I removed the “Details…” links to see if Gmail is not liking emails with too many links.

7 December 2018: Solved?

I am cautiously optimistic that I’ve solved the issue by tweaking the contents of the emails. I haven’t seen the red Gmail warnings for a while now.

Here’s what happened: I noticed that removing the main content area from the email template makes Gmail’s red banner disappear. So I experimented with removing smaller and smaller chunks from the template until I had narrowed it down to a single CSS declaration:

/* MOBILE STYLES */
@media screen and (max-width: 525px) {
    .mobile-hide {
      display: none !important;
    }
}

This class was used to hide some elements to make things fit on mobile screens. Remove usages of this class: no red banner. Add it back: red banner is back! I tested this a number of times to make sure it was not just a coincidence.

Conclusion

If you are seeing the “This message seems dangerous” banner above your own emails, here’s one thing you can try: use your existing sending infrastructure to send a bare-bones “Hello World” email and see if the red banner shows up or not. If it doesn’t, then, presumably, something inside your regular email body is triggering it. Selectively remove chunks of the content until you find the problematic element. Change it or remove it.
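The “remove chunks until you find it” step is essentially a binary search. A minimal sketch, assuming exactly one chunk is responsible and that it triggers the banner on its own. In practice the shows_banner callback is a manual step (send a test email, look at it in Gmail); the function name and the sample chunks here are hypothetical:

```python
from typing import Callable, List


def find_trigger(
    chunks: List[str], shows_banner: Callable[[List[str]], bool]
) -> str:
    """Binary-search for the single chunk that triggers the banner.

    Expects a non-empty list of chunks.
    """
    candidates = list(chunks)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        # Send an email containing only `half` and check the result
        if shows_banner(half):
            candidates = half
        else:
            candidates = candidates[len(half):]
    return candidates[0]


# Stand-in oracle for the demo; the real check is done by eye in Gmail
template_chunks = ["header", "summary table", ".mobile-hide rule", "footer"]
culprit = find_trigger(
    template_chunks, lambda part: ".mobile-hide rule" in part
)
print(culprit)
```

Each round halves the remaining candidates, so even a long template needs only a handful of test emails.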

It is also important to do the other things: set up and validate SPF and DMARC records, test your unsubscribe links, monitor the bounce and complaint rates, monitor email blacklists, etc.

Good luck!

Pēteris,
Healthchecks.io

My One-person SaaS Side Project Celebrates its Third Birthday

First, a TL;DR on how much money I’m making. Healthchecks.io has around 90 paying customers, and the monthly revenue is a little above $700/mo. The bulk of that goes back into running costs.

About Me

I’m Pēteris Caune, a 34-year-old guy from Latvia. I’m married and the father of a baby daughter. I ride and race mountain bikes. In my day job, I work remotely for a small Irish company. I do Python web applications and Android mobile applications mostly, but it varies a lot.

About Healthchecks

Here’s the elevator pitch (assume a tall building).

Let’s say you just finished setting up a cron job that makes database backups and uploads them to S3. You just ran it by hand, and made sure a new .sql.gz file appeared in S3–all is well! Now, if it stops working one day six months from now, would you notice? The backup job can fail in many ways; here are just a few possibilities:

  • A well-meaning DBA changes the database password, but forgets to update the backup script
  • Slowly, over time, the machine doing backups runs out of disk space
  • Somebody “cleans up” AWS IAM policies and the script cannot upload to S3 anymore
  • Everybody has forgotten which machine and which user account is doing the database backups, and the machine gets decommissioned
  • The machine gets rebooted, backup script now fails because reboots were not tested

Here’s what you can do: edit the backup script to send an HTTP GET request to Healthchecks as the very last step. Healthchecks will treat these requests as “I’m still alive!” messages and will keep track of them. As soon as your service is silent for too long, it will send an alert (configurable: email, SMS, Slack, etc.) to you. And since Healthchecks runs on a separate host in a separate datacenter, you will get an alert even if your entire DC goes down.
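If the backup script happens to be written in Python, the final “I’m still alive!” request could look roughly like this. The function name, the retry count and the timeout are arbitrary choices for this sketch, and the urlopen parameter exists only so the function can be exercised without a network:

```python
import urllib.request
from typing import Callable


def send_ping(
    url: str,
    retries: int = 3,
    timeout: float = 10.0,
    urlopen: Callable = urllib.request.urlopen,
) -> bool:
    """Send an HTTP GET to the ping URL; return True on success.

    Network errors are swallowed and retried, so a monitoring hiccup
    does not make the backup job itself fail.
    """
    for _ in range(retries):
        try:
            with urlopen(url, timeout=timeout) as response:
                if 200 <= response.status < 300:
                    return True
        except OSError:
            # URLError, timeouts and connection errors are OSError subclasses
            continue
    return False
```

Call it as the very last step, after the backup has been verified to succeed, passing the ping URL of your own check. If the script fails earlier, the ping is never sent, and that silence is what triggers the alert.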

A quick word of caution: in this specific database backup example, you still want to test the backups by restoring them regularly. There are failure modes where the backup seemingly completes successfully, but the generated database dump is invalid or incomplete.

What other things can you monitor? Here are examples that would benefit from Healthchecks-style monitoring:

  • A job that runs weekly and sends out newsletters or weekly reports
  • A job that synchronizes business data between separate systems. For example, fetches an RSS feed and updates entries in a local database
  • A job that checks database replication status every minute
  • A job that updates DNS entries when the IP address changes
  • A job that renews a Let’s Encrypt certificate. Alternatively, monitors the expiry status of a certificate
  • A machine that sends pings unconditionally every minute. You receive an alert when the machine loses the network connection or is powered off

Why I started Healthchecks

I started work on Healthchecks three years ago, in summer 2015. I was looking for a service like this myself. Dead Man’s Snitch and Cronitor, higher-priced Healthchecks competitors, did already exist. However, they were too expensive for the relatively unimportant things I wanted to monitor. A little arrogantly, I thought I could build something that is cheaper and better. I was also looking for an excuse to work on something fun. Compared to some of my work assignments, here I would be in complete control of the product features, the design, the technical nitty gritty, the pricing strategy and everything else. I mulled over the idea for some time. Still undecided, I started hacking on a blank Django project in June 2015. A month later, I registered the healthchecks.io domain name, and at that point, the game was on!

Timeline of Notable Events

2015–06–11 First commit.

2015–07–18 Registered the healthchecks.io domain name

2015–07–29 The website goes live, running on a single $5 DigitalOcean droplet

2015–09–30 Added Slack and HipChat integrations

2015–10–21 Published “Deploying a Django App with No Downtime”, HN: 184 points, 93 comments

2015–12–10 Braintree payments setup complete.

2016–03–31 First paying customer! $5 MRR

2016–05–10 Implemented Team Access

2016–06–07 100M processed pings

2016–08–20 While road tripping and camping in the wilderness, hchk.io goes down for 24 hours.

Side note: After this incident I bought a used Thinkpad X240 and set up a development environment on it. It now travels with me when I leave home for more than a few hours. I have been poking around the servers while sitting in a parking lot before a cross-country MTB race. The laptop is set up with full disk encryption if it gets lost or stolen. My GPG/SSH key sits on a Yubikey.

2016–09–24 200M processed pings

2016–10–31 $100 MRR

2016–12–27 Implemented Cron expression support

2017–05–04 Migration to Google Cloud Platform

2017–07–31 Finished off and published Cron Syntax Cheatsheet

2017–08–20 1 billion pings processed

2017–10–29 Migration to Hetzner. Bare metal servers.

2018–08–24 Processing around 100 pings per second. $700 MRR–still a hobby project.

Current Status

Healthchecks.io gets a dozen or so new signups per day. Most are just checking out the service. But there are also people who register and set up ten checks and ping them right away.

In September 2017, I implemented rate limiting. In summer 2018, it was off for a bit.

Currently, Healthchecks.io receives 8 million pings per day. There is rate-limiting for checks that get pinged very often. Of the daily 8 million, about 4 million get written to the database.

Most active accounts have 2–20 checks. There are quite a few heavy users too: one account has 900+ checks, another has 400+, another has 300+ checks. There are 17 accounts with over 100 checks.

The most popular notification method is email, followed by webhooks and Slack.

Profit-wise, Healthchecks is still firmly a side project for me. After bills and taxes, there is little profit left. I could cut costs by migrating to a couple of cheap VPSes, and by getting rid of the load balancer. I could severely limit the free plan, and force people to upgrade to paid plans. But by doing that, I would give up my initial goals: free for individuals, fairly priced for companies, and with a good quality service. Healthchecks would turn from “a project I love hacking on and am proud of” to “a project I do solely for money while hating myself”. So–I’m not doing that.

Future Plans

I have no big announcements to write about here. I will keep making small iterative improvements to the service. I will try and keep the code and the design as simple (think KISS) as I can. When it becomes financially viable, I will look at expanding the team-of-one, to improve the bus factor.

With that, thanks for reading! If you haven’t already, check out Healthchecks.io here. The project is also open source: you can grab the code from GitHub, change and improve it, and host your own instance.

Why Some Monthly Reports Were Empty (Or: Isn’t Coding Fun?)

If you have received a monthly report from healthchecks.io in the past few days, it might have looked like this:

Something is missing here…

The report has a header and a footer, but the actual content of the report is simply missing! For reference, this is how a monthly report is supposed to look:

This is a more useful monthly report.

So, what happened? It took me a while, but I think I have figured it out.

Monthly emails have both HTML and text versions. I noticed early on that both the HTML and the text versions are missing their content sections. The text version is shorter and simpler, so we will look at that. Here it is, simplified some more and reformatted for brevity:

<!-- Template emails/report-body-text.html -->

Hello,
This is a monthly report sent by healthchecks.io.

{% include "emails/summary-text.html" %}

Cheers,
The healthchecks.io Team


<!-- Template emails/summary-text.html -->

{% load humanize %}
 Status                | Name             | Last Ping
-----------------------+------------------+------------------------------------
{% for check in checks %}
{{ check.get_status }} | {{ check.name }} | {{ check.last_ping|naturaltime }}
{% endfor %}

The child template prints out a simple ASCII-art table. I keep it separate from the parent template because this way I can reuse it in “Check XYZ is DOWN” notifications too. In the incorrect monthly reports, the child template’s content seems to be completely missing. The reports didn’t even have the “Status | Name | Last Ping” header in them. This was very puzzling to me until I learned how the Django templating system handles exceptions in child templates when settings.DEBUG is set to False:

class IncludeNode(Node):
    context_key = '__include_context'

    # ... constructor etc.  ...

    def render(self, context):
        try:
            # ... template gets rendered here ...
        except Exception as e:
            if context.template.engine.debug:
                raise
            template_name = getattr(context, 'template_name', None) or 'unknown'
            warnings.warn(
                "Rendering {%% include '%s' %%} raised %s. In Django 2.1, "
                "this exception will be raised rather than silenced and "
                "rendered as an empty string." %
                (template_name, e.__class__.__name__),
                RemovedInDjango21Warning,
            )
            logger.warning(
                "Exception raised while rendering {%% include %%} for "
                "template '%s'. Empty string rendered instead.",
                template_name,
                exc_info=True,
            )
        return ''

Apparently, something in the child template was throwing an exception and causing it to be completely omitted. What could it be? A dead database connection? An SQL error? An exception in the Check.get_status() method? An exception in the “naturaltime” filter?
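The effect is easy to reproduce outside Django. The following is a simplified stand-in for the behaviour shown above, not Django’s actual code: render_include and broken_child are names invented for this sketch, and the error message is only an example.

```python
import logging

logger = logging.getLogger("templates")


def render_include(render_child, debug: bool) -> str:
    """Mimic the include behaviour: swallow errors unless debug is on."""
    try:
        return render_child()
    except Exception:
        if debug:
            raise
        logger.warning(
            "child template failed, rendering empty string", exc_info=True
        )
        return ""


def broken_child() -> str:
    # Stand-in for a child template whose rendering raises an exception
    raise RuntimeError("something went wrong while rendering")


report = "Hello,\n" + render_include(broken_child, debug=False) + "\nCheers"
print(report)
```

With debug off, the header and the footer survive and the middle is silently gone, which is the shape of the empty reports. With debug on, the same call raises instead.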

Looking For Patterns

All emails from healthchecks.io are sent through Amazon SES. SES sends me a delivery failure notification when there is, well, a delivery failure. Conveniently, these SES notifications also contain the original message. I have dug through the recent delivery failure notifications and found a few clues:

  • There were quite a few empty reports, but there were also many reports that looked fine
  • healthchecks.io is currently run from three app servers. Cross-checking with server logs, I found that one specific app server sent all the problematic reports. All of the good reports were sent from the other two.

The problematic app server has a special role in the healthchecks.io infrastructure. It is not in the load balancer’s rotation, so it does not serve web traffic. Instead, I use it for testing code changes against the production database. Let’s call this server “Canary”. The update procedure for Canary is different from that of the other two app servers. The normal update procedure is as follows:

  • Put the server in maintenance mode. It then starts signalling “I’m down” to the load balancer
  • Stop “sendalerts” and “sendreports” background tasks
  • Wait 60 seconds for the load balancer to redirect traffic from this server
  • Check out a fresh copy of the Django code, install dependencies, prepare CSS and JS bundles, copy static files, restart uwsgi, reload nginx configuration
  • Wait 10 seconds for restarts to complete
  • Take the server out of maintenance mode. Load balancer detects the server as being “up” again
  • Start “sendalerts” and “sendreports” background tasks
  • Sleep for 60 seconds so the load balancer can start sending traffic again. This step is not strictly needed. It makes sure that I wait enough time before updating the next server

Canary is not serving regular web traffic, so the update can be simple and quick:

  • Check out a fresh copy of the Django code, install dependencies, prepare CSS and JS bundles, copy static files, restart uwsgi, reload nginx configuration

Note that this shorter version does not stop or start the “sendalerts” or “sendreports” tasks.

So, here is what happened. On 18 March, I committed and deployed a backwards-incompatible database schema change, which removes the “last_ping_body” field from the Check model. I updated all three app servers to reflect this change. But the “sendreports” task on Canary did not get restarted and so kept running the previous version of the code. Whenever Canary was sending a monthly report, it would render its template. From the template, it would run an SQL query that references the now-missing “last_ping_body” field. The Django ORM would throw an exception, which the Django templating system would swallow, rendering an empty string (“”) instead. And, just like that, the end user would receive an empty report.

There were three app servers all running the “sendreports” background command. They would each wake up once a minute and check if any reports are due to be sent. One of the three was sending empty reports so, from 18 March until 22 March, about 33% of the sent monthly reports were empty.

Conclusion

I am taking several steps to make sure that a similar problem does not happen again in the future. Most importantly, Canary was not supposed to be continually running the “sendalerts” and “sendreports” tasks. I have updated the deployment scripts so that these tasks always get stopped before updating the Django app. And, for Canary, they have to be started manually.

When serious errors like bad SQL queries happen, they should be visible and loud. I will look into configuring the Django app so that it does not silently ignore exceptions from included templates. I have already made a few small changes to move database operations out of the template rendering stage. I will also be making sure that the “sendalerts” and “sendreports” tasks deliver Sentry notifications when they crash.
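One way to read “move database operations out of the template rendering stage” is to resolve everything the template needs into plain values before rendering starts. A minimal sketch of that idea; FakeCheck and prepare_rows are invented for this example and are not the project’s actual code:

```python
class FakeCheck:
    """Stand-in for the real Check model, for illustration only."""

    def __init__(self, name, status, last_ping):
        self.name = name
        self._status = status
        self.last_ping = last_ping

    def get_status(self):
        return self._status


def prepare_rows(checks):
    """Resolve template data into plain dicts, up front.

    Any exception (bad SQL, broken method) is raised here, in regular
    Python code, rather than inside a template include.
    """
    return [
        {
            "status": check.get_status(),
            "name": check.name,
            "last_ping": check.last_ping,
        }
        for check in checks
    ]


rows = prepare_rows([FakeCheck("backup", "up", "2 hours ago")])
print(rows)
```

The template then only formats values that already exist, so a failure surfaces as a normal, loud exception in the calling code.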

Apologies for the empty reports, they should not happen again!

Happy monitoring,
— Pēteris Caune, healthchecks.io

From DigitalOcean to Linode to Google Cloud Platform: the Evolution of healthchecks.io

In this article I will look at the current hosting setup of healthchecks.io, how it has evolved during the past two years, and what challenges I faced running this small but lively service.

DigitalOcean

When I first made the service public 2 years ago, it was running off a single $5/mo DigitalOcean droplet. It had a single CPU core and 512MB of RAM. The droplet was running both the Django web application and the Postgres database. Initially the service was receiving next to no traffic so everything was working well. A few months later I switched to a $20/mo droplet (two cores, 2GB RAM). I was deploying the code with a Fabric script, and had put some thought into avoiding downtime during code deploys.

Fast forward to June 2016. Feature-wise healthchecks.io was already useful, but its fault tolerance story was pretty bad. It was hosted on a single server with nightly database backups. As a first step, I decided to split up the components (database, healthchecks.io server, hchk.io server, background tasks) and run them on separate VMs. This helped some failure scenarios. For example, if the main website went down or experienced heavy traffic, hchk.io would still accept pings, and notifications would still be sent out. However, the database server going down would still be disastrous.

compose.io

The database being a single point of failure did not sit well with me, and I kept exploring options for a HA Postgres setup. It looked as if managing a fault-tolerant database cluster would be a full-time systems administration job in itself. I looked at the “pay someone else to do it” options. They were fairly limited by the tight budget I had. I could not afford Heroku Postgres with HA, for example, as it starts at $200/mo. After much consideration, I committed to go with compose.io. On paper their specs and pricing looked good, and I had tested their service with a snapshot of production data (but not with production-level traffic!). I even got to see their fail-over process in action. The database was unavailable for a few minutes during the fail-over, but it did come back up and continued to work without my intervention.

The move to compose.io did not go well. After I pointed production traffic to the new database, I soon started seeing a variety of Sentry reports about database connection problems. compose.io support advised me that my database was starved of memory, and recommended scaling up its RAM allocation. Using their convenient scaling slider, I increased the RAM allocation and my monthly bill by an order of magnitude. After doing that, I was seeing fewer dropped database connections but they were still fairly regular. At this point I was:

  • getting complaints from users
  • waking up every 3 or so hours during the nights to check up on services
  • paying significantly more than originally planned
  • not getting actionable advice from compose.io support on how to troubleshoot my connection issues

Linode

Given enough time with tcpdump and WireShark I would probably have solved my dropped connection issues. But I needed to fix things fast. I gave up the HA requirement and moved the database back to a plain VPS. This time I went with Linode for two reasons:

  • it had slightly better pricing than DigitalOcean.
  • I was going to use TLS-terminating load balancers. DigitalOcean had just launched theirs. Linode’s NodeBalancers had been around for a while and seemed a safer choice. Also, they supported IPv6, while DigitalOcean’s did not.

I updated my deployment scripts, made a migration plan, did a few dry runs, slept on it, and then migrated over to Linode. Error reports ceased, and my monthly bill was in check again. But once again the database was a single point of failure. On the bright side, I was now load balancing the incoming HTTP requests, and my service could tolerate the loss of a web server node.

The traffic that hchk.io receives comes in bursts. On average, it receives around 30 requests per second, but there is a short traffic spike (hundreds of requests at once) every five minutes, a bigger spike every round hour, and a period of elevated request rate every midnight (UTC).

Traffic spikes, as seen from Postgres

hchk.io traffic is also unusual in that every request is a “one-off” and needs to do a brand new TLS handshake. I learned that a single NodeBalancer can only do about 200 TLS handshakes per second. During the traffic spikes the load balancer was becoming a bottleneck. Requests would sometimes take 3+ seconds to complete. Clients that use aggressive timeout settings would see them as failed requests. A band-aid fix was to add a second load balancer, and split the traffic between the two using round-robin DNS.
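A back-of-the-envelope calculation shows why a burst turns into multi-second waits. The 200 handshakes per second figure comes from the text above. The burst size of 600 is an assumed number picked for illustration, and the model is deliberately crude: all requests arrive at once, and round-robin DNS splits them evenly.

```python
def drain_seconds(
    burst_size: int, handshakes_per_second: float, balancers: int = 1
) -> float:
    """Time until the last request in a burst completes its TLS handshake."""
    return burst_size / (handshakes_per_second * balancers)


one = drain_seconds(600, 200)       # a single load balancer
two = drain_seconds(600, 200, 2)    # two balancers behind round-robin DNS
print(one, two)
```

Under these assumptions the last client in the burst waits about 3 seconds with one balancer and about 1.5 seconds with two, which is why the second balancer was only a band-aid.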

I also learned that DigitalOcean’s load balancers have similar TLS handshake performance to Linode’s so they were no good either. Looking further, I found that Google’s Cloud Load Balancer can handle as many handshakes as I could throw at it. And it had IPv6 support, albeit in alpha preview state, too! So I started plotting a move to Google Cloud Platform.

Google Cloud Platform

I went through the migration process once again, and, starting from May 4 2017 healthchecks.io has been running on Google Cloud Platform. The current setup is:

  • a managed Cloud SQL database. Postgres on Cloud SQL is currently in beta and they don’t have the HA option yet, but hopefully that is coming in the future
  • three app servers, provisioned by my plain Fabric scripts
  • Google’s Cloud Load Balancer splits traffic between the three app servers

I experimented with GKE, Google’s managed version of Kubernetes, but ultimately opted to keep things simple and straightforward: plain virtual machines and plain Fabric commands for various administrative tasks.

Once I started exploring Google Cloud Platform’s logging tools I came across concerning “502 Bad Gateway” log messages coming from the load balancer. They were infrequent and so took a long time to troubleshoot (make a configuration change — monitor logs for a few days to see if errors are gone — repeat), but I am cautiously optimistic these are now fixed for good. In short, I had to tune a number of sysctl parameters and nginx options so my app servers could properly handle bursts of new connections. The following resources helped a lot:

For updating code on app servers I am using the “rolling update” pattern: take an app server out of load balancer rotation, update it, put it back in rotation, then move on to the next app server. Here is an outline of this process for a single server:

import time

# maintenance_on(), www(), hchk(), nginx() and the other helpers
# are not shown here.


def update():
    # Going down...
    maintenance_on()
    stop_sendalerts()
    stop_sendreports()
    print("sleeping for 120s")
    # Wait for the load balancer to mark this server as unhealthy
    time.sleep(120)

    # Actual update
    www()
    hchk()
    nginx()

    # Coming up...
    print("sleeping for 30s")
    time.sleep(30)
    maintenance_off()
    start_sendalerts()
    start_sendreports()
    print("sleeping for 120s")
    # Wait for the load balancer to declare this server healthy
    time.sleep(120)

maintenance_on() puts the server in “maintenance mode”. When in maintenance mode, the server can still process incoming requests, but it starts reporting itself as unhealthy to the load balancer, and the load balancer gradually diverts traffic away from it. It takes a while for the load balancer to update, so the script waits 120 seconds before it goes ahead with updating and restarting. A complete update takes some time, but it is completely transparent for the end users … as long as I am not deploying backwards-incompatible database schema changes!
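One common way to implement this kind of maintenance mode is a flag file that the health-check URL consults. This is an assumption for illustration, not necessarily how healthchecks.io does it, and the function names below are hypothetical:

```python
import os
import tempfile


def health_status(flag_path: str) -> int:
    """HTTP status to return from the load balancer's health-check URL."""
    return 503 if os.path.exists(flag_path) else 200


def enter_maintenance(flag_path: str) -> None:
    # Create the (empty) flag file
    open(flag_path, "w").close()


def leave_maintenance(flag_path: str) -> None:
    if os.path.exists(flag_path):
        os.remove(flag_path)


with tempfile.TemporaryDirectory() as tmp:
    flag = os.path.join(tmp, "maintenance")
    states = [health_status(flag)]
    enter_maintenance(flag)
    states.append(health_status(flag))
    leave_maintenance(flag)
    states.append(health_status(flag))

print(states)
```

Regular requests are unaffected by the flag. Only the health-check response changes, which is what lets the load balancer drain traffic gradually while in-flight requests finish.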

This is where healthchecks.io is now, hosting-wise. Page load times are good. No 5xx errors in the load balancer logs (fingers crossed!). The database is currently not fault tolerant but hopefully that will change in the future. Monthly bill from Google is in the $150-$200 range.

Lessons learned

When evaluating a product or service, it is imperative to test it with production-level workload. I learned this with my compose.io fiasco, and also when I hit NodeBalancer’s capacity limits.

Simple problems I can often solve myself. When I needed help with harder problems, and tried contacting Support, I was getting what I paid for, so to speak:

  • Responses from compose.io support were about as useful as Richmond’s comments on flashing lights.
  • Google charges for support separately. It starts at $150/mo for the Silver plan.
  • Linode gave me a straight and honest answer about NodeBalancer limitations, which I appreciated.

Finally, as I was looking for solutions, I explored a number of tools and technologies which I did not ultimately end up using. They all go into the “bag of tricks” and may be useful in future projects.

Meet the healthchecks.io Ops Team!

With that, thanks for reading! And, if you are not yet monitoring your cron jobs and background tasks for silent failures, I welcome you to check out healthchecks.io!

— Pēteris Caune, healthchecks.io

Cron Expressions: monitoring for jobs with fixed schedules

Cron expression support has been the most requested feature since the launch of healthchecks.io. Long story short, it’s been implemented and is ready to use! You can now set up a time-based schedule for your checks, using the exact same syntax you use in crontab files.

For each check, you can switch between “simple” and “cron” mode:


In the simple mode, you select two parameters: period and grace time. Period is how often you expect the check to be pinged. When a ping does not arrive on time, grace time specifies how long to wait before sending an alert.
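In code, the simple mode boils down to one addition. This is a sketch of the idea rather than the service’s actual implementation, and the function names are chosen for this example:

```python
from datetime import datetime, timedelta


def alert_after(
    last_ping: datetime, period: timedelta, grace: timedelta
) -> datetime:
    """Moment after which the check is considered down."""
    return last_ping + period + grace


def is_down(
    now: datetime, last_ping: datetime, period: timedelta, grace: timedelta
) -> bool:
    return now > alert_after(last_ping, period, grace)


last_ping = datetime(2017, 1, 1, 0, 0)
deadline = alert_after(last_ping, timedelta(days=1), timedelta(hours=1))
print(deadline)
```

With a period of one day and a grace time of one hour, a ping received at midnight keeps the check up until 01:00 the following day.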

In the cron mode, you specify a cron expression, a time zone, and grace time:


The cron expression defines a fixed, time-based schedule. It allows for greater flexibility than the simple “period” parameter. For example, you can set up a check that expects a ping at the beginning of every other hour, only on weekdays. Here’s the expression you would use for that: “0 0/2 * * 1-5”.
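To spell out what the five fields of “0 0/2 * * 1-5” mean, here is the same schedule written as a plain Python predicate. It is hard-coded to this one expression for illustration and is not a general cron parser:

```python
from datetime import datetime


def matches_schedule(dt: datetime) -> bool:
    """True if dt is a scheduled time for the expression 0 0/2 * * 1-5."""
    return (
        dt.minute == 0            # minute field: 0
        and dt.hour % 2 == 0      # hour field: 0/2 -> 0, 2, 4, ..., 22
        # day-of-month and month fields are "*", so they always match
        and dt.weekday() < 5      # day-of-week field: 1-5 -> Monday..Friday
    )


print(matches_schedule(datetime(2017, 1, 2, 10, 0)))  # a Monday, 10:00
print(matches_schedule(datetime(2017, 1, 7, 10, 0)))  # a Saturday, 10:00
```

Note that cron numbers weekdays with Monday as 1, while Python’s weekday() starts from Monday as 0, hence the “< 5” comparison.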

If your server’s time zone is not UTC, you must also specify its time zone. The time zone field supports auto-complete and lets you select time zones by their IANA names. On Ubuntu systems, you can look up the system’s time zone in the /etc/timezone file.

Finally, the grace time parameter works the same as in the “simple” mode. Set it to a value that comfortably covers the expected run time of your job.

Example

Let’s say you have a server that runs a backup script each morning at 6:08 AM, New York time. The backup script usually takes 1 to 2 minutes to complete and should never exceed 5 minutes. The crontab entry might look something like this:

8 6 * * * /home/user/backup.sh && curl -fsS --retry 3 https://hchk.io/fe33025a-330d-4bf0-93c4-7e433bb474da > /dev/null

For monitoring this cron job, you would set up a check as follows:

Cron expression: 8 6 * * *
Timezone: America/New_York
Grace time: 5 minutes
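The time zone setting matters because 6:08 AM in New York maps to a different UTC time in winter and in summer. A small sketch using Python’s standard zoneinfo module (Python 3.9 or newer, with time zone data available on the system). The helper name and the two sample dates are arbitrary choices for this example:

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

NEW_YORK = ZoneInfo("America/New_York")
GRACE = timedelta(minutes=5)


def alert_deadline_utc(year: int, month: int, day: int) -> datetime:
    """Expected ping time (6:08 AM New York) plus grace, expressed in UTC."""
    expected = datetime(year, month, day, 6, 8, tzinfo=NEW_YORK)
    return (expected + GRACE).astimezone(timezone.utc)


winter = alert_deadline_utc(2017, 1, 10)  # Eastern Standard Time
summer = alert_deadline_utc(2017, 7, 10)  # Eastern Daylight Time
print(winter, summer)
```

In this sketch the deadline lands at 11:13 UTC in January and at 10:13 UTC in July. A check configured in plain UTC would be off by an hour for half of the year.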

Notes for self-hosted installations

If you are self-hosting healthchecks.io code, there are a few things you will want to know.

Database triggers are not used any more. There used to be a management command, ensuretriggers, for creating a database trigger. The trigger would automatically update the api_check.alert_after field whenever a check is saved. This trigger is not needed any more and would interfere with cron-style checks. Remove it with the droptriggers management command:

./manage.py droptriggers

It is also a good idea to make a fresh backup of the database before major upgrades such as this one.

Conclusion

This is the initial release of cron expression support. It works well enough to be useful, but will still require careful testing, especially around daylight saving time handling. It may also see various small user interface refinements. If you use cron-style checks and notice any problems, please file an issue!

Adding cron expression support has been one of the more complex tasks since the start of the project, but it has been worth it. Since soft-launching the feature two weeks ago, 140+ new checks have already been set up to use cron expressions. This has been gratifying to see.

With that, happy monitoring and happy 2017!

Pēteris, 
healthchecks.io