Monitoring PostgreSQL With pgmetrics and pgDash

I am currently trialing pgmetrics and pgDash for monitoring PostgreSQL databases. Here are my notes on it.

pgmetrics is a command-line tool you point at a PostgreSQL cluster and it spits out statistics and diagnostics in a text or JSON format. It is a standalone binary written in Go, and it is open source. Here is a sample pgmetrics report.

Rapidloop, the company that develops pgmetrics, also runs pgDash – a web service that collects reports generated by pgmetrics and displays them in a web UI. pgDash is a hosted web service and has a monthly cost.

pgDash also supports alerting. For an idea of the types of alerting rules pgDash supports, here are the rules I have currently set up. This is my first go at it, and the rules will likely need tweaking:

First Steps

Here’s how you run pgmetrics:

pgmetrics --no-password <dbname>

This produces a neatly formatted plain text report. To produce output in JSON format, add “-f json”:

pgmetrics --no-password -f json <dbname>

Note: I’m running pgmetrics on the database host as the “postgres” system user. pgmetrics can also connect to the database over the network. If you specify the <dbname> parameter, pgmetrics will return detailed statistics about every table for the specified database. This parameter is optional, and you can also use it multiple times:

pgmetrics --no-password -f json first_database second_database

The next task is to submit the pgmetrics output to the pgDash API. pgDash provides a CLI tool “pgdash” for submitting the reports to their API. pgmetrics output can be piped straight into it:

pgmetrics --no-password -f json <dbname> \
| pgdash -a <api-key> report <server-name>

Tangent: Use curl Instead of pgdash

Assuming the “pgdash” tool just POSTs the report to an HTTP API, would it be possible to replace it with curl? I contacted pgDash support with this question. Their answer – it is not officially supported, but yes, it can be done. The pgDash API endpoint is https://app.pgdash.io/api/v1/report, and it expects the payload in the following form:

{"api_key": "<api-key>", 
 "server": "<server-name>", 
 "data": <the JSON document generated by pgmetrics>}

I used the jq utility to prepare the payload in the required format, and then used curl to submit it:

pgmetrics --no-password -f json <dbname> \
| jq '{"api_key":"<api-key>", "server":"<server-name>", "data": .}' \
| curl -d @- https://app.pgdash.io/api/v1/report

I also added request body compression:

pgmetrics --no-password -f json <dbname> \
| jq '{"api_key":"<api-key>", "server":"<server-name>", "data": .}' \
| gzip \
| curl --data-binary @- -H "Content-Encoding: gzip" https://app.pgdash.io/api/v1/report

The next step would have been to add curl parameters for retries and timeout, but my hack was starting to look too much like a hack, so I switched back to the pgdash CLI tool (which is open-source, by the way).
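For comparison, here is a minimal Python sketch of the same submission, including the retries and timeout the curl version lacked. The endpoint and the envelope fields are the ones quoted above; the function names, the Content-Type header, and the retry policy are assumptions made for illustration, and the endpoint itself is, as noted, not officially supported.

```python
import gzip
import json
import time
import urllib.request

REPORT_URL = "https://app.pgdash.io/api/v1/report"  # endpoint quoted above


def build_request(api_key, server, report):
    """Wrap a parsed pgmetrics report in the envelope and gzip the body."""
    payload = {"api_key": api_key, "server": server, "data": report}
    body = gzip.compress(json.dumps(payload).encode("utf-8"))
    return urllib.request.Request(
        REPORT_URL,
        data=body,
        headers={
            # Assumption: the API accepts a JSON content type.
            "Content-Type": "application/json",
            "Content-Encoding": "gzip",
        },
        method="POST",
    )


def submit(request, attempts=3, timeout=10, delay=2):
    """Send the request and return the HTTP status, retrying on errors.

    URLError (including HTTP error responses) and socket timeouts are
    both OSError subclasses, so this retries on any of them.
    """
    for attempt in range(1, attempts + 1):
        try:
            with urllib.request.urlopen(request, timeout=timeout) as response:
                return response.status
        except OSError:
            if attempt == attempts:
                raise
            time.sleep(delay)
```

Usage would look like `submit(build_request(key, "db7", json.load(sys.stdin)))`, with the pgmetrics JSON piped in on stdin.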

Cron

So far, I had assembled a command that collects database metrics, and submits them to pgDash. Here it is again:

pgmetrics --no-password -f json <dbname> \
| pgdash -a <api-key> report <server-name>

Next, I wanted to run this automatically, on a regular schedule. The obvious way to do that is a cron job. Logged in as the “postgres” system user, I ran “crontab -e” and added this line in the editor window (replacing the <dbname>, <api-key> and <server-name> placeholders with the actual values, of course):

*/5 * * * * /usr/local/bin/pgmetrics --no-password -f json <dbname> | /usr/local/bin/pgdash -a <api-key> report <server-name>

Note: cron doesn’t support line continuations with “\”, so the entire command has to be a single line.

On my system, the “pgmetrics” and “pgdash” binaries are in /usr/local/bin. I added /usr/local/bin to cron’s PATH, which let me clean up the command a little:

PATH=/bin:/usr/bin:/usr/local/bin    
*/5 * * * * pgmetrics --no-password -f json <dbname> | pgdash -a <api-key> report <server-name>

With this in place, the pgDash web service was getting a fresh report every 5 minutes. But what would happen if the cron job somehow broke? Would pgDash detect the absence of new reports, would it alert me about it? I asked pgDash support about this too, and the answer is no. The suggested solution is to use an external tool to monitor the cron job execution. No worries – as it happens, I have just the tool for this! Let’s add Healthchecks.io to the mix.

How to Watch the Watchmen

The easiest way to set up monitoring for a cron job is to create a new Check in Healthchecks.io, copy its ping URL, and add a curl call to the copied URL at the end of the normal cron command.

In the following example I am using line continuations for readability, but, again, in the actual crontab the command would need to all be on a single line:

PATH=/bin:/usr/bin:/usr/local/bin    
*/5 * * * * pgmetrics --no-password -f json <dbname> \
| pgdash -a <api-key> report <server-name> \
&& curl https://hc-ping.com/<uuid>

Here’s how this works. The final curl call runs only if pgdash exits with exit status 0. curl makes an HTTP GET request to hc-ping.com, and Healthchecks.io registers it as a “success” signal. As long as the success signals arrive on schedule, Healthchecks.io stays quiet. When a success signal doesn’t arrive on time, Healthchecks.io sends out alerts. This is already functional, but I had a few improvements in mind:

  • Measure job execution time by requesting https://hc-ping.com/<uuid>/start before pgmetrics runs
  • If pgmetrics or pgdash exits with a non-zero exit status, signal a failure by sending a request to https://hc-ping.com/<uuid>/fail
  • Capture the command’s output, and send it along with the success or failure signal
  • If a request to hc-ping.com fails or times out, retry it a few times

All of this can be done in a shell script, and even in a shell one-liner, but a simpler option is to use runitor:

runitor -uuid <uuid> -- <command goes here>

runitor takes care of all of the above – it sends the start signal, it captures stdout and stderr, and it signals success or failure depending on the command’s exit status.
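To make the moving parts concrete, here is a rough Python sketch of the same idea. It is not runitor's actual implementation; the function names, the retry count, and the output cap are assumptions chosen for illustration. The /start and /fail endpoints are the ones listed above.

```python
import subprocess
import urllib.request

PING_BASE = "https://hc-ping.com"


def send_ping(url, body=b"", attempts=3, timeout=10):
    """POST to a ping URL, retrying a few times on network errors."""
    for _ in range(attempts):
        try:
            request = urllib.request.Request(url, data=body, method="POST")
            with urllib.request.urlopen(request, timeout=timeout):
                return True
        except OSError:
            continue
    return False


def run_monitored(uuid, command, ping=send_ping):
    """Run a shell command and report start, success or failure."""
    base = f"{PING_BASE}/{uuid}"
    ping(base + "/start")
    result = subprocess.run(
        command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT
    )
    # Success goes to the bare URL, failure to /fail. The captured output
    # travels in the request body either way (capped at an arbitrary
    # 10,000 bytes in this sketch).
    suffix = "" if result.returncode == 0 else "/fail"
    ping(base + suffix, result.stdout[-10000:])
    return result.returncode
```

Note that when the command is a pipeline, the shell reports the exit status of the last command only, so a failure earlier in the pipeline can go unnoticed unless the shell is told otherwise (for example with bash's pipefail option).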

Here’s my cron job, updated to use runitor:

PATH=/bin:/usr/bin:/usr/local/bin    
*/5 * * * * runitor -uuid <uuid> -- \
bash -c "pgmetrics --no-password -f json <dbname> | pgdash -a <api-key> report <server-name>"

There’s another thing I wanted to try out: slug URLs. Healthchecks.io supports two ping URL formats:

  • uuid format: https://hc-ping.com/<uuid>
  • slug format: https://hc-ping.com/<ping-key>/<slug>

The slug format is a new feature and I wanted to dog-food it here. runitor supports the slug format starting from version v0.9.0-beta.1.
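As a small illustration of the two formats, a helper that builds either kind of ping URL could look like this. The function name and its keyword arguments are hypothetical; only the URL shapes come from the list above.

```python
PING_BASE = "https://hc-ping.com"


def ping_url(*, uuid=None, ping_key=None, slug=None, signal=""):
    """Build a ping URL in uuid format or slug format.

    `signal` is an optional suffix such as "start" or "fail".
    """
    if uuid is not None:
        path = uuid
    elif ping_key is not None and slug is not None:
        path = f"{ping_key}/{slug}"
    else:
        raise ValueError("pass either uuid, or both ping_key and slug")
    url = f"{PING_BASE}/{path}"
    return f"{url}/{signal}" if signal else url
```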

Note: my actual database name is “hc”, the server name is “db7”, and the check’s slug is “db7-pgmetrics”. In the remaining code samples, I’ll use these values instead of placeholders. The API keys in the samples are made up though.

Here’s the cron job definition, updated to use slug URLs:

PATH=/bin:/usr/bin:/usr/local/bin    
*/5 * * * * runitor -ping-key sC2Tc1MwVVWlpEItzY -slug db7-pgmetrics -- \
bash -c "pgmetrics --no-password -f json hc | pgdash -a tFAJJ5L7a4ft-qbqR5JIDA report db7"

One final tweak was to move the keys and the server name to environment variables:

SERVER=db7
PING_KEY=sC2Tc1MwVVWlpEItzY
PGDASH_KEY=tFAJJ5L7a4ft-qbqR5JIDA
PATH=/bin:/usr/bin:/usr/local/bin

*/5 * * * * runitor -ping-key $PING_KEY -slug $SERVER-pgmetrics -- \
bash -c "pgmetrics --no-password -f json hc | pgdash -a $PGDASH_KEY report $SERVER"

This way, the PING_KEY and PGDASH_KEY values don’t get logged to syslog every time the cron job runs. It also looks cleaner.

Time to test this setup. In the happy case, where pgmetrics, pgdash, and runitor all run with no issues, Healthchecks.io shows the start and success signal arriving neatly every 5 minutes:

I tested the case where pgmetrics exits with a non-zero exit code. I simulated this by changing the database name to “surprise”, which does not exist. After the next cron job run, I got an email notification from Healthchecks.io:

I also tested the case where pgdash fails. I simulated this by changing the pgDash API key to an invalid one:

The last thing left to handle was automated provisioning of the cron job.

Automated Provisioning

The next time I set up a new database server, I don’t want to copy binaries and edit crontab by hand. I want this automated. First, here’s my template for the cron job:

SERVER=%(HOSTNAME)s
PING_KEY=%(PING_KEY)s
PGDASH_KEY=%(PGDASH_KEY)s
PATH=/bin:/usr/bin:/usr/local/bin

*/5 * * * * runitor -ping-key $PING_KEY -slug $SERVER-pgmetrics -- bash -c "pgmetrics --no-password -f json hc | pgdash -a $PGDASH_KEY report $SERVER"
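The %(NAME)s placeholders are Python's dictionary-based string formatting. Assuming the template helper fills them in with the % operator, rendering the template boils down to the sketch below. The render_crontab function is hypothetical, and the key values are the made-up ones from the earlier samples.

```python
TEMPLATE = """\
SERVER=%(HOSTNAME)s
PING_KEY=%(PING_KEY)s
PGDASH_KEY=%(PGDASH_KEY)s
PATH=/bin:/usr/bin:/usr/local/bin

*/5 * * * * runitor -ping-key $PING_KEY -slug $SERVER-pgmetrics -- bash -c "pgmetrics --no-password -f json hc | pgdash -a $PGDASH_KEY report $SERVER"
"""


def render_crontab(template, context):
    """Fill %(NAME)s placeholders from a dict; raises KeyError if one is missing."""
    return template % context


crontab = render_crontab(
    TEMPLATE,
    {
        "HOSTNAME": "db7",
        "PING_KEY": "sC2Tc1MwVVWlpEItzY",
        "PGDASH_KEY": "tFAJJ5L7a4ft-qbqR5JIDA",
    },
)
print(crontab)
```

One thing to watch with this style of template: a literal percent sign has to be written as %%, and cron itself treats an unescaped % in a command as a newline, so percent signs are best avoided in the job line altogether.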

And here’s the Fabric task which uploads binaries and installs the cron job on the remote server:

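from fabric.api import env, run, sudo
from fabtools import require

# require_vault() and get_hostname() are project-specific helpers (not shown)


def pgdash():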
    # this loads secrets from an encrypted file 
    # into a global variable env.vault        
    require_vault()

    # this uploads the binary executables.
    # cannot use f-strings here because Fabric 1.x uses Python 2.7 (yep)
    # the require.file command comes from a helper library, fabtools
    for name in ["pgdash", "pgmetrics", "runitor"]:
        require.file("/usr/local/bin/" + name, source="files/" + name, mode="755")

    ctx = {"HOSTNAME": get_hostname()}
    ctx.update(env.vault)

    # read and fill out the template, upload it to a temporary 
    # file on the remote server
    require.files.template_file(
        "/tmp/postgres-crontab",
        template_source="files/postgres-crontab.tmpl",
        context=ctx,
    )

    # install the crontab and delete the temporary file
    sudo("crontab /tmp/postgres-crontab", user="postgres")
    run("rm /tmp/postgres-crontab")

One interesting thing here is how the cron job definition gets installed. When I set up a new cron job interactively, I run “crontab -e” and a text editor opens. I type or paste the new job, save, and exit the editor. This method would be hard to automate, but there is an automation-friendly way:

crontab <filename>

This replaces the current user’s existing cron jobs with whatever is in the file referenced by <filename>. Nice and simple!


I now have continuous pgDash monitoring set up for the Healthchecks.io primary and standby database servers. I can look at the reported data and see which indexes are bloated, how far away the transaction ID exhaustion event is, which Postgres configuration settings need tuning, and all that good stuff. Thanks for reading and happy monitoring!

Healthchecks Turns 6, Status Update

Time flies and Healthchecks.io is already 6 years old. Here’s a quick review of notable recent events and the project’s current state.

Database Migration

The Healthchecks.io database used to run PostgreSQL 10. In March 2021 I migrated it to PostgreSQL 13. For the upgrade method, I used logical replication, as suggested on Reddit.

The idea is to set up a Postgres 13 replica, replicate the data to it, and then fail over to it. But there are of course several gotchas, and everything has to be thoroughly tested beforehand. I found this guide and worked through it. I made a step-by-step migration plan and tested it on Vagrant VMs. I then iteratively improved the plan and did more test migrations until everything was working smoothly, and I knew the order of commands to run almost by heart.

Then it was time to announce maintenance, provision new hardware (two Ryzen 5950X machines: 16 cores, 64GB RAM, and 2x4TB NVMe drives for each, aw yiss), set them up, and do the migration for real. And it all worked as planned!

Wireguard

Hetzner has a feature called vSwitch for setting up private networks between hosts. I had it set up, and the infrastructure servers (load balancers, app servers, databases) were communicating between themselves over internal IPs.

In my experience, vSwitch turned out to be less reliable than the regular network. There was an incident where the vSwitch network interface on one machine was not working while the public interface was still fine. The issue got resolved after contacting Hetzner support, but I decided to go back to using public interfaces. I used firewall rules to control which IPs can connect to which ports.

Although Hetzner support says their internal network is secure, and customers cannot snoop on other customers’ traffic, I wanted to reduce the trust placed in Hetzner, and set up Wireguard tunnels between the servers. I did not use Tailscale or anything fancy like that, just a few Fabric recipes for initial setup, and for updating peers (when a server is added to or removed from the network).

A small gotcha here was services not always automatically starting after system reboot. I had to tweak systemd service definitions to make sure network-dependent services (nginx, postgres) start only after Wireguard has initialized.

Self-hosted Postgres, bespoke Wireguard tunnels, can you hear the innovation tokens burning up yet? 🙂

Signal

Healthchecks.io has had a Signal integration for a couple of months now. I think Signal has been the trickiest to implement and set up so far. Unlike most other services, Signal does not have a public HTTP API you can call to send messages. Instead, you have to run a Signal client locally and communicate with it to send the messages. Luckily there is signal-cli, a wrapper around the official Signal Java client library. I run signal-cli under a separate OS user account, and Healthchecks communicates with it over DBus (details). Multiple app servers are sending out notifications, each one runs signal-cli, and all signal-cli instances are linked to a single Signal account (phone number).

After deploying and announcing the Signal integration, I was glad to see a quick uptake:

  • SMS was introduced in July 2017, and has approx. 500 configured integrations
  • WhatsApp was introduced in July 2019, and has approx. 450 configured integrations
  • Signal was introduced in January 2021, and has approx. 350 configured integrations

When looking at these numbers, one factor to keep in mind is that SMS and WhatsApp have a minimal sending quota in free accounts (because sending these notifications costs money), while Signal is unrestricted.

Dark Mode

Healthchecks now has an optional dark theme. You can activate it in Account Settings – Appearance.

Implementing dark mode was, as expected, lots of work, and there is more work left. Aside from the obvious – page background, body text, panels, buttons – various other bits needed theming, each in their specific way:

  • Bootstrap components like menus
  • Selectize dropdowns
  • Period and Grace sliders
  • The icon font with integration logos
  • Syntax highlighting for code samples

It was interesting work. I use Sublime Text, and found the Color Highlighter plugin very handy when working with colors:

After publishing the initial dark mode implementation, I was happy to see people starting to use it. It was not work for nothing: a significant number of users prefer the dark mode over the default!

Fuzz Testing croniter, Introducing cronsim

Healthchecks.io had an incident where a single bad cron expression caused system-wide issues. The bad cron expression was making the croniter library throw an unexpected exception. This led to a crash-restart loop in the notification sending process. The initial fix was to add “try .. except” around croniter calls, but I later also spent time fuzz testing croniter. I found and filed several crashing issues. The worst one had to do with expressions like “0-1000000000 * * * *”. By varying the number of zeroes I could get the Python process to use up all system memory and eventually crash. I reported this issue privately in January 2021, and the maintainer fixed it the same day.

After digging around in the croniter code, I wanted to try my hand at writing a slimmed-down version. And so I did: welcome cronsim. It is 250 lines of code, and it does just one thing: it takes a cron expression and returns a datetime iterator.
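To give a feel for the "cron expression in, datetime iterator out" idea, here is a deliberately naive Python sketch. It is not how cronsim or croniter work internally: it supports only numbers, ranges, lists and steps, requires day-of-month and day-of-week to both match (real cron treats them differently when both are restricted), ignores time zones and DST, and steps forward one minute at a time. All names are made up for illustration. Checking bounds before expanding a range is what keeps an expression like “0-1000000000 * * * *” from allocating a huge set.

```python
from datetime import datetime, timedelta

# Allowed values: minute, hour, day of month, month, day of week (0 = Sunday)
RANGES = [(0, 59), (0, 23), (1, 31), (1, 12), (0, 6)]


def expand(field, lo, hi):
    """Expand one field into a set of ints, validating bounds first."""
    values = set()
    for part in field.split(","):
        rng, _, step = part.partition("/")
        if rng == "*":
            start, end = lo, hi
        else:
            start, _, end = rng.partition("-")
            start = int(start)
            end = int(end) if end else start
        if not lo <= start <= end <= hi:
            raise ValueError(f"bad field: {part!r}")
        values.update(range(start, end + 1, int(step or 1)))
    return values


def cron_iter(expression, after):
    """Yield matching datetimes strictly after `after`, minute by minute.

    Loops forever on expressions that can never match (e.g. February 31).
    """
    fields = expression.split()
    if len(fields) != 5:
        raise ValueError("expected five fields")
    minutes, hours, days, months, weekdays = (
        expand(f, lo, hi) for f, (lo, hi) in zip(fields, RANGES)
    )
    t = after.replace(second=0, microsecond=0)
    while True:
        t += timedelta(minutes=1)
        # Python counts Monday as 0; cron counts Sunday as 0.
        cron_weekday = (t.weekday() + 1) % 7
        if (t.minute in minutes and t.hour in hours and t.day in days
                and t.month in months and cron_weekday in weekdays):
            yield t
```

For example, `next(cron_iter("*/15 9 * * *", datetime(2021, 7, 1, 8, 50)))` gives 09:00 on the same day.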

I’ve tested cronsim with a large corpus of cron expressions, and, for every expression I tested, it produced the same results as the croniter library. Except for one case, where both libraries produce incorrect results: the handling of daylight saving time (DST) transitions. Getting this right has been surprisingly hard, and I have not cracked this problem yet. But I did come up with a cool toy: I installed a Debian system inside qemu (instructions) and used qemu emulator flags to speed up the system clock inside the VM. With this contraption, I can test cron expressions with the actual running Debian cron daemon, and see results in minutes instead of hours or days. Anyway, more work is needed here.

Development Roadmap

The default plan is to continue making small iterative improvements.

In the background, I am also bouncing around ideas about product architecture and reliability. One area is the reliability of the Ping API. Whenever a client makes an HTTP request to a ping endpoint, there is a small but non-zero probability the request will fail due to TCP packet loss. The probability increases as the distance from the client to the server increases. It would be ideal to put the server close to the client. There are different ways to go about this and lots to explore. One potential building block is CockroachDB. Very impressively, in my testing the Healthchecks test suite passed with the CockroachDB backend out of the box. It Just Worked, but to make it perform well, I would need to make several changes. For example, the big and write-heavy “api_ping” table has an auto-incrementing integer primary key. It would not work well in a distributed database.

Healthchecks.io the Business

I’ve reduced my other work commitments, and Healthchecks.io is now my main occupation and my main source of income. Not quite “full-time” yet, but getting there!

I regularly update the About page with running stats (ping volume, the number of users, revenue, …), you can check out the numbers there!

As the project’s revenue slowly creeps up, I start to get more regular “Acquisition?” emails. I don’t have plans to sell the project in the foreseeable future. Too much work and soul put into it, and I also simply enjoy working on it and running it (aside from dealing with infra outages I have no control over, these are not fun at all!).

That’s it for now, thank you for reading! Here’s to another 6 years, and in closing, here’s a complimentary picture of me attacking a hornet nest with a pressure washer:

Happy monitoring,
Pēteris,
Healthchecks.io

Everything Privacy

Here’s a look back at the privacy-related changes and milestones of the Healthchecks.io website. If you also run a small SaaS, feel free to compare notes. If you have suggestions or questions, please let me know!

Dec 2015, Published Initial Privacy Policy

I was setting up payments via Braintree, and it required my site to have a privacy policy.

At the time, anything privacy-related was entirely off my radar. The thought “I need to formulate and publish a privacy policy” had not even crossed my mind.

I used an online service, privacypolicies.com, to generate a generic Privacy Policy document. I added it to the site, and that satisfied the Braintree requirement.

Jun 2016, Published Terms & Conditions

I was adding a PayPal payment option, and PayPal additionally required the site to have Terms & Conditions.

I used TermsFeed to generate a generic Terms & Conditions document for $50.

May 2018, Updated Privacy Policy for GDPR

Leading up to the GDPR coming into force, I was looking at what I needed to do to prepare. On the technical side, the site seemed to already be in good shape. It was not using any advertising or tracking cookies. It was not collecting any unneeded information. It was using the collected information only in the intended way (email addresses to send notifications, phone numbers to send SMS notifications, etc.). The “Close Account” function was there, letting users remove their data from Healthchecks.io systems at any time, without assistance from Support (me).

The Privacy Policy, however, seemed to need updating. Figuring out what needs to go in the privacy policy was frustrating. I spent a fair bit of time comparing other companies’ privacy policies, looking for templates, and reading conflicting advice. In the end, I went to Fiverr and looked up somebody who claimed to be a lawyer specializing in GDPR. I ordered a custom, GDPR-compliant Privacy Policy for $250 from them. I was then directed to a form with a number of questions about my company, and, sometime later, they had produced the document.

The new Privacy Policy was, unsurprisingly, a template job. If I took a sentence from it and plugged it into Google, I could find other very similar privacy policies. But that’s to be expected for the price, and it was better than what I had before. So I went ahead and published it.

Mar 2019, Implemented Inactive Account Deletion

Data is not an asset, it is a liability.

I implemented a system that automatically removes abandoned accounts. If an account is inactive for a full year, the system sends an email notification. The notification basically says, “Sign in in the next 30 days, or we will delete your account”. If the account is still inactive 30 days later, the system deletes the account.
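The rule can be expressed as a small decision function. The sketch below is an illustration of the logic as described, not the actual Healthchecks.io code; the function name, argument names, and the use of 365 days for "a full year" are assumptions.

```python
from datetime import datetime, timedelta

INACTIVE_AFTER = timedelta(days=365)  # "a full year", approximated
GRACE = timedelta(days=30)


def next_action(last_active, notice_sent, now):
    """Return "notify", "delete" or None for a single account.

    last_active: when the user last signed in
    notice_sent: when the deletion notice was last sent, or None
    """
    if now - last_active < INACTIVE_AFTER:
        return None  # active, or reactivated after a notice
    if notice_sent is None or notice_sent < last_active:
        return "notify"  # no notice sent for this period of inactivity
    if now - notice_sent >= GRACE:
        return "delete"
    return None  # notice sent, grace period still running
```

Comparing notice_sent with last_active means a user who signs in after a notice, and then goes quiet for another year, gets a fresh notice instead of being deleted straight away.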

There is a neat side-benefit to sending the deletion notices: they can sometimes “reactivate” old users. I haven’t investigated how often that happens, though.

Jun 2019, Stopped Using Cloudflare Load Balancing

I started running my own HAProxy instances on bare metal servers. I did this mainly because I wanted better, lower-level control of the load balancers. But it also improved the privacy aspects: Cloudflare was no longer proxying my traffic. I’m still using Cloudflare as a DNS provider to this day.

Note: I was and still am a fan of Cloudflare. Nevertheless, there is one less thing to worry about GDPR-wise if the traffic does not go through them.

Jul 2019, Rewrote Privacy Policy

I wanted to add a list of data sub-processors to the Privacy Policy but ended up redoing it entirely. I used UptimeRobot’s privacy policy as a base (with their permission) and went at editing it. This time, I used Fiverr services only for proofreading my edited version.

Sep 2019, Improved Database Backups

Every day, the database server creates a full database dump, encrypts it, and uploads it to an S3 bucket. It does this in a cron job. (And, of course, I have monitoring set up for the cron job!)

I made a few DevOps-y improvements there:

  • Moved the storage location from us-east-1 (N. Virginia) to eu-central-1 (Frankfurt)
  • Added a lifecycle rule to delete backups older than 45 days. That’s one less thing I need to do manually every month!

May 2020, Statuspage.io Cookie Saga

I’ve written a separate blog post about this, but the short version is: I discovered that status.healthchecks.io sets tracking cookies. That was not OK. Several months and several hundred emails later, Atlassian removed the tracking cookies.

Jul 2020, Migrated Email Sending to AWS EU Region

Healthchecks.io uses AWS SES to send email notifications. Like backups and S3, I decided to switch from us-east-1, their default region, to eu-central-1. I was not aware of Schrems II at the time; I only wanted to move the SMTP servers closer to my servers for reliability.

There is a privacy benefit on paper, although I’m sure AWS engineers in the U.S. can access AWS infrastructure in the EU, so the Schrems II concerns still apply.

Sep 2020, Removed Customer Data From Accounting Reports

I outsource Healthchecks.io accounting to a local accounting company. At the start of every month, I collect all invoices and bank statements and send them off. They process the documents and prepare the tax reports.

I realized that some of the statements contain personal information. For example, PayPal’s monthly statement contains customer names and email addresses. I checked with the accountants, and they confirmed they don’t need the names or emails for anything. So, I started scrubbing the personal information from the statements before sending them each month.

Nov 2020, Closed ChartMogul account

In light of Schrems II, I was reviewing the list of Healthchecks.io data sub-processors based in the U.S.; there were four:

  • Amazon (emails)
  • Twilio (SMS, WhatsApp, voice calls)
  • Braintree / PayPal (subscription management, CC and PayPal payments)
  • ChartMogul (revenue analytics)

The first three were essential and not easy to replace. ChartMogul, however, was merely nice-to-have. It was also the only one with no mention of Standard Contractual Clauses anywhere in its Data Processing Agreement. So I decided to stop using it and closed my account.

Dec 2020, Migrated from Zoho Mail & Gmail to Fastmail

For receiving and sending email at contact@healthchecks.io, I had cobbled together a Zoho and Gmail setup: Zoho was receiving email on my custom domain and forwarding it to my personal Gmail address. This was back in 2015 when the service was not yet generating any revenue.

This winter holiday break, I moved email hosting to Fastmail ($50 / year). It’s a simpler setup, and I am more comfortable as a paying customer of Fastmail than a free user of Zoho and Google.


And this is where we are now. Now, why do I care about privacy anyway? I’ve thought about it.

In my experience, a company’s privacy practices are an indicator of its general “wholesomeness.” An obnoxious cookie banner is a sign of more dark patterns to come. On the other hand, privacy-first companies tend to treat their customers with respect in other aspects as well.

And the other thing. While most users probably won’t ever read the Privacy Policy or care what email hosting Healthchecks.io uses, if it’s important to me, then I work on it. As Ocramius said in their Why I do open source? article:

This corner of the codeverse is mine to decide where engineering steers towards, and this capability is extremely precious to me.

And, with that, thanks for reading!
Pēteris

Two-factor Authentication

Healthchecks.io now supports two-factor authentication using the WebAuthn standard. Here is how it works: in the Account Settings page, users can see their registered FIDO2 security keys and register new ones:

When logging in, if the account has any registered keys, Healthchecks requires the user to authenticate with one of their keys:

Users can register multiple keys, users can give their keys nicknames, and users can remove registered keys. Removing the last key deactivates two-factor authentication. And, from the user’s perspective, for now at least, that’s mostly it!

There are some nuances on the UI side, and there are quite a few subtle things on the technical side to deal with. Here are a few examples.

If the user has just one registered security key, losing the key means losing access to their account. It is good to have a second, backup key and store it separately. I added a note in the UI about that:

When the user removes their last security key, they are effectively also disabling the two-factor authentication for their account, and should be aware of it:

In high-risk situations (add security key, remove security key, change email address, change password, close account) the service should require the user to re-authenticate. My solution here is to send a six-digit confirmation code to the user’s email and require the user to enter it back.

When the user enters the correct code, they can continue to the sensitive action, and will not be asked to enter another code for the next 30 minutes. The code entry form uses rate limiting to prevent brute-force attacks.
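Here is a minimal Python sketch of that flow: a random six-digit code, a cap on the number of guesses, and a 30-minute window after a successful check. It is an illustration under assumptions, not the Healthchecks implementation; the class name, the attempt limit of 5, and the in-memory state are all made up for the example.

```python
import hmac
import secrets
import time


class ConfirmationGate:
    def __init__(self, max_attempts=5, window=30 * 60, clock=time.time):
        self.max_attempts = max_attempts
        self.window = window  # seconds a successful check stays valid
        self.clock = clock
        self.code = None
        self.attempts = 0
        self.confirmed_at = None

    def issue_code(self):
        """Generate a fresh six-digit code (to be emailed to the user)."""
        self.code = f"{secrets.randbelow(10**6):06d}"
        self.attempts = 0
        return self.code

    def check(self, submitted):
        """Return True on a correct code; refuse once the cap is reached."""
        if self.code is None or self.attempts >= self.max_attempts:
            return False
        self.attempts += 1
        # Constant-time comparison, on bytes so non-ASCII input is safe.
        if hmac.compare_digest(submitted.encode(), self.code.encode()):
            self.confirmed_at = self.clock()
            self.code = None  # each code can be used once
            return True
        return False

    def is_confirmed(self):
        """True while the last successful check is within the window."""
        if self.confirmed_at is None:
            return False
        return self.clock() - self.confirmed_at < self.window
```

A view guarding a sensitive action would call is_confirmed() first, and only issue and email a new code when it returns False.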

For implementation, I used the fido2 Python library by Yubico. They provide a sample Relying Party implementation, which I used as a reference. Yubico also provides a WebAuthn Developer Guide. It is a good resource with the right level of detail, and I ended up reading and re-reading it multiple times.

To clear up specific questions (“What are the requirements for the user handle?”, “What should the relying party identifier look like?”, …), I had to look at the W3C WebAuthn specification a few times as well.

This is a preliminary implementation. I’ve personally tested it with several types of security keys on Firefox and Chrome. If you experience any issues with registering or authenticating with your security key(s), please report it!

Will there be support for other 2FA methods: SMS, TOTP?

SMS – no. TOTP – potentially, if there is significant demand for it.

That’s all for now, thanks for reading!
Pēteris,
Healthchecks.io

Using Github Actions to Run Django Tests

I recently found out Travis CI is ending its free-for-opensource offering, and looked at the alternatives. I recently got badly burned by giving an external CI service access to my repositories, so I am now wary of giving any service any access to important accounts. Github Actions, being a part of Github, therefore looked attractive to me.

I had no experience with Github Actions going in. I have now spent maybe 4 hours total tinkering with it. So take this as “first impressions,” not “this is how you should do it.” I’m a complete newbie to Github Actions, and it is just fun to write about things you have just discovered and are starting to learn.

My objective is to run the Django test suite on every commit. Ideally, run it multiple times with different combinations of Python versions (3.6, 3.7, 3.8) and database backends (SQLite, PostgreSQL, MySQL). I found a starter template, added it in .github/workflows/django.yml, pushed the changes, and it almost worked!

The initial workflow definition:

name: Django CI

on:
  push:
    branches: [ master ]
  pull_request:
    branches: [ master ]

jobs:
  build:

    runs-on: ubuntu-latest
    strategy:
      max-parallel: 4
      matrix:
        python-version: [3.6, 3.7, 3.8]

    steps:
    - uses: actions/checkout@v2
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v2
      with:
        python-version: ${{ matrix.python-version }}
    - name: Install Dependencies
      run: |
        python -m pip install --upgrade pip
        pip install -r requirements.txt
    - name: Run Tests
      run: |
        python manage.py test

And the results:

Tests fail, missing dependency

Everything looked almost good except for a few missing dependencies. Some dependencies for optional features (apprise, braintree, mysqlclient) are not listed in requirements.txt but are needed for running the full test suite. After adding an extra “pip install” line in the workflow, the tests ran with no issues.

Adding Databases

If you run the Healthchecks test suite with its default configuration, it uses SQLite as the database backend which usually Just Works. You can tell Healthchecks to use PostgreSQL or MySQL backends instead by setting environment variables.
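As a rough idea of how a Django settings module can pick the backend from the environment, consider the sketch below. The DB, DB_HOST, DB_PORT and DB_PASSWORD variable names match the ones used in the workflow on this page; DB_NAME, DB_USER, the defaults, and the function itself are assumptions for illustration and may differ from the actual Healthchecks settings.

```python
import os

ENGINES = {
    "postgres": "django.db.backends.postgresql",
    "mysql": "django.db.backends.mysql",
}


def database_settings(environ=os.environ):
    """Build a Django DATABASES dict from environment variables."""
    db = environ.get("DB", "sqlite")
    if db == "sqlite":
        return {
            "default": {
                "ENGINE": "django.db.backends.sqlite3",
                "NAME": environ.get("DB_NAME", "hc.sqlite"),
            }
        }
    if db not in ENGINES:
        raise ValueError(f"unsupported DB: {db!r}")
    default_user = "postgres" if db == "postgres" else "root"
    return {
        "default": {
            "ENGINE": ENGINES[db],
            "NAME": environ.get("DB_NAME", "hc"),
            "USER": environ.get("DB_USER", default_user),
            "PASSWORD": environ.get("DB_PASSWORD", ""),
            "HOST": environ.get("DB_HOST", ""),
            "PORT": environ.get("DB_PORT", ""),
        }
    }


# In settings.py this would be: DATABASES = database_settings()
```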

Looking at the Github Actions documentation suggested I should use service containers. By using special syntax, you tell Github Actions to start a database in a Docker container before the rest of the workflow execution starts. You then pass environment variables with the database credentials (host, port, username, password) to Healthchecks. It took me a few failed attempts to get it running, but I got it figured out relatively quickly:


name: Django CI

on:
  push:
    branches: [ master ]
  pull_request:
    branches: [ master ]

jobs:
  build:

    runs-on: ubuntu-20.04
    strategy:
      max-parallel: 4
      matrix:
        db: [sqlite, postgres, mysql]
        python-version: [3.6, 3.7, 3.8]
        include:
          - db: postgres
            db_port: 5432
          - db: mysql
            db_port: 3306

    services:
      postgres:
        image: postgres:10
        env:
          POSTGRES_USER: postgres
          POSTGRES_PASSWORD: hunter2
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
        ports:
          - 5432:5432
      mysql:
        image: mysql:5.7
        env:
          MYSQL_ROOT_PASSWORD: hunter2
        ports:
          - 3306:3306
        options: >-
          --health-cmd="mysqladmin ping"
          --health-interval=10s
          --health-timeout=5s
          --health-retries=3
    steps:
    - uses: actions/checkout@v2
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v2
      with:
        python-version: ${{ matrix.python-version }}
    - name: Install Dependencies
      run: |
        python -m pip install --upgrade pip
        pip install -r requirements.txt
        pip install braintree mysqlclient apprise
    - name: Run Tests
      env:
        DB: ${{ matrix.db }}
        DB_HOST: 127.0.0.1
        DB_PORT: ${{ matrix.db_port }}
        DB_PASSWORD: hunter2
      run: |
        python manage.py test

And here is the timing of a sample run:

Making It Quick

One thing that bugged me was that the database containers took around one minute to initialize. Additionally, both the PostgreSQL and MySQL containers would initialize on all jobs, even the jobs only needing SQLite. This is not a huge issue, but my inner hacker still wanted to see if the workflow could be made more efficient. With a little research, I found that the Github Actions runner images come with various preinstalled software. For example, the “ubuntu-20.04” image I was using has both MySQL 8.0.22 and PostgreSQL 13.1 preinstalled. If you are not picky about database versions, these could be good enough.

I also soon found the install scripts Github uses to install and configure the extra software. For example, this is the script used for installing postgres. One useful piece of information I got from looking at the script is: it does not set up any default passwords and does not make any changes to pg_hba.conf. Therefore I would need to take care of setting up authentication myself.

I dropped the services section and added new steps for starting the preinstalled databases. I used the if conditionals to only start the databases when needed:

name: Django CI

on:
  push:
    branches: [ master ]
  pull_request:
    branches: [ master ]

jobs:
  test:

    runs-on: ubuntu-20.04
    strategy:
      matrix:
        db: [sqlite, postgres, mysql]
        python-version: [3.6, 3.7, 3.8]
        include:
          - db: postgres
            db_user: runner
            db_password: ''
          - db: mysql
            db_user: root
            db_password: root

    steps:
    - uses: actions/checkout@v2
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v2
      with:
        python-version: ${{ matrix.python-version }}
    - name: Start MySQL
      if: matrix.db == 'mysql'
      run: sudo systemctl start mysql.service
    - name: Start PostgreSQL
      if: matrix.db == 'postgres'
      run: |
        sudo systemctl start postgresql.service
        sudo -u postgres createuser -s runner
    - name: Install Dependencies
      run: |
        python -m pip install --upgrade pip
        pip install -r requirements.txt
        pip install apprise braintree coverage coveralls mysqlclient
    - name: Run Tests
      env:
        DB: ${{ matrix.db }}
        DB_USER: ${{ matrix.db_user }}
        DB_PASSWORD: ${{ matrix.db_password }}
      run: |
        coverage run --omit=*/tests/* --source=hc manage.py test
    - name: Coveralls
      if: matrix.db == 'postgres' && matrix.python-version == '3.8'
      run: coveralls
      env:
        GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

And the resulting timing:

There is more time to be gained by optimizing the “Install Dependencies” step. Github Actions has a cache action which caches specific filesystem paths between job runs. One could figure out the precise location where pip installs packages and cache it. But this is where I decided it was “good enough” and merged the workflow configuration into the main Healthchecks repository.

In summary, first impressions: messing with Github Actions is good fun. The workflow syntax documentation is good to get a quick idea of what is possible. The workflow definition ends up being longer than the old Travis configuration, but I think the extra flexibility is worth it.

Codeship Security Incident’s Impact on Healthchecks.io

Recently, Codeship reported a security incident: a database containing their production data had been exposed for over a year.

I have a Codeship account. I also have a Bitbucket account, and my Codeship account was authorized to access it. In the Bitbucket account, I have a private repository with various secrets (API keys, access tokens) that Healthchecks.io production environment uses to talk with the database and external services.

I was storing the secrets in Bitbucket in an unencrypted form. If leaked, they would allow an attacker to do things like:

  • send emails impersonating Healthchecks.io (bad)
  • send SMS messages from the Healthchecks.io number (bad, can get expensive)
  • access Healthchecks.io customers’ billing addresses (very bad)

I have found no evidence that my private Bitbucket account was accessed or any secrets leaked. At the same time, I cannot conclusively prove it was not possible for the secrets to leak.

I debated with myself whether I should write this post. It boils down to a moral dilemma: do I write a full disclosure about what is likely a non-event, and risk a reputation hit? Or do I handle it quietly and just let it pass?

Codeship and Bitbucket

I found Codeship’s security notification on October 2 in my spam folder. According to Codeship, their database was exposed from June 2019 to June 2020. There is evidence that the attacker was actively using data from their database.

I am not using Codeship for anything Healthchecks-related, but I had granted Codeship access to my Bitbucket account for an unrelated project. The grant was required to set up automatic Codeship builds on each Bitbucket commit. Unfortunately, it looks like Codeship asks for way too many permissions:

I wish I had read this carefully and thought about the implications when I was setting this up back in January 2016.

After granting access, Bitbucket gives Codeship an OAuth access token, which ends up in Codeship’s database. Using the OAuth token, Codeship (or its attacker) can access and manipulate the user’s repositories.

After becoming aware of the incident, I revoked Codeship’s access from Bitbucket, and checked all my repositories for any rogue access keys or unexpected changes. I asked Codeship support if the Bitbucket access token could have been exposed. They said:

The Bitbucket token was potentially exposed. I would recommend revoking this token and looking through your repo history for any suspicious activity, although so far we have not heard of any Bitbucket-focused activity.

I asked Bitbucket support for API call logs. They did prepare and send a report, but unfortunately they only have data from the previous 3-4 weeks.

Healthchecks.io Secrets in Bitbucket

In Bitbucket, I have a private repository with deployment scripts. These are Fabric scripts for bootstrapping new machines, deploying and updating software, and various maintenance tasks. The repository also contains a local_settings.py file with all the production API keys and tokens the Healthchecks app needs to run.

Storing unencrypted secrets in version control is, of course, a mistake on my part. I have fixed this and am now using Mozilla’s sops to encrypt the secrets. The GPG encryption key is on a Yubikey.

Rotating Secrets

Rotating all secrets was the most time-consuming and stress-inducing part. I had never rotated most of the secrets before, so I had to figure out a safe update procedure for each.

Database. One of the secrets was the PostgreSQL database password. Access to the database is also restricted at the firewall and the pg_hba level, so the password alone is not enough to access the database. Of course, I still wanted to change the password. The procedure was:

  • Create a secondary database user
  • Switch all services to use the secondary user
  • Update the password of the primary database user
  • Switch all services back to use the primary user

Sounds simple enough, but it involved lots of planning, checking, and double-checking. White knuckles and lots of coffees.

Services that support seamless key rotation: AWS SMTP credentials, Braintree, Matrix, Pushbullet, Sentry, Twilio. The process:

  • Create a secondary API key while the primary key still works
  • Update all services to use the secondary key
  • Promote the secondary key to the primary (or delete the primary, leaving just one)

I updated the keys one-by-one, checking everything after each step. In some cases, I kept the old key active for a couple days to make sure no production system was still using it.

Services that only support resetting the keys in place: Discord, LINE Notify, OpsDash, Pushover, Slack, Telegram. For these, the process is “quick hands.” Regenerate the secret, then update the production machines with the new secret as quickly as possible. Again, going through them one by one and testing everything after each change.

SSL certificates. I had the SSL certificates for healthchecks.io, hc-ping.com, and hchk.io purchased from Namecheap. They do support reissuing certificates and revoking the old ones. One complication here was that, for each hostname, I was using both RSA and ECDSA certificates (provisioned as instructed by Namecheap: issue RSA first, then reissue as ECDSA). To minimize the chance of making a mistake, I decided to purchase and deploy new certificates from a different provider and only afterwards revoke the existing certificates. I discovered SSLMate, and got the certificates from them. I liked the process of ordering certificates from the command line. It’s a big improvement over the usual process of copy-pasting CSRs around. I installed the new certificates, tested the setup thoroughly, and revoked the old certificates a few days later.

There is still one task on my list: Django’s SECRET_KEY setting. I’m investigating the impact of changing it. One issue here is that Healthchecks.io uses SECRET_KEY as one component in a hash function when generating the badge URLs. Naively changing SECRET_KEY would invalidate all existing badge URLs. Coupling badge URL generation and SECRET_KEY was a seemingly small design decision, and now it is coming back to bite me years later! I will deal with this but wanted to get this post out with no further delay.
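To make the coupling concrete, here is a minimal sketch. The exact hashing scheme Healthchecks.io uses is not shown in this post, so the function name, the choice of SHA-1, the order of the inputs, and the truncation to eight characters are all assumptions. The only point is that the URL component depends on SECRET_KEY.

```python
import hashlib


def badge_signature(username, tag, secret_key):
    """Derive a short, hard-to-guess URL component (hypothetical scheme)."""
    payload = (username + ":" + tag + ":" + secret_key).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()[:8]


old = badge_signature("alice", "backups", "old-secret")
new = badge_signature("alice", "backups", "new-secret")

# Same user and tag, different SECRET_KEY: the URL component changes,
# so badge URLs already embedded in READMEs and wikis would stop resolving.
print(old, new)
```

One possible way out, under the same assumptions, would be a separate setting used only for badge signatures, so that SECRET_KEY can be rotated independently.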

Closing Thoughts

As I wrote in the beginning, I have no evidence that any Healthchecks.io secrets have been leaked or used. Still, they should not have been accessible to third parties (Bitbucket and Codeship) in the first place – I’ve now fixed this mistake.

As always, if you have any questions, please write to contact@healthchecks.io.

– Pēteris, Healthchecks.io

About Tracking Cookies on status.healthchecks.io

status.healthchecks.io used to set an “ajs_anonymous_id” tracking cookie. I’m happy to report that it does not do that anymore since September 22, 2020. In this post, I’ll share the process I went through to get the tracking cookie removed.

For powering status.healthchecks.io, I am using a third-party hosted status page provider, Statuspage.io, by Atlassian. I initially set it up in May 2020 and wrote about it on this blog. After the setup, while poking around, I discovered my fancy new status page sets a tracking cookie. It does not ask for the user’s consent, and it does not obey the “DNT” header – when you visit the page, you get a tracking cookie. 

I believe this cookie was only used for innocuous purposes (tracking the number of unique page visitors), but it still invades site visitors’ privacy and violates GDPR requirements. On May 7, I submitted a support ticket asking to remove the tracking cookie and got a reply with a bottom line: “We can’t avoid setting these cookies.” After asking again, I got back a non-committal “I will forward this to our product team and development team,” and that was that.

I had already invested a significant amount of time setting up automation and custom metrics for the status page. And, aside from the cookie issue, I was generally happy with the product. Before switching providers over this one issue, I wanted to take a crack at fixing it. It was unlikely Atlassian would spend any engineering resources just because a single $29/mo customer had complaints. So I needed to bump up the priority of the issue. I searched around for other Statuspage.io customers and started contacting them. My email template went through several iterations until I got to a version that felt transparent and not manipulative:

Subject: Cookies on status.somedomain.com
Hello,

when I visit status.somedomain.com I see it stores the following cookies in my browser:

* ajs_anonymous_id
* ajs_group_id

These are Atlassian’s tracking cookies. They are not essential, and so under GDPR they require the user’s explicit opt-in before they can be sent to the browser.

I am an Atlassian Statuspage customer myself, and my service’s status page has the exact same problem. I’ve contacted Atlassian about this but this appears to be low priority for them.

I am contacting you because I think more affected customers being aware of the issue and asking Atlassian to fix it = higher chance that they will actually do something.

Thanks,
Pēteris Caune

I started by manually sending ten or so emails out every week. I mostly got sympathetic and cooperative responses. There were some funny ones too. For example, one guy insisted that there is no problem because he could not reproduce the issue using “internal methods.” Me showing him the results of several different cookie scanning services (cookie-script.com, cookiebot.com) did not sway him.

I kept contacting other companies, and they sometimes forwarded me the responses they were getting from Atlassian. From these responses, it didn’t look like we were making much progress. In July, two months in, I decided to amp things up. I grabbed the Majestic Million dataset with the top million websites. I wrote a script that goes through the list, and, for each website, checks if it has an Atlassian-operated “status” subdomain. The script produced an HTML page with filtered results and “mailto:” links, to help me send out the emails. Side note: did you know the “mailto:” links can specify the message body?
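For the curious, here is a small sketch of how a “mailto:” link with a prefilled subject and body can be built. The function name and the sample text are made up for illustration and this is not the script I used; the subject and body query fields themselves are part of the mailto URI scheme (RFC 6068).

```python
from html import escape
from urllib.parse import quote


def mailto_link(address, subject, body):
    """Build a mailto: URL with a prefilled subject and body (RFC 6068)."""
    # Percent-encode everything, including spaces; line breaks become %0D%0A.
    query = "subject={}&body={}".format(
        quote(subject, safe=""),
        quote(body.replace("\n", "\r\n"), safe=""),
    )
    return "mailto:{}?{}".format(quote(address, safe="@"), query)


link = mailto_link(
    "privacy@somedomain.com",
    "Cookies on status.somedomain.com",
    "Hello,\n\nI have a question about cookies on your status page.",
)
# Escape "&" before placing the URL inside an HTML attribute.
print('<a href="{}">privacy@somedomain.com</a>'.format(escape(link)))
```

Clicking such a link opens the mail client with the message already filled in, which leaves only the review and the send button as manual steps.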

To find email addresses, I found the best way was to look at each website’s privacy policy and search for the “@” symbol. I found typical contact addresses were privacy@somedomain.com and dpo@somedomain.com (where “dpo” stands for Data Protection Officer). On July 26-27, one by one, I sent out emails to around 200 companies.

The wave of new support tickets from various companies worked. Atlassian started communicating back a plan to implement a cookie consent banner in Q1 2021. Later in August, they started saying “late September 2020”. I held off from sending more emails and waited to see what would happen in September.

On September 22, I received an update from Atlassian. Instead of implementing a cookie consent banner, they decided to drop the Page Analytics feature, which was responsible for the tracking cookie. From my point of view, this is the best possible outcome – no tracking cookie and no consent banner. Statuspage.io still has an option of adding a Google Analytics tag. So, there still is a way to track the unique visits for those who need it. 

Thank you, Atlassian / Statuspage.io, for implementing this change. I appreciate it! To my contact at Atlassian support, thank you for your patience. 

To everyone who also contacted Atlassian about the tracking cookies, thank you! It took a team effort, but it worked out in the end!

– Pēteris

Database Fail-over on May 20

On May 20, the primary database server of Healthchecks.io experienced packet loss and latency issues. To ensure normal operation of the service, the database was failed over to a hot standby. The fail-over went fine, aside from a minor issue with one fat-fingered firewall rule. Below are additional details about the network issue and the fail-over process.

The Network Issue

The symptoms: network latency and packet loss on the database host shoot up for 1-2 minutes, then everything goes back to normal. This had occurred a few times in the past weeks already, causing ping processing delays and subsequent “X is down” alerts each time.

This issue has been hard to troubleshoot because it seemed to happen at random times, and lasted only for a couple minutes each time. I was running a few monitoring tools continuously: Netdata agent including the fping plugin, OpsDash agent, and mtr in a loop logging to a text file. I could also inspect application logs for clues.

To illustrate the elevated latency, here’s how pinging 8.8.8.8 from the host looks normally:

$ ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=57 time=4.93 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=57 time=4.96 ms
64 bytes from 8.8.8.8: icmp_seq=3 ttl=57 time=4.94 ms
64 bytes from 8.8.8.8: icmp_seq=4 ttl=57 time=4.95 ms
(…)

And here’s the same command during the problematic 2-minute window:

$ ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=57 time=104 ms
64 bytes from 8.8.8.8: icmp_seq=3 ttl=57 time=104 ms
64 bytes from 8.8.8.8: icmp_seq=4 ttl=57 time=102 ms
64 bytes from 8.8.8.8: icmp_seq=5 ttl=57 time=104 ms
64 bytes from 8.8.8.8: icmp_seq=6 ttl=57 time=104 ms
64 bytes from 8.8.8.8: icmp_seq=7 ttl=57 time=104 ms
64 bytes from 8.8.8.8: icmp_seq=8 ttl=57 time=104 ms
64 bytes from 8.8.8.8: icmp_seq=9 ttl=57 time=100 ms
64 bytes from 8.8.8.8: icmp_seq=11 ttl=57 time=104 ms
64 bytes from 8.8.8.8: icmp_seq=12 ttl=57 time=104 ms
64 bytes from 8.8.8.8: icmp_seq=13 ttl=57 time=104 ms
64 bytes from 8.8.8.8: icmp_seq=14 ttl=57 time=104 ms
64 bytes from 8.8.8.8: icmp_seq=16 ttl=57 time=104 ms
^C
--- 8.8.8.8 ping statistics ---
16 packets transmitted, 13 received, 18% packet loss, time 14996ms
rtt min/avg/max/mdev = 100.713/104.111/104.729/1.207 ms

Impact

When the latency between app servers and the database suddenly goes from 0.5ms to 100ms, there are going to be issues. The most pressing issue was the processing of incoming pings. On app servers, each received ping is put in a queue. A single worker process (a goroutine, to be exact) takes items from the queue and inserts them in the database. This is a sequential process and the latency to the database puts a limit on how many pings can be processed per second. To illustrate, under normal operation the worker can process all incoming pings without a backlog building up:

May 22 13:06:00 www1 hchk[13236]: 47 pings/s
May 22 13:06:01 www1 hchk[13236]: 105 pings/s
May 22 13:06:02 www1 hchk[13236]: 328 pings/s
May 22 13:06:03 www1 hchk[13236]: 265 pings/s
May 22 13:06:04 www1 hchk[13236]: 108 pings/s
May 22 13:06:05 www1 hchk[13236]: 72 pings/s

During the high-latency period, throughput drops significantly and backlog starts to build up (timestamps in CEST):

May 20 17:39:49 www1 hchk[21015]: 5 pings/s, queued 47, dwell time 2373ms
May 20 17:39:50 www1 hchk[21015]: 5 pings/s, queued 68, dwell time 3358ms
May 20 17:39:51 www1 hchk[21015]: 6 pings/s, queued 104, dwell time 3696ms
May 20 17:39:52 www1 hchk[21015]: 3 pings/s, queued 180, dwell time 4624ms
May 20 17:39:53 www1 hchk[21015]: 5 pings/s, queued 241, dwell time 5268ms
May 20 17:39:54 www1 hchk[21015]: 3 pings/s, queued 268, dwell time 6116ms
May 20 17:39:55 www1 hchk[21015]: 4 pings/s, queued 292, dwell time 6975ms
May 20 17:39:56 www1 hchk[21015]: 5 pings/s, queued 325, dwell time 7819ms
May 20 17:39:57 www1 hchk[21015]: 3 pings/s, queued 340, dwell time 8445ms
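
These numbers are roughly what simple arithmetic predicts. A sketch, assuming each ping costs at least one full round trip to the database and the worker waits for it before taking the next item (the function name is illustrative):

```python
def max_pings_per_second(round_trip_ms):
    """Upper bound for a strictly sequential worker: one round trip per ping."""
    return 1000.0 / round_trip_ms


print(max_pings_per_second(0.5))  # normal latency: 2000.0
print(max_pings_per_second(100))  # during the latency spike: 10.0
```

A ceiling of about 10 pings/s, before accounting for packet loss and retransmissions, is consistent with the 3 to 6 pings/s seen in the log above.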

When the dwell time (the age of the oldest item in the queue) goes above 15 seconds, the app server declares itself unhealthy. When all app servers are unhealthy, clients start getting “HTTP 503 Service Unavailable” errors.
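The dwell-time check can be sketched as follows. The real worker is written in Go and its code is not shown here, so this Python version, including the class and method names, is only an illustration of the idea: remember when each item was queued and compare the age of the oldest one against the 15-second threshold.

```python
import time
from collections import deque

MAX_DWELL_SECONDS = 15  # threshold mentioned above


class PingQueue:
    """FIFO queue that remembers when each item was enqueued."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._items = deque()  # (enqueued_at, ping) pairs, oldest first

    def put(self, ping):
        self._items.append((self._clock(), ping))

    def get(self):
        return self._items.popleft()[1]

    def dwell_time(self):
        """Age in seconds of the oldest queued item, 0 if the queue is empty."""
        if not self._items:
            return 0.0
        return self._clock() - self._items[0][0]

    def is_healthy(self):
        return self.dwell_time() <= MAX_DWELL_SECONDS
```

A health-check endpoint could then return HTTP 200 or 503 depending on is_healthy(), which lets a load balancer take the server out of rotation while its backlog drains.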

Even if the network hiccup is brief, and no client request gets rejected, there is still a negative consequence: there is a delay between the client making an HTTP request and the ping getting registered in the database. For a check with tight timing, this additional delay can push it over the limit and cause a false alert.

Fail-over

On May 20, after another latency spike, I decided to fail over the database to a hot standby. I didn’t know what was ultimately causing the network issues, but the hope was that moving to a different physical host would make things better.

The fail-over procedure is simple:

  • make sure there is no replication lag
  • stop the primary
  • promote the hot standby

The application servers didn’t need to be updated or restarted – they knew IP addresses of both database hosts, and could automatically cycle between them until they found one that was alive and was accepting writes. More details on this in PostgreSQL documentation: Specifying Multiple Hosts and target_session_attrs.
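As an illustration of the libpq feature, here is a sketch that builds a keyword/value connection string listing two hosts. The addresses, database name, and user are hypothetical, and the app servers’ actual configuration is not shown in this post; the host, port, and target_session_attrs parameters are the ones described in the PostgreSQL documentation.

```python
def multi_host_dsn(hosts, dbname, user, port=5432):
    """Build a libpq keyword/value connection string listing several hosts."""
    return " ".join([
        "host=" + ",".join(hosts),
        "port=" + ",".join(str(port) for _ in hosts),
        "dbname=" + dbname,
        "user=" + user,
        # Skip servers that do not accept writes (e.g. a standby in recovery).
        "target_session_attrs=read-write",
    ])


# Hypothetical addresses for the primary and the hot standby.
dsn = multi_host_dsn(["10.0.0.1", "10.0.0.2"], dbname="hc", user="hc")
print(dsn)
# host=10.0.0.1,10.0.0.2 port=5432,5432 dbname=hc user=hc target_session_attrs=read-write
```

With a string like this, a libpq-based client tries the hosts in order and keeps the first connection that accepts writes, so promoting the standby requires no client-side change. Values containing spaces would need quoting, which this sketch does not handle.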

The only issue I ran into during the fail-over was with a firewall rule denying one of the app servers access: I had mistyped port “5432” as “4322”, which was an easy fix.

Preparing a New Hot Standby

The service was back to normal operation, but the database host was now lacking a hot standby. Luckily, I had a fresh system (sporting Ryzen 3700X) already ordered and waiting on the sidelines. I used provisioning scripts to set up a hot standby on it, but checking and double-checking everything still took close to an hour.

Contacting Hetzner Support

I tried contacting Hetzner Support about the packet loss and latency jump. I provided mtr reports (which they always ask for), and ping logs showing latency. After some back-and-forth, the response I ultimately got was “There is and was no general network issue known. Also we do not see any packet loss from here.”

To be fair, this could have been an issue with the hardware or software on the physical host – I can’t say for sure.

Workarounds: Parallel Processing, Batching, Dedicated Queue?

As mentioned earlier, each app server processes pings one-by-one, over a single database connection. When a single operation blocks, the whole queue gets stuck. What if we used a connection pool? Or processed pings in batches – say, 100 at a time? Or added a dedicated queue component to the mix?

These are all interesting ideas worthy of consideration. I’ve already done some work on the batch processing idea and have promising results. That being said, packet loss would be problematic even with these workarounds. It would need to be solved anyway.
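The batching idea can be sketched like this. It is a Python illustration rather than the actual Go worker, and insert_many stands for a hypothetical function that writes a whole batch to the database in a single round trip.

```python
from collections import deque


def take_batch(queue, max_items=100):
    """Remove and return up to max_items pings from the front of the queue."""
    batch = []
    while queue and len(batch) < max_items:
        batch.append(queue.popleft())
    return batch


def process(queue, insert_many, max_items=100):
    """Drain the queue, paying one database round trip per batch."""
    round_trips = 0
    while queue:
        insert_many(take_batch(queue, max_items))
        round_trips += 1
    return round_trips


# Stand-in for the database: collect the "inserted" pings in a list.
stored = []
trips = process(deque(range(250)), stored.extend)
print(trips)  # 3 round trips instead of 250
```

With a 100ms round trip, a backlog of 250 pings would clear in roughly 0.3 seconds instead of 25, provided the batch insert itself is not much slower than a single insert.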

As always, I will continue to look for opportunities to make the service more robust.

Pēteris,
Healthchecks.io

Healthchecks.io Status Page Facelift

The Healthchecks.io system status page at status.healthchecks.io recently received a revamp. Here are my notes on the new version.

First up, the components section shows the current and historic status of components:

Dashboard shows the status of the main website, healthchecks.io. “Operational” state here means the website responds to HTTP requests, and has a working connection to the PostgreSQL database. Checkly, an external uptime monitoring service, monitors the website and automatically updates the component’s status via Statuspage.io API. Checkly has powerful and flexible webhook notifications which makes this possible.

Ping API shows the status of the ping endpoint, hc-ping.com. “Operational” state means hc-ping.com is responding to HTTP requests and is inserting pings in the database with no excessive delay. Although the ping endpoint and the main website run on the same physical servers, they use different software. So it makes sense to monitor them separately. The status of this component is updated automatically by Checkly, same as the dashboard.

Notification Sender shows the status of the background process sending out notifications. Status updates of this component are not automated yet.


Below the components is the “System Metrics” section with four metrics.

  • Processed pings is the number of valid ping requests (valid UUID, not rate limited) processed per second.
  • Queued incoming pings is the number of pings that have been received but not yet inserted in the database. A spike suggests either a database problem, or a connectivity issue inside the data center.
  • Notifications sent is the number of notifications sent per minute.
  • Queued outgoing notifications is the number of scheduled notifications waiting to be sent out. A growing number means either the notification sender is not working, or it cannot keep up.

I used the following criteria for picking the metrics to show:

  • The metric should tell something useful about the system.
  • The metric should be simple to explain. For example, I internally track a few different “queue dwell time” metrics. They are useful, but it would be hard to explain what they mean precisely, and how to interpret them.
  • The metric should be computationally inexpensive to measure. It should not require a heavy database query.

I considered several ways of measuring, aggregating and submitting the metrics. I ultimately went with:

  • Each web server exposes a metrics endpoint that an external system can scrape. Here’s a git commit where I added one of the endpoints.
  • On an external host, a script runs once per minute (via cron of course). It scrapes metrics data from each web server, then processes and submits it to Statuspage.io using their Metrics API. The script is less than 100 lines long.
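
The aggregation part of such a script can be sketched as follows. The payload format (a flat JSON object of counters per web server) and the metric names are assumptions made for this example, and the submission to Statuspage.io is left out.

```python
import json


def aggregate(payloads):
    """Sum per-server counters into one set of totals.

    payloads: JSON strings, one per web server, each a flat object
    mapping a metric name to a number (an assumed format).
    """
    totals = {}
    for raw in payloads:
        for name, value in json.loads(raw).items():
            totals[name] = totals.get(name, 0) + value
    return totals


# Two made-up responses standing in for two scraped web servers.
samples = ['{"pings": 120, "queued": 0}', '{"pings": 95, "queued": 3}']
print(aggregate(samples))  # {'pings': 215, 'queued': 3}
```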

If you notice gaps in metrics graphs, it could be because the external metrics collector has failed. There are ways to make the metrics collection more robust, but the current simple setup seems to work fine for now.


The final feature in the status page is Incidents. Currently I have not automated incident creation in any way. The plan is to manually open an incident when I become aware of it, and backdate it as makes sense. To test out the Incidents feature, I backfilled a couple past incidents. For example, Delayed notifications on February 7.

And that is all for now. I hope the new status page does not need to be used often! I will also keep posting outage notifications to @healthchecks_io on Twitter.

Happy monitoring and meta-monitoring,
Pēteris,
Healthchecks.io

Incident Report – 7 February 2020

On February 7, Healthchecks.io experienced an issue with sending notifications. An invalid cron expression slipped into the system, which caused the notification sending jobs to crash and restart in a loop. Timeline (all times are in UTC, and from February 7):

  • 0:37: a check with a bad cron schedule gets created via API
  • 0:41: the check receives its first ping
  • 0:42: one minute later, the notification senders go into a crash-restart loop
  • 1:02: external monitoring alerts go out
  • 7:00: I wake up and find out about the outage
  • 7:28: The invalid cron expression is located and fixed, notification sending resumes
  • 7:52: I post a tweet about the outage
  • 10:00: Deployed mitigations for the “sendalerts” process repeatedly crashing, and stricter cron expression validity checks

This outage started in the middle of the night (2:42 AM local time), so it took several hours until I found out about it and could jump on fixing it. During this time, Healthchecks.io was not sending out any notifications (all types: emails, webhooks, Slack alerts, …). On the positive side, the web dashboard, the ping endpoints, the API and the badges were working normally.

After fixing the bad cron schedule, the notification senders resumed work and quickly went through the backlog of unsent notifications:

When notification sending resumed, Healthchecks.io sent out notifications for all checks that had flipped their state once (from “up” to “down”, or from “down” to “up”) during the outage. Unfortunately, it would have missed cases where a check flips twice (for example, “up” → “down” → “up”) during the outage window. If a check went down but came right back up during the outage window, Healthchecks.io missed it and didn’t send a notification.

The Root Cause

The “sendalerts” crash loop was tripping on the following cron schedule: “0 0 31 2 *”. Or, in human words, “at midnight of every February 31st”. The notification sender was crashing while calculating the next expected ping time for this schedule.

The Fix

  1. To get around the immediate crashing problem, I manually edited the problematic cron schedule.
  2. In the “sendalerts” management command I added a mitigation for repeatedly crashing on the exact same check. With the mitigation, “sendalerts” postpones the problematic check for 1 hour, so it can process other checks in the meantime.
  3. I added an extra validation step for cron expressions. Healthchecks now makes sure it can calculate a valid “next ping time” for a cron expression before allowing it into the system.
  4. When the outage started, I received monitoring alerts from three different services. All three alerts went to email, and I didn’t notice them until the morning. I’ve now updated notification settings to also receive Pushover notifications with the “Emergency” priority. These notifications override the phone’s Do Not Disturb settings and repeat until acknowledged.
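
To show why “0 0 31 2 *” is a problem, here is a simplified sketch of a check that rejects schedules whose day and month fields can never match a real date. It is narrower than the validation described in item 3, which calculates an actual “next ping time”. It only understands “*”, plain numbers, lists, and ranges in the day and month fields, and it ignores the weekday field entirely. The function names are made up for this example.

```python
import calendar


def expand(field, low, high):
    """Expand a cron field made of '*', numbers, lists and ranges."""
    values = set()
    for part in field.split(","):
        if part == "*":
            values.update(range(low, high + 1))
        elif "-" in part:
            start, end = part.split("-")
            values.update(range(int(start), int(end) + 1))
        else:
            values.add(int(part))
    return values


def has_possible_date(expression):
    """Return True if some calendar date matches the day and month fields."""
    minute, hour, day, month, weekday = expression.split()
    for m in expand(month, 1, 12):
        # 2024 is a leap year, so February 29 counts as possible.
        days_in_month = calendar.monthrange(2024, m)[1]
        if any(1 <= d <= days_in_month for d in expand(day, 1, 31)):
            return True
    return False


print(has_possible_date("0 0 31 2 *"))  # False: February never has 31 days
print(has_possible_date("0 0 29 2 *"))  # True: valid in leap years
```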

I apologize to all Healthchecks.io users for any inconvenience caused.

– Pēteris Caune,
Healthchecks.io