Outage Postmortem — 20 August 2016

On August 20, the healthchecks.io service experienced a 24-hour outage. During this time, it was unable to process any incoming pings. This caused a large number of checks to go into the “down” state. This, in turn, caused a large number of “Your check has gone down” alerts to be sent out.

Summary

The direct cause was an unattended upgrade which restarted the PostgreSQL database process. The database restart alone should have only caused a brief downtime. Unfortunately, one of the database clients was not prepared to deal with unexpectedly closed database connections. It continuously tried to use the closed connection–and failed every time.

Meanwhile, the single maintainer of the healthchecks.io website (which is me, Pēteris) was out and about, unreachable by email, and unaware of any of the incoming monitoring alerts or Twitter messages.

Technical Details

The healthchecks.io service consists of the following components, each running on a separate virtual machine hosted at DigitalOcean, running Ubuntu 16.04:

  • a server running the PostgreSQL database
  • a server running the ping listener service. This is an NGINX web server reverse-proxying a small, custom Go program. Let’s call the program “hchk.go”. The hchk.io address points to this server.
  • a server running the Django website. This is a Caddy web server reverse-proxying a uwsgi process. The healthchecks.io address points to this server.
  • a server running the Django management command for sending out alerts.

Out of the box, an Ubuntu 16.04 server comes with unattended security updates enabled. Late August 19, the database server installed a security update for PostgreSQL. It then proceeded to restart the database process. During the restart, any open database connections are of course closed.

After the database restart, the website and the alerting service continued to work normally. They use the Django framework and the psycopg2 database driver, which together take care of opening new connections as necessary.

The hchk.go program, however, is written in a different programming language (Go), and uses a different database driver (pgx). The program was not tested in the scenario where the database closes a connection. Its error handling for failed SQL operations was effectively a “write the error to a log file and march on”. So, when the database restarted, hchk.go was stuck with a dead connection, and returned HTTP 400 “Bad Request” messages to each and every incoming ping. A simple restart of the hchk.go process fixed the immediate problem.
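To illustrate the kind of handling that was missing, here is a minimal sketch in Python (not the actual Go fix). It assumes a hypothetical driver that raises a ConnectionClosed error when the connection is dead; the class and method names are made up for the example. The idea is simply to drop the dead connection and open a fresh one instead of reusing it forever.

```python
class ConnectionClosed(Exception):
    """Stand-in for whatever error a real driver raises on a dead connection."""


class ReconnectingClient:
    """Wraps a connection factory and reconnects when a query fails."""

    def __init__(self, connect, max_attempts=3):
        self._connect = connect            # callable returning a fresh connection
        self._max_attempts = max_attempts  # assumed to be at least 1
        self._conn = None

    def execute(self, query, params=()):
        last_error = None
        for _ in range(self._max_attempts):
            try:
                if self._conn is None:
                    self._conn = self._connect()
                return self._conn.execute(query, params)
            except ConnectionClosed as exc:
                # Do not keep using the dead connection: drop it so that
                # the next attempt opens a new one.
                last_error = exc
                self._conn = None
        # Every attempt failed; let the caller decide what to report.
        raise last_error
```

With something like this in place, a database restart costs one failed attempt per client rather than a failure on every request until the process is restarted.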

To make it clear, there is nothing wrong with the Go programming language or the database driver used. They were just used incorrectly by the author of the hchk.go program (me). The reason for writing a custom Go program in the first place is performance. Before implementing the hchk.go program, the server’s CPU usage was starting to get dominated by the incoming pings. CPU was mostly spent running Python code. After the switch, CPU usage dropped significantly, and currently CPU is mostly spent on TLS handshakes.

Monitoring

The healthchecks.io website has a few provisions for monitoring itself:

The monitoring worked as expected: within 5 minutes after the database restart there were alerts sent out to me. Unfortunately no-one was around for the next 24 hours to see the alerts. Had I checked my email right after the issue started, it would still have taken about 300 kilometers and 4+ hours to get to a PC with the necessary SSH keys.

Next Steps

  • Disable unattended upgrades. Done on the database server. Will do on the other servers during next upgrade cycle.
  • Fix the hchk.go application to handle closed or misbehaving database connections. Done.
  • Keep mobile roaming data enabled next time I go on a trip abroad, and read emails.
  • Set up a small laptop with development and deployment tools, and take it on the longer trips.

In closing: I apologize to all healthchecks.io users for any inconvenience caused. For a monitoring service, any downtime is unacceptable.

Being run on a shoestring budget, healthchecks.io can only offer a best effort availability. I, however, welcome the challenge and will aim to make the best of the resources available!

Pēteris Caune,
healthchecks.io

DIY SSL Certificate Expiry Monitoring

In this post, we will set up simple SSL certificate expiry monitoring, using cron, the ssl-cert-check script and a fail-safe provided by an external service, healthchecks.io.

Let’s say you administer a website, namely example.com. You want to be sure that the SSL certificate of example.com is always renewed in time. Like seldom-used passwords, the annual or bi-annual certificate expiry date is easy to forget about until it is too late. You want to set up an automated system that will remind you 30 days before the certificate expires. Finally, you want something that will “guard the guards themselves”. If your SSL monitoring setup breaks down, you want to be notified about that as well!

The main building block will be the ssl-cert-check script. On Ubuntu and Debian you can install it with a simple

apt-get install ssl-cert-check

Here are its options:

$ ssl-cert-check 
Usage: /usr/bin/ssl-cert-check [ -e email address ] [ -x days ] [-q] [-a] [-b] [-h] [-i] [-n] [-v] { [ -s common_name ] && [ -p port] } || { [ -f cert_file ] } || { [ -c certificate file ] }

-a : Send a warning message through E-mail
-b : Will not print header
-c cert file : Print the expiration date for the PEM or PKCS12 formatted certificate in cert file
-e E-mail address : E-mail address to send expiration notices
-f cert file : File with a list of FQDNs and ports
-h : Print this screen
-i : Print the issuer of the certificate
-k password : PKCS12 file password
-n : Run as a Nagios plugin
-p port : Port to connect to (interactive mode)
-s commmon name : Server to connect to (interactive mode)
-t type : Specify the certificate type
-q : Don’t print anything on the console
-v : Specify a specific protocol version to use (tls, ssl2, ssl3)
-V : Only print validation data
-x days : Certificate expiration interval (eg. if cert_date < days)

Let’s run ssl-cert-check on example.com:

$ ssl-cert-check -s example.com -p 443

Host                             Status       Expires      Days
-------------------------------- ------------ ------------ ----
example.com:443                  Valid        Nov 28 2018  928

Great, the status field says “Valid”. Now, let’s run it using a high certificate expiration interval (-x parameter):

$ ssl-cert-check -s example.com -p 443 -x 1000

Host                             Status       Expires      Days
-------------------------------- ------------ ------------ ----
example.com:443                  Expiring     Nov 28 2018  928

Perfect, the status is now “Expiring”. Now, the plan is to run this command regularly and get alerted as soon as the status is anything other than “Valid”.

The ssl-cert-check script has a few flags that will be useful inside a cron task: -q suppresses console output and -n sets the exit code, depending on the status.

$ ssl-cert-check -s example.com -p 443 -x 30 -n -q
$ echo $?
# prints 0

$ ssl-cert-check -s example.com -p 443 -x 1000 -n -q
$ echo $?
# prints 1
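To make the -x threshold and the exit code concrete, here is a rough Python sketch of the comparison. It is not how ssl-cert-check is implemented; the function names, the "Expired" label, the 12:00 GMT expiry time and the run date are all assumptions picked so the numbers line up with the output above.

```python
import ssl
from datetime import datetime, timezone

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after, now):
    """Whole days between `now` and a certificate's notAfter timestamp."""
    expiry_ts = ssl.cert_time_to_seconds(not_after)
    return int((expiry_ts - now.timestamp()) // SECONDS_PER_DAY)


def classify(days_left, warn_days):
    """Map the days left to a status, given the -x style threshold."""
    if days_left < 0:
        return "Expired"
    if days_left < warn_days:
        return "Expiring"
    return "Valid"


def exit_code(status):
    """0 when valid, non-zero otherwise (simplified to a single value)."""
    return 0 if status == "Valid" else 1


# Assumed run date and expiry time of day, for illustration only.
now = datetime(2016, 5, 14, 12, 0, tzinfo=timezone.utc)
days = days_until_expiry("Nov 28 12:00:00 2018 GMT", now)
print(days, classify(days, 30), classify(days, 1000))
```

The only thing the cron job will care about is the last step: zero means "all good", anything else means "do not report success".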

Next, let’s set up a check on healthchecks.io. Log in, and add a new check. Give it a descriptive name like:

A fresh healthchecks.io account for SSL certificate monitoring

Here’s how this will work: the check’s URL will need to be requested at least daily to keep it in the green “up” state. As soon as it is not requested for more than a day, its state will go to a red “down” and you will receive an email notification. If you prefer to be notified differently, in the “Integrations” section, you can set up Slack, HipChat, PagerDuty, VictorOps and Pushover notifications. You can add more email addresses to be notified, and you can also integrate with your notification systems using webhooks.

Now, with the check’s URL handy, we are ready to put together a cron command:

ssl-cert-check -s example.com -p 443 -x 30 -n -q && curl -fsS --retry 3 https://hchk.io/your-uuid-here > /dev/null

Let’s go over what this command does. First, we execute the ssl-cert-check command. If the certificate is valid for at least 30 more days, it exits with exit code 0. Otherwise, it exits with a non-zero exit code.

Next, we chain the curl call using the && operator. When two commands are delimited with &&, the second command only runs if the first command succeeds. In our case, if the certificate is valid, the curl command will run. If the certificate is expiring or expired, the curl command will not run.

The curl command has a few flags to suppress console output, and to retry transient HTTP failures:

  • -f, --fail Makes curl treat HTTP error responses (status 400 and above) as failures
  • -s, --silent Silent or quiet mode. Don’t show progress meter or error messages
  • -S, --show-error When used with -s it makes curl show an error message if it fails.
  • --retry <num> If a transient error is returned when curl tries to perform a transfer, it will retry this several times before giving up. Setting the number to 0 makes curl do no retries (which is the default). Transient error means either: a timeout, an FTP 4xx response code or an HTTP 5xx response code.

Finally, we redirect curl’s output to /dev/null. If cron runs a command and the command outputs anything to the console, cron will email the output, and here we do not want that.

You can now see how this will all work together: if the ssl-cert-check command returns success, the curl command will run and keep the check in the green “up” state. If the ssl-cert-check command returns a failure, the curl command will not be run and the check will go down. And if something happens to the whole machine running cron, the check will also go down. When the check goes down for either reason, healthchecks.io will send you an alert.
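If you would rather run this logic from a script than from a shell one-liner, the same "ping only on success" idea can be sketched in Python. This is an illustration under assumptions: the function names are made up, and unlike curl's --retry this ping helper retries on any error, not just transient ones.

```python
import subprocess
import urllib.request


def ping(url, retries=3, timeout=10):
    """Request the check URL, retrying a few times on any failure."""
    for _ in range(retries + 1):
        try:
            # urlopen raises an error for HTTP 4xx/5xx responses,
            # which is roughly what curl's -f flag does.
            with urllib.request.urlopen(url, timeout=timeout):
                return True
        except OSError:
            continue
    return False


def check_then_ping(check_cmd, send_ping):
    """Run check_cmd; call send_ping only if it exits with code 0."""
    result = subprocess.run(check_cmd)
    if result.returncode != 0:
        # Same effect as `&&`: a failed check means no ping is sent,
        # so the check on healthchecks.io eventually goes down.
        return False
    send_ping()
    return True
```

Usage would look something like check_then_ping(["ssl-cert-check", "-s", "example.com", "-p", "443", "-x", "30", "-n", "-q"], lambda: ping("https://hchk.io/your-uuid-here")).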

Now it is time to add this command to cron. Pick or launch an Ubuntu or Debian machine you expect to be up and running for a long time. If you have a machine that’s dedicated to doing backups and similar background jobs, that is perfect. Log in as an unprivileged user and use the “crontab -e” command to edit the user’s crontab:

$ crontab -e

In the crontab editor, add this line:

20 7 * * * ssl-cert-check -s example.com -p 443 -x 30 -n -q && curl -fsS --retry 3 https://hchk.io/your-uuid-here > /dev/null

Save the file and you are done. From now on, each day at 7:20 your machine will run an SSL expiry check and then notify healthchecks.io. If the certificate expires in less than 30 days, or if the machine stops working, healthchecks.io will send you a notification.

This is an example of how you can set up simple, reasonably robust monitoring tasks. If this setup seems too hacky for your taste, if you have no appropriate machine to run cron tasks on, or if you are looking for more than just a certificate expiry check, it makes good sense to look into self-hosted or SaaS services for SSL monitoring.

Deploying a Django App with No Downtime

When healthchecks.io started to receive more than 1 request per second, it became clear I could not just go on carelessly restarting web servers after code deploys. For a monitoring service, it would be bad form to miss even a few HTTP requests. And, going forward, if the server gets busier, the problem only becomes bigger.

To give a quick overview of what I’m working with, the app is a relatively straightforward Django app, served by gunicorn behind nginx. Data lives in a PostgreSQL database. The gunicorn process and an additional background job are both managed by supervisor. It’s hosted on a single $20 DigitalOcean droplet.


Aside: With regard to technology choices, the guiding principle I’ve been following is to keep the stack as simple as is feasible for as long as possible. Adding things, like load balancers, database replication, key value store, message queue and so on, would each have certain benefits. Then on the other hand, there would also be more stuff to be managed, monitored, and kept backed up. Also, for someone new to the project, it would take more time to figure out the “ins and outs” of the system and set up everything from scratch. I see it as a nifty challenge to stay with the simple, no-frills setup, while also not compromising performance or features.


The deployment mechanism I’ve used thus far is a Fabric script plus configuration templates for supervisor and nginx. Each time I run “fab deploy” from my workstation, the Fabric script does the following on the remote host:

  • sets up a new directory for the new deployment. Let’s refer to this directory as $TARGET.
  • sets up a python3 virtualenv in $TARGET/venv
  • fetches the latest snapshot of code from GitHub into $TARGET. It is convenient to use GitHub’s Subversion interface for this and run an “svn export” command. It produces just the source files without any version control metadata–exactly what’s needed.
  • installs dependencies listed in the requirements file. These get installed into the new virtualenv and don’t affect the live application. Downloading and building the dependencies takes up to a minute.
  • runs Django management commands to collect static files, run database migrations etc.
  • rewrites the supervisor configuration file to run gunicorn from the new virtual environment
  • updates nginx configuration, in case I’ve changed anything in the nginx configuration template
  • runs “supervisorctl reload” and “/etc/init.d/nginx restart”. At this point the web application becomes unavailable and remains unavailable until supervisor starts back up, launches the gunicorn process, and the Django code initializes. This usually takes 5 to 10 seconds, and nginx would typically return “502 Bad Gateway” responses during this time.
  • All done!

Here’s how the relevant part of the Fabric script looks. The virtualenv context manager seen below is from the excellent fabtools library.

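import datetime
import os
import time  # used by the improved switch() further below

from fabric.api import cd, put, run, settings, sudo
from fabric.contrib.files import upload_template
from fabtools.python import virtualenv


def deploy():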
    """ Checks out code, prepares venv, runs management commands,
    updates supervisor and nginx configuration. """

    now = datetime.datetime.today()
    now_string = now.strftime("%Y%m%d-%H%M%S")
    project_dir = "/home/hc/webapps/hc-%s" % now_string
    venv_dir = os.path.join(project_dir, "venv")

    svn_url = "https://github.com/healthchecks/healthchecks/trunk"
    run("svn export %s %s" % (svn_url, project_dir))

    with cd(project_dir):
        run("virtualenv --python=python3 --system-site-packages venv")
        # local_settings.py is where things like access keys go
        put("local_settings.py", ".")
        put("newrelic.ini", ".")

        with virtualenv(venv_dir):
            run("pip install -U gunicorn raven newrelic")
            run("pip install -r requirements.txt")
            run("python manage.py collectstatic --noinput")
            run("python manage.py compress")

            with settings(user="hc"):
                run("python manage.py migrate")
                run("python manage.py ensuretriggers")
                run("python manage.py clearsessions")

    switch(project_dir)

def switch(project_dir):
    # Supervisor
    upload_template("supervisor/hc.conf.tmpl",
                    "/etc/supervisor/conf.d/hc.conf",
                    context=locals(),
                    backup=False,
                    use_sudo=True)

    upload_template("supervisor/hc_sendalerts.conf.tmpl",
                    "/etc/supervisor/conf.d/hc_sendalerts.conf",
                    context=locals(),
                    backup=False,
                    use_sudo=True)

    # Nginx
    upload_template("nginx/hc.conf.tmpl",
                    "/etc/nginx/sites-enabled/hc.conf",
                    context=locals(),
                    backup=False,
                    use_sudo=True)

    sudo("supervisorctl reload")
    sudo("/etc/init.d/nginx restart")

Now, how to eliminate the downtime during the last steps of each deploy?

Let’s set some constraints: no load balancer (for now anyway). Everything runs off a single box, and even a single non-200 response is undesirable. And, baby steps: I will consider the simple (and common) case when there are no database migrations to be applied or they are backwards-compatible: the old version of the app keeps working acceptably after the migrations are applied.

The first idea I looked into was based on the observation that availability is more important for some parts of the app than others. Specifically, the API part of the app listens for pings from the monitored client systems, and the frontend part serves pages to normal website visitors. While it would be embarrassing to show error pages to human visitors, not missing any pings is actually more important. A missed ping can lead to a false alert being sent sometime later. That’s even more embarrassing!

I considered and prototyped listening to pings using Amazon API Gateway. It would put ping messages in an Amazon SQS queue, which the Django app could consume at its leisure. This would be a relatively simple way to improve availability and scalability by quite a lot at the cost of somewhat increased complexity and a new external dependency. I might look into this again in the future.

Next idea: separate the “listen to pings” functionality from the rest of the Django app. The ping listener logic is very simple and, ultimately, amounts to two SQL operations: one update and one insert. It could be easy enough to rewrite this part, perhaps using one of the Python microframeworks, or maybe using a language other than Python, or maybe even handle it from nginx itself, using the ngx_postgres module. For a little amusement, here’s the nginx configuration fragment which, basically, works as-is (please forgive the funny looking regular expression):

location ~ ^/(\w\w\w\w\w\w\w\w-\w\w\w\w-\w\w\w\w-\w\w\w\w-\w\w\w\w\w\w\w\w\w\w\w\w)/?$ {
    add_header Content-Type text/plain;

    postgres_pass   database;
    postgres_output value;

    postgres_escape $ip $remote_addr;
    postgres_escape $agent =$http_user_agent;
    postgres_escape $body =$request_body;

    postgres_query "
        WITH t AS (
            UPDATE api_check
            SET last_ping=now()
            WHERE code='$1'
            RETURNING id, last_ping
        )
        INSERT INTO api_ping
            (created, remote_addr, method, ua, body, owner_id, scheme)
        SELECT
            last_ping, $ip, '$request_method', $agent, $body, id, '$scheme'
        FROM t
        RETURNING 'OK'
    ";

    postgres_rewrite no_changes 400;
}

Here’s what’s going on: when the client requests a URL of a certain format, the server runs a PostgreSQL query and returns either HTTP code 200 or HTTP code 400. This is also a performance win, because the request doesn’t have to travel through the hoops of gunicorn, Django and psycopg2. As long as the database is available, nginx can handle the ping requests, even if the Django application is not running for any reason.
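For comparison, here is roughly the same "one update, one insert" logic as a small Python function. It is a sketch only: it uses an in-memory SQLite database as a stand-in for PostgreSQL so it can run anywhere, the schema is cut down to the columns that appear in the query above, and record_ping is a made-up name.

```python
import sqlite3
from datetime import datetime, timezone

# Cut-down schema, just enough for the example (an assumption).
SCHEMA = """
CREATE TABLE api_check (id INTEGER PRIMARY KEY, code TEXT UNIQUE, last_ping TEXT);
CREATE TABLE api_ping (
    created TEXT, remote_addr TEXT, method TEXT, ua TEXT,
    body TEXT, owner_id INTEGER, scheme TEXT
);
"""


def record_ping(conn, code, remote_addr, method, ua, body, scheme):
    """Record a ping; return 200 if the check exists, 400 otherwise."""
    now = datetime.now(timezone.utc).isoformat()
    with conn:  # both statements run in a single transaction
        cur = conn.execute(
            "UPDATE api_check SET last_ping = ? WHERE code = ?", (now, code))
        if cur.rowcount == 0:
            # No matching check: the equivalent of "no_changes 400".
            return 400
        conn.execute(
            "INSERT INTO api_ping "
            "(created, remote_addr, method, ua, body, owner_id, scheme) "
            "SELECT last_ping, ?, ?, ?, ?, id, ? FROM api_check WHERE code = ?",
            (remote_addr, method, ua, body, scheme, code))
    return 200


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO api_check (code) VALUES ('demo-code')")
print(record_ping(conn, "demo-code", "127.0.0.1", "GET", "curl", "", "https"))
print(record_ping(conn, "unknown", "127.0.0.1", "GET", "curl", "", "https"))
```

Parameter binding here plays the role that postgres_escape plays in the nginx version.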

The not so great thing with this approach is that it’s “tricky” and adds to the number of things that the developer and systems administrator need to know. For example, when the database schema changes, the SQL query above might need to be updated and tested as well. Getting the ngx_postgres extension set up isn’t a simple matter of “apt-get install” either.

Thinking more about it, the main goal of zero downtime can also be achieved by just carefully orchestrating process restarts and reloads.

My deployment script was using “/etc/init.d/nginx restart” because I didn’t know any better. As I learned, it can be replaced with “/etc/init.d/nginx reload” which handles things gracefully:

Run service nginx reload or /etc/init.d/nginx reload

It will do a hot reload of the configuration without downtime. If you have pending requests, then there will be lingering nginx processes that will handle those connections before it dies, so it’s an extremely graceful way to reload configs.

– “Nginx config reload without downtime” on ServerFault

Similarly, my deployment script was using “supervisorctl reload” which stops all managed services, re-reads configuration, and starts all services. Instead, “supervisorctl update” can be used to start, stop and restart the changed tasks as necessary.

Now, here’s what “fab deploy” can do:

  • set up a new virtual environment as before
  • create a supervisor task with a unique name (“hc_timestamp”)
  • start the new gunicorn process alongside the running one. nginx talks to gunicorn processes using UNIX sockets, and each process uses a separate, again timestamped, socket file
  • wait a little–then verify that the new gunicorn process has started up and is serving responses
  • update nginx configuration to point to the new socket file and reload nginx
  • stop the old gunicorn process
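The “wait a little, then verify” step could also be done by polling the new socket until it answers, rather than sleeping for a fixed time. Here is a hedged sketch of that idea in plain Python. The function names are hypothetical, it would have to run on the server itself (where the socket file lives), and it is not what my Fabric script does.

```python
import socket
import time


def http_get_over_unix_socket(sock_path, path, host, timeout=2.0):
    """Send a bare HTTP/1.0 GET over a UNIX socket, return the raw response."""
    request = "GET %s HTTP/1.0\r\nHost: %s\r\n\r\n" % (path, host)
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        s.connect(sock_path)
        s.sendall(request.encode("ascii"))
        chunks = []
        while True:
            data = s.recv(4096)
            if not data:  # server closed the connection: response complete
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", "replace")


def wait_until_serving(sock_path, marker, attempts=10, delay=1.0,
                       path="/about/", host="healthchecks.io"):
    """Poll until the response contains `marker`; False if it never does."""
    for _ in range(attempts):
        try:
            if marker in http_get_over_unix_socket(sock_path, path, host):
                return True
        except OSError:
            # Socket file missing, connection refused, or timed out:
            # the new process is probably still starting up.
            pass
        time.sleep(delay)
    return False
```

A deploy would then continue only if wait_until_serving(...) returns True, and abort otherwise.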

Here’s the improved part of the Fabric script which juggles supervisor jobs:

def switch(tag, project_dir):
    # Supervisor
    supervisor_conf_path = "/etc/supervisor/conf.d/hc_%s.conf" % tag
    upload_template("supervisor/hc.conf.tmpl",
                    supervisor_conf_path,
                    context=locals(),
                    backup=False,
                    use_sudo=True)

    upload_template("supervisor/hc_sendalerts.conf.tmpl",
                    "/etc/supervisor/conf.d/hc_sendalerts.conf",
                    context=locals(),
                    backup=False,
                    use_sudo=True)

    # Starts up gunicorn from the new virtualenv
    sudo("supervisorctl update")

    # Give it some time to start up
    time.sleep(5)

    # Let's check the new server is nominally working
    # gunicorn listens on UNIX socket so this is a bit contrived:
    l = ("GET /about/ HTTP/1.0\\r\\n"
         "Host: healthchecks.io\\r\\n"
         "\\r\\n")

    cmd = 'echo -e "%s" | nc -U /tmp/hc-%s.sock' % (l, tag)
    # Look for known string in response. If it's not found, something
    # is wrong with the new deployment and we abort
    assert "Monkey See Monkey Do" in run(cmd, quiet=True)

    # nginx
    upload_template("nginx/hc.conf.tmpl",
                    "/etc/nginx/sites-enabled/hc.conf",
                    context=locals(),
                    backup=False,
                    use_sudo=True)

    sudo("/etc/init.d/nginx reload")

    # should be live now - remove supervisor conf for previous versions
    s = sudo("for i in /etc/supervisor/conf.d/*.conf; do echo $i; done")
    for line in s.split("\n"):
        line = line.strip()
        if line == supervisor_conf_path:
            continue
        if line.startswith("/etc/supervisor/conf.d/hc_2"):
            sudo("rm %s" % line)

    # This stops gunicorn processes
    sudo("supervisorctl update")

With this, nginx is always serving requests, and is talking to a live gunicorn process at all times. To verify this in practice, I wrote a quick script that requests a particular URL again and again in an infinite loop. As soon as it hits a non-200 response, it would print out a hard-to-miss error message. With this banging against my test VM, I did a couple deploys and saw no missed requests. Success!
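A script along these lines is enough for that kind of test. This is a reconstruction for illustration, not the exact script I used; the function names and the max_requests parameter (handy for trying it out without an infinite loop) are assumptions.

```python
import time
import urllib.error
import urllib.request


def fetch_status(url, timeout=5.0):
    """Return the HTTP status code, or None if the request failed outright."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code   # the server answered, but with an error status
    except OSError:
        return None       # connection refused, timeout, DNS failure, ...


def hammer(url, fetch=fetch_status, max_requests=None, delay=0.0):
    """Request url repeatedly and report every non-200 response."""
    failures = []
    count = 0
    while max_requests is None or count < max_requests:
        status = fetch(url)
        count += 1
        if status != 200:
            print("!!!!!!!! request %d got %r !!!!!!!!" % (count, status))
            failures.append((count, status))
        time.sleep(delay)
    return failures
```

Left with max_requests=None it loops until interrupted, which is what you want while a deploy is running in another terminal.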

Summary

There are many ways to achieve zero downtime during code deploys, and each has its own trade-offs. For example, a reasonable strategy is to extract the critical parts out of the bigger application. Each part can then be updated independently. Later, the parts can also be scaled independently. The downside to this is more code and configuration to maintain.

What I ultimately ended up doing:

  • hot-reload supervisor and nginx configurations instead of just restarting them. Obvious thing to do in retrospect.
  • make sure the new gunicorn process is alive and being used by nginx before stopping the old gunicorn process.
  • and keep the whole setup relatively simple. As the project gets more usage, I will need to look at performance hotspots and figure out how to scale horizontally, but this should do for now!

How to Integrate healthchecks.io with PagerDuty

PagerDuty is a well-known incident management system. It provides alerting, on-call scheduling, escalation policies and incident tracking. If you use or plan on using PagerDuty, you can integrate it with your healthchecks.io account in a few simple steps!

Step 1: Add a new service to your PagerDuty account

Log into your PagerDuty account, go to Configuration > Services, and click on “Add New Service”:

Adding service to PagerDuty account

Give it a descriptive name, and select “Use our API directly”. Click on “Add Service” and take note of its API key:

List of services on PagerDuty

Step 2: Add a notification channel to your Healthchecks account

Log into your Healthchecks account, and go to Channels. In the “Add Notification Channel” section, select “PagerDuty” and paste the API key from the previous step:

Adding a Notification Channel in healthchecks.io

… and done! From now on, when a check goes down, Healthchecks will open a new incident in your PagerDuty account. When the check goes back up again, Healthchecks will resolve the incident. Simple and easy!

Intro


I needed a tool to alert me when my cron jobs silently fail. There are already a number of existing services for this, but it seemed like a fun thing to build myself. So I present to you: healthchecks.io

I am using this myself and it has already been useful for me a couple times. Say, a seemingly benign code change in one service causes my batch job to fail 12 hours later, in the middle of night. Without any monitoring I might be blissfully unaware for days or months, until I need those backups or whatever, but now I get an email alert and can get it sorted in minutes. Sweet!

I licensed this under the BSD licence, hoping it might be useful for other people too. It’s such a simple service it feels wrong to charge big bucks for it. You can grab the code from GitHub, run it, extend it, add unicorns or raptors, and so on. Or you can use the hosted service which is free. I cannot make guarantees that I’ll keep the hosted service around for ten years, though. The running costs for me currently are: $5/mo DigitalOcean box, two domain names and SSL certificates, a bit of space on S3 for daily backups, and my own time for maintaining this.

On the implementation side, it’s a pretty straightforward Django app with nothing particularly clever going on, which is a good thing. It does make use of a database trigger (which works with both PostgreSQL and MySQL), and it has neat JS horizontal slider widgets for setting duration parameters.

Now, about future plans. After about two months of spare time hacking, I feel healthchecks.io is well into MVP stage. It is already useful for me as-is. There’s a list of features I’m considering, but I also want to keep the code base simple, with few dependencies, and easy to deploy. I may add bits for reliability, integrations with other services, but probably no big new features like active checks (think Pingdom), status pages (statuspage.io), installable monitoring agents (NewRelic and many others), and so on. Keep it simple!