Investigating Gmail’s “This message seems dangerous”

I’ve been receiving multiple user reports that Gmail shows a red “This message seems dangerous” banner above some of the emails sent by Healthchecks.io. I’ve even seen some myself:

Gmail’s “This message seems dangerous” banner in action

The banner goes away after pressing “Looks safe”. And then, some time and some emails later, it is back.

It’s hard to know what exactly is causing the “This message seems dangerous” banner. Google of course won’t disclose the exact conditions for triggering it. The best I can do is try every fix I can think of.

Here’s what I’ve done so far.

SPF and DMARC records

For sending emails, Healthchecks.io uses Amazon SES. It’s super easy to set up DKIM records with AWS SES so that was already done from the beginning.

I’ve now also:

  • Configured a custom MAIL FROM domain (“mail.healthchecks.io”)
  • Added SPF and DMARC DNS records, and tested them with multiple online tools

Gmail seems to be happy with SPF, DKIM and DMARC

Update 20 November 2018: In DMARC reports, I’m noticing that a significant number of emails are failing both SPF and DKIM:

A section of DMARC weekly digest from Postmark

Apparently, last week Google processed more than 6000 emails that failed both SPF and DKIM checks. I’m thinking of two possibilities:

  • an email forwarder is changing email contents (adding a tracking pixel, adding a “Scanned by Antivirus XYZ” note or similar). With email contents changed, DKIM signatures are no longer valid.
  • somebody is spoofing emails from healthchecks.io addresses

Either way, I want to see if these messages are the culprit, so I’m changing the DMARC policy from “none” to “reject”. This instructs Gmail to ignore and throw away email messages that fail both SPF and DKIM checks. Let’s see what happens!
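
For illustration: the policy is the “p=” tag of the TXT record published under the “_dmarc” subdomain. Below is a minimal sketch of reading that tag. The record strings and the report address are made-up examples, not the actual healthchecks.io records.

```python
def parse_dmarc(record):
    """Split a DMARC TXT record into a dict of tag -> value."""
    tags = {}
    for part in record.split(";"):
        if "=" in part:
            name, value = part.split("=", 1)
            tags[name.strip()] = value.strip()
    return tags


# Hypothetical records, before and after the policy change
before = "v=DMARC1; p=none; rua=mailto:dmarc@example.org"
after = "v=DMARC1; p=reject; rua=mailto:dmarc@example.org"

print(parse_dmarc(before)["p"])  # none
print(parse_dmarc(after)["p"])   # reject
```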

List-Unsubscribe Header

Healthchecks.io notifications and monthly reports have always had an “Unsubscribe” link in the footer. I’ve now also added a “List-Unsubscribe” message header. Gmail seems to know how to use it:

Gmail shows an additional “Unsubscribe” link next to sender’s address. Clicking it brings up a neat confirmation dialog.

Maybe Gmail also looks for it as a spam/not-spam signal. As I said — I’m trying everything.
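
As a rough sketch of what adding the header can look like, here is an example using Python’s standard email package. The sender address, the URL and the function name are placeholders invented for illustration; the real Healthchecks.io code may build its messages differently.

```python
from email.message import EmailMessage


def build_report_email(to_addr, unsubscribe_url):
    """Build a message that carries both a footer link and the header."""
    msg = EmailMessage()
    msg["From"] = "reports@example.org"  # hypothetical sender
    msg["To"] = to_addr
    msg["Subject"] = "Monthly Report"
    # The header value is the unsubscribe URL wrapped in angle brackets
    msg["List-Unsubscribe"] = "<%s>" % unsubscribe_url
    msg.set_content(
        "Report body goes here.\n\nUnsubscribe: %s\n" % unsubscribe_url
    )
    return msg


msg = build_report_email(
    "alice@example.org", "https://example.org/unsubscribe/abc123"
)
print(msg["List-Unsubscribe"])
```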

More Careful Handling of Monthly Reports

I’m looking into reducing the bounce and complaint rates of the monthly reports. The rates as currently reported by AWS SES:

They don’t look too bad, but I’m trying to lower them some more with these two changes:

  • When a monthly report bounces or receives a complaint, automatically and mercilessly disable monthly reports for that address. This was already being done for “XYZ is down” email notifications.
  • If none of the user’s checks have received any pings in the last 6 months, then that’s an inactive account: don’t send monthly reports for that user.
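
The two rules above could be expressed roughly as follows. This is a sketch under assumptions: the Recipient class, its field names and the 180-day approximation of “6 months” are all hypothetical, not taken from the Healthchecks.io code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

INACTIVITY_CUTOFF = timedelta(days=180)  # "6 months", approximated


@dataclass
class Recipient:
    reports_enabled: bool = True
    # Most recent ping across all of the user's checks, if any
    last_ping: Optional[datetime] = None


def on_bounce_or_complaint(recipient):
    """Rule 1: a bounce or a complaint disables monthly reports."""
    recipient.reports_enabled = False


def should_send_report(recipient, now):
    """Rule 2: skip accounts with no pings within the cutoff."""
    if not recipient.reports_enabled:
        return False
    if recipient.last_ping is None:
        return False
    return now - recipient.last_ping <= INACTIVITY_CUTOFF
```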

Reduce the Number of Links in Emails

Each Healthchecks.io alert email contains a summary of all of the checks in the user’s account. For each check, it shows its current status, the date of the last received ping, and a link to the check’s “Details” page on the website.

I removed the “Details…” links to see whether Gmail dislikes emails with too many links.

7 December 2018: Solved?

I am cautiously optimistic that I’ve solved the issue by tweaking the contents of the emails. I haven’t seen the red Gmail warnings for a while now.

Here’s what happened: I noticed that removing the main content area from the email template makes Gmail’s red banner disappear. So I experimented with removing smaller and smaller chunks from the template until I had narrowed it down to a single CSS declaration:

/* MOBILE STYLES */
@media screen and (max-width: 525px) {
    .mobile-hide {
      display: none !important;
    }
}

This class was used to hide some elements to make things fit on mobile screens. Remove usages of this class: no red banner. Add it back: red banner is back! I tested this a number of times to make sure it was not just a coincidence.

Conclusion

If you are seeing the “This message seems dangerous” banner above your own emails, here’s one thing you can try: use your existing sending infrastructure to send a bare-bones “Hello World” email and see if the red banner shows up or not. If it doesn’t, then, presumably, something inside your regular email body is triggering it. Selectively remove chunks of the content until you find the problematic element. Change it or remove it.
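
The “remove chunks until you find the problematic element” step is essentially a bisection. Here is a minimal sketch, assuming that exactly one chunk is responsible and that the banner depends only on whether that chunk is present. The is_flagged callback is hypothetical: in practice it would mean sending a test email that contains only the given chunks and checking by hand whether Gmail shows the banner.

```python
def find_trigger(chunks, is_flagged):
    """Narrow a list of template chunks down to the one that triggers
    the banner, halving the candidate list on every round."""
    candidates = list(chunks)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if is_flagged(half):
            # The culprit is somewhere in the first half
            candidates = half
        else:
            # Otherwise it must be in the remainder
            candidates = candidates[len(half):]
    return candidates[0]


# Simulated run: pretend the CSS chunk is what Gmail objects to
chunks = ["header", "summary table", "footer", "mobile-hide css", "logo"]
culprit = find_trigger(chunks, lambda subset: "mobile-hide css" in subset)
print(culprit)  # mobile-hide css
```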

It is also important to do the other things: set up and validate SPF and DMARC records, test your unsubscribe links, monitor the bounce and complaint rates, monitor email blacklists, etc.

Good luck!

Pēteris,
Healthchecks.io

My One-person SaaS Side Project Celebrates its Third Birthday

First, a TL;DR on how much money I’m making. Healthchecks.io has around 90 paying customers, and the monthly revenue is a little above $700/mo. The bulk of that goes back into running costs.

About Me

I’m Pēteris Caune, a 34-year-old guy from Latvia. I’m married and the father of a baby daughter. I ride and race mountain bikes. In my day job, I work remotely for a small Irish company. I do Python web applications and Android mobile applications mostly, but it varies a lot.

About Healthchecks

Here’s the elevator pitch (assume a tall building).

Let’s say you just finished setting up a cron job that makes database backups and uploads them to S3. You just ran it by hand, and made sure a new .sql.gz file appeared in S3–all is well! Now, if it stops working one day six months from now, would you notice? The backup job can fail in many ways; here are just a few possibilities:

  • A well-meaning DBA changes the database password, but forgets to update the backup script
  • Slowly, over time, the machine doing backups runs out of disk space
  • Somebody “cleans up” AWS IAM policies and the script cannot upload to S3 anymore
  • Everybody has forgotten which machine and which user account is doing the database backups, and the machine gets decommissioned
  • The machine gets rebooted, backup script now fails because reboots were not tested

Here’s what you can do: edit the backup script to send an HTTP GET request to Healthchecks as the very last step. Healthchecks will treat these requests as “I’m still alive!” messages and will keep track of them. As soon as your service is silent for too long, it will send an alert (configurable: email, SMS, Slack, etc.) to you. And since Healthchecks runs on a separate host in a separate datacenter, you will get an alert even if your entire DC goes down.
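
If the job is a Python script rather than a shell script, the “ping as the very last step” idea might look like the sketch below. The function names are made up for illustration, and the ping URL would be the one shown for your check.

```python
import urllib.request


def send_ping(url, timeout=10):
    """Send an HTTP GET to the ping URL; return True on success."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except OSError:
        # URLError, HTTPError and socket timeouts are OSError subclasses.
        # A failed ping should not crash the job itself.
        return False


def run_then_ping(job, ping_url, ping=send_ping):
    """Run the job and ping only if it finished without raising."""
    job()  # if this raises, the ping below is never sent
    return ping(ping_url)
```

The ping deliberately comes after the job: if the job raises, no ping is sent, the check goes silent, and an alert follows.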

A quick word of caution: in this specific database backup example, you still want to test the backups by restoring them regularly. There are failure modes where the backup seemingly completes successfully, but the generated database dump is invalid or incomplete.

What other things can you monitor? Here are examples that would benefit from Healthchecks-style monitoring:

  • A job that runs weekly and sends out newsletters or weekly reports
  • A job that synchronizes business data between separate systems. For example, fetches an RSS feed and updates entries in a local database
  • A job that checks database replication status every minute
  • A job that updates DNS entries when the IP address changes
  • A job that renews a Let’s Encrypt certificate. Alternatively, monitors the expiry status of a certificate
  • A machine that sends pings unconditionally every minute. You receive an alert when the machine loses the network connection or is powered off

Why I started Healthchecks

I started work on Healthchecks three years ago, in summer 2015. I was looking for a service like this myself. Dead Man’s Snitch and Cronitor, higher-priced Healthchecks competitors, did already exist. However, they were too expensive for the relatively unimportant things I wanted to monitor. A little arrogantly, I thought I could build something that is cheaper and better. I was also looking for an excuse to work on something fun. Compared to some of my work assignments, here I would be in complete control of the product features, the design, the technical nitty gritty, the pricing strategy and everything else. I mulled over the idea for some time. Still undecided, I started hacking on a blank Django project in June 2015. A month later, I registered the healthchecks.io domain name, and at that point, the game was on!

Timeline of Notable Events

2015–06–11 First commit.

2015–07–18 Registered the healthchecks.io domain name

2015–07–29 The website goes live, running on a single $5 DigitalOcean droplet

2015–09–30 Added Slack and HipChat integrations

2015–10–21 Published “Deploying a Django App with No Downtime”, HN: 184 points, 93 comments

2015–12–10 Braintree payments setup complete.

2016–03–31 First paying customer! $5 MRR

2016–05–10 Implemented Team Access

2016–06–07 100M processed pings

2016–08–20 While road tripping and camping in the wilderness, hchk.io goes down for 24 hours.

Side note: After this incident I bought a used Thinkpad X240 and set up a development environment on it. It now travels with me when I leave home for more than a few hours. I have been poking around the servers while sitting in a parking lot before a cross-country MTB race. The laptop is set up with full disk encryption in case it gets lost or stolen. My GPG/SSH key sits on a Yubikey.

2016–09–24 200M processed pings

2016–10–31 $100 MRR

2016–12–27 Implemented Cron expression support

2017–05–04 Migration to Google Cloud Platform

2017–07–31 Finished off and published Cron Syntax Cheatsheet

2017–08–20 1 billion pings processed

2017–10–29 Migration to Hetzner. Bare metal servers.

2018–08–24 Processing around 100 pings per second. $700 MRR–still a hobby project.

Current Status

Healthchecks.io gets a dozen or so new signups per day. Most are just checking out the service. But there are also people who register and set up ten checks and ping them right away.

In September 2017, I implemented rate limiting. In summer 2018, it was off for a bit.

Currently, Healthchecks.io receives 8 million pings per day. There is rate-limiting for checks that get pinged very often. Of the daily 8 million, about 4 million get written to the database.

Most active accounts have 2–20 checks. There are quite a few heavy users too: one account has 900+ checks, another has 400+, another has 300+ checks. There are 17 accounts with over 100 checks.

The most popular notification method is email, followed by webhooks and Slack.

Profit-wise, Healthchecks is still firmly a side project for me. After bills and taxes, there is little profit left. I could cut costs by migrating to a couple of cheap VPSes, and by getting rid of the load balancer. I could severely limit the free plan, and force people to upgrade to paid plans. But by doing that, I would give up my initial goals: free for individuals, fairly priced for companies, and with a good quality service. Healthchecks would turn from “a project I love hacking on and am proud of” to “a project I do solely for money while hating myself”. So–I’m not doing that.

Future Plans

I have no big announcements to write about here. I will keep making small iterative improvements to the service. I will try and keep the code and the design as simple (think KISS) as I can. When it becomes financially viable, I will look at expanding the team-of-one, to improve the bus factor.

With that, thanks for reading! If you haven’t already, check out Healthchecks.io here. The project is also open source: you can grab the code from GitHub, change and improve it, and host your own instance.

Why Some Monthly Reports Were Empty (Or: Isn’t Coding Fun?)

If you have received a monthly report from healthchecks.io in the past few days, it might have looked like this:

Something is missing here…

The report has a header and a footer, but the actual content of the report is simply missing! For reference, this is how a monthly report is supposed to look:

This is a more useful monthly report.

So, what happened? It took me a while, but I think I have figured it out.

Monthly emails have both HTML and text versions. I noticed early on that both the HTML and the text versions are missing their content sections. The text version is shorter and simpler, so we will look at that. Here it is, simplified some more and reformatted for brevity:

<!-- Template emails/report-body-text.html -->

Hello,
This is a monthly report sent by healthchecks.io.

{% include "emails/summary-text.html" %}

Cheers,
The healthchecks.io Team


<!-- Template emails/summary-text.html -->

{% load humanize %}
 Status                | Name             | Last Ping
-----------------------+------------------+------------------------------------
{% for check in checks %}
{{ check.get_status }} | {{ check.name }} | {{ check.last_ping|naturaltime }}
{% endfor %}

The child template prints out a simple ASCII-art table. I keep it separate from the parent template because this way I can reuse it in “Check XYZ is DOWN” notifications too. In the incorrect monthly reports, the child template’s content seems to be completely missing. The reports didn’t even have the “Status | Name | Last Ping” header in them. This was very puzzling to me until I learned how the Django templating system handles exceptions in child templates when settings.DEBUG is set to False:

class IncludeNode(Node):
    context_key = '__include_context'

    # ... constructor etc.  ...

    def render(self, context):
        try:
            # ... template gets rendered here ...
        except Exception as e:
            if context.template.engine.debug:
                raise
            template_name = getattr(context, 'template_name', None) or 'unknown'
            warnings.warn(
                "Rendering {%% include '%s' %%} raised %s. In Django 2.1, "
                "this exception will be raised rather than silenced and "
                "rendered as an empty string." %
                (template_name, e.__class__.__name__),
                RemovedInDjango21Warning,
            )
            logger.warning(
                "Exception raised while rendering {%% include %%} for "
                "template '%s'. Empty string rendered instead.",
                template_name,
                exc_info=True,
            )
        return ''

Apparently, something in the child template was throwing an exception and causing it to be completely omitted. What could it be? A dead database connection? An SQL error? An exception in the Check.get_status() method? An exception in the “naturaltime” filter?

Looking For Patterns

All emails from healthchecks.io are sent through Amazon SES. SES sends me a delivery failure notification when there is, well, a delivery failure. Conveniently, these SES notifications also contain the original message. I have dug through the recent delivery failure notifications and found a few clues:

  • There were quite a few empty reports, but there were also many reports that looked fine
  • healthchecks.io is currently run from three app servers. Cross-checking with server logs, I found that one specific app server sent all the problematic reports. All of the good reports were sent from the other two.

The problematic app server has a special role in healthchecks.io infrastructure. It is not in the load balancer’s rotation, so it does not serve web traffic. Instead, I use it for testing code changes against the production database. Let’s call this server “Canary”. The update procedure for Canary is different from that of the other two app servers. The normal update procedure is as follows:

  • Put the server in maintenance mode. It then starts signalling “I’m down” to the load balancer
  • Stop “sendalerts” and “sendreports” background tasks
  • Wait 60 seconds for the load balancer to redirect traffic from this server
  • Check out a fresh copy of the Django code, install dependencies, prepare CSS and JS bundles, copy static files, restart uwsgi, reload nginx configuration
  • Wait 10 seconds for restarts to complete
  • Take the server out of maintenance mode. Load balancer detects the server as being “up” again
  • Start “sendalerts” and “sendreports” background tasks
  • Sleep for 60 seconds so the load balancer can start sending traffic again. This step is not strictly needed. It makes sure that I wait enough time before updating the next server

Canary is not serving regular web traffic, so the update can be simple and quick:

  • Check out a fresh copy of the Django code, install dependencies, prepare CSS and JS bundles, copy static files, restart uwsgi, reload nginx configuration

Note that this shorter version does not stop or start the “sendalerts” or “sendreports” tasks.

So, here is what happened. On 18 March, I committed and deployed a backwards-incompatible database schema change, which removes the “last_ping_body” field from the Check model. I updated all three app servers to reflect this change. But the “sendreports” task on Canary did not get restarted and so kept running the previous version of the code. Whenever Canary was sending a monthly report, it would render its template. From the template, it would run an SQL query that references the now-missing “last_ping_body” field. The Django ORM would throw an exception, which the Django templating system would swallow, rendering an empty string (“”) instead. And, just like that, the end user would receive an empty report.

There were three app servers all running the “sendreports” background command. They would each wake up once a minute and check if any reports are due to be sent. One of the three was sending empty reports so, from 18 March until 22 March, about 33% of the sent monthly reports were empty.

Conclusion

I am taking several steps to make sure that a similar problem does not happen again in the future. Most importantly, Canary was not supposed to be continually running the “sendalerts” and “sendreports” tasks. I have updated the deployment scripts so that these tasks always get stopped before updating the Django app. And, for Canary, they have to be started manually.

When serious errors like bad SQL queries happen, they should be visible and loud. I will look into configuring the Django app so that it does not silently ignore exceptions from included templates. I have already made a few small changes to move database operations out of the template rendering stage. I will also be making sure that the “sendalerts” and “sendreports” tasks deliver Sentry notifications when they crash.
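
One possible way to make such failures loud, going by the IncludeNode code quoted earlier: that code re-raises the exception when the template engine’s debug flag is on. The flag can be set separately from DEBUG through the “debug” option of the DjangoTemplates backend. The fragment below is only a sketch of that idea, not necessarily the configuration healthchecks.io ended up using. Template debug mode has other effects too (for example on template caching and on the level of error detail), so it would need testing before going to production.

```python
# settings.py (fragment)
DEBUG = False

TEMPLATES = [
    {
        "BACKEND": "django.template.backends.django.DjangoTemplates",
        "DIRS": [],
        "APP_DIRS": True,
        "OPTIONS": {
            # Defaults to the value of DEBUG. With it on, the quoted
            # IncludeNode.render() re-raises instead of rendering "".
            "debug": True,
        },
    },
]
```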

Apologies for the empty reports, they should not happen again!

Happy monitoring,
— Pēteris Caune, healthchecks.io

From DigitalOcean to Linode to Google Cloud Platform: the Evolution of healthchecks.io

In this article I will look at the current hosting setup of healthchecks.io, how it has evolved during the past two years, and what challenges I faced running this small but lively service.

DigitalOcean

When I first made the service public 2 years ago, it was running off a single $5/mo DigitalOcean droplet. It had a single CPU core and 512MB of RAM. The droplet was running both the Django web application and the Postgres database. Initially the service was receiving next to no traffic so everything was working well. A few months later I switched to a $20/mo droplet (two cores, 2GB RAM). I was deploying the code with a Fabric script, and had put some thought into avoiding downtime during code deploys.

Fast forward to June 2016. Feature-wise healthchecks.io was already useful, but its fault tolerance story was pretty bad. It was hosted on a single server with nightly database backups. As a first step, I decided to split up the components (database, healthchecks.io server, hchk.io server, background tasks) and run them on separate VMs. This helped some failure scenarios. For example, if the main website went down or experienced heavy traffic, hchk.io would still accept pings, and notifications would still be sent out. However, the database server going down would still be disastrous.

compose.io

The database being a single point of failure did not sit well with me, and I kept exploring options for a HA Postgres setup. It looked as if managing a fault-tolerant database cluster would be a full-time systems administration job in itself. I looked at the “pay someone else to do it” options. They were fairly limited by the tight budget I had. I could not afford Heroku Postgres with HA, for example, as it starts at $200/mo. After much consideration, I committed to go with compose.io. On paper their specs and pricing looked good, and I had tested their service with a snapshot of production data (but not with production-level traffic!). I even got to see their fail-over process in action. The database was unavailable for a few minutes during the fail-over, but it did come back up and continued to work without my intervention.

The move to compose.io did not go well. After I pointed production traffic to the new database, I soon started seeing a variety of Sentry reports about database connection problems. compose.io support advised me that my database was starved of memory, and recommended scaling up its RAM allocation. Using their convenient scaling slider, I increased the RAM allocation and my monthly bill by an order of magnitude. After doing that, I was seeing fewer dropped database connections but they were still fairly regular. At this point I was:

  • getting complaints from users
  • waking up every 3 or so hours during the nights to check up on services
  • paying significantly more than originally planned
  • not getting actionable advice from compose.io support on how to troubleshoot my connection issues

Linode

Given enough time with tcpdump and WireShark I would probably have solved my dropped connection issues. But I needed to fix things fast. I gave up the HA requirement and moved the database back to a plain VPS. This time I went with Linode for two reasons:

  • it had slightly better pricing than DigitalOcean.
  • I was going to use TLS-terminating load balancers. DigitalOcean had just launched theirs. Linode’s NodeBalancers had been around for a while and seemed a safer choice. Also, they supported IPv6 and DO did not.

I updated my deployment scripts, made a migration plan, did a few dry runs, slept on it, and then migrated over to Linode. Error reports ceased, and my monthly bill was in check again. But once again the database was a single point of failure. On the bright side, I was now load balancing the incoming HTTP requests, and my service could tolerate the loss of a web server node.

The traffic that hchk.io receives comes in bursts. On average, it receives around 30 requests per second, but there is a short traffic spike (hundreds of requests at once) every five minutes, a bigger spike every round hour, and a period of elevated request rate every midnight (UTC).

Traffic spikes, as seen from Postgres

hchk.io traffic also is unusual in that every request is a “one-off” and needs to do a brand new TLS handshake. I learned that a single NodeBalancer can only do about 200 TLS handshakes per second. During the traffic spikes the load balancer was becoming a bottleneck. Requests would sometimes take 3+ seconds to complete. Clients that use aggressive timeout settings would see them as failed requests. A band-aid fix was to add a second load balancer, and split the traffic between the two using round-robin DNS.

I also learned that DigitalOcean’s load balancers have similar TLS handshake performance to Linode’s so they were no good either. Looking further, I found that Google’s Cloud Load Balancer can handle as many handshakes as I could throw at it. And it had IPv6 support, albeit in alpha preview state, too! So I started plotting a move to Google Cloud Platform.

Google Cloud Platform

I went through the migration process once again, and since May 4, 2017, healthchecks.io has been running on Google Cloud Platform. The current setup is:

  • a managed Cloud SQL database. Postgres on Cloud SQL is currently in beta and they don’t have the HA option yet, but hopefully that is coming in the future
  • three app servers, provisioned by my plain Fabric scripts
  • Google’s Cloud Load Balancer splits traffic between the three app servers

I experimented with GKE, Google’s managed version of Kubernetes, but ultimately opted to keep things simple and straightforward: plain virtual machines and plain Fabric commands for various administrative tasks.

Once I started exploring Google Cloud Platform’s logging tools I came across concerning “502 Bad Gateway” log messages coming from the load balancer. They were infrequent and so took a long time to troubleshoot (make a configuration change — monitor logs for a few days to see if errors are gone — repeat), but I am cautiously optimistic these are now fixed for good. In short, I had to tune a number of sysctl parameters and nginx options so my app servers could properly handle bursts of new connections. The following resources helped a lot:

For updating code on app servers I am using the “rolling update” pattern: take an app server out of load balancer rotation, update it, put it back in rotation, then move on to the next app server. Here is an outline of this process for a single server:

import time

# Helper tasks (maintenance_on, www, hchk, nginx, ...) are defined elsewhere

def update():
    # Going down...
    maintenance_on()
    stop_sendalerts()
    stop_sendreports()
    print("sleeping for 120s")
    # Wait for load balancer to fail us
    time.sleep(120)

    # Actual update
    www()
    hchk()
    nginx()

    # Coming up...
    print("sleeping for 30s")
    time.sleep(30)
    maintenance_off()
    start_sendalerts()
    start_sendreports()
    print("sleeping for 120s")
    # Wait for load balancer to declare us healthy
    time.sleep(120)

maintenance_on() puts the server in “maintenance mode”. When in maintenance mode, the server can still process incoming requests, but it starts reporting itself as unhealthy to the load balancer, and the load balancer gradually diverts traffic away from it. It takes a while for the load balancer to update, so the script waits 120 seconds before it goes ahead with updating and restarting. A complete update takes some time, but it is completely transparent for the end users … as long as I am not deploying backwards-incompatible database schema changes!
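
One common way to implement such a maintenance mode is a flag file that the health check endpoint looks for. The sketch below assumes that approach. The flag-file mechanism, the function signatures and the response bodies are illustrative guesses, not the actual healthchecks.io implementation.

```python
import os


def health_response(flag_path):
    """Return (status_code, body) for the load balancer's health probe.

    Regular requests are still served as usual; only the probe result
    changes, so the load balancer drains traffic gradually."""
    if os.path.exists(flag_path):
        return 503, "maintenance"
    return 200, "ok"


def maintenance_on(flag_path):
    """Create the flag file: the server starts reporting itself as down."""
    open(flag_path, "w").close()


def maintenance_off(flag_path):
    """Remove the flag file: the server reports itself as up again."""
    if os.path.exists(flag_path):
        os.remove(flag_path)
```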

This is where healthchecks.io is now, hosting-wise. Page load times are good. No 5xx errors in the load balancer logs (fingers crossed!). The database is currently not fault tolerant but hopefully that will change in the future. Monthly bill from Google is in the $150-$200 range.

Lessons learned

When evaluating a product or service, it is imperative to test it with production-level workload. I learned this with my compose.io fiasco, and also when I hit NodeBalancer’s capacity limits.

Simple problems I can often solve myself. When I needed help with harder problems, and tried contacting Support, I was getting what I paid for, so to speak:

  • Responses from compose.io support were about as useful as Richmond’s comments on flashing lights.
  • Google charges for support separately. It starts at $150/mo for the Silver plan.
  • Linode gave me a straight and honest answer about NodeBalancer limitations, which I appreciated.

Finally, as I was looking for solutions, I explored a number of tools and technologies which I did not ultimately end up using. They all go into the “bag of tricks” and may be useful in future projects.

Meet the healthchecks.io Ops Team!

With that, thanks for reading! And, if you are not yet monitoring your cron jobs and background tasks for silent failures, I welcome you to check out healthchecks.io!

— Pēteris Caune, healthchecks.io

Cron Expressions: monitoring for jobs with fixed schedules

Cron expression support has been the most requested feature since the launch of healthchecks.io. Long story short, it’s been implemented and is ready to use! You can now set up a time-based schedule for your checks, using the exact same syntax you use in crontab files.

For each check, you can switch between “simple” and “cron” mode:


In the simple mode, you select two parameters: period and grace time. Period is how often you expect the check to be pinged. When a ping does not arrive on time, grace time specifies how long to wait before sending an alert.
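
In other words, the alert deadline is the last ping time plus the period plus the grace time. Here is a minimal sketch of that arithmetic. The function name and the “up” / “late” / “down” labels are illustrative and may not match the statuses the service uses internally.

```python
from datetime import datetime, timedelta


def check_status(last_ping, period, grace, now):
    """Classify a check in "simple" mode."""
    if now <= last_ping + period:
        return "up"      # the next ping is not due yet
    if now <= last_ping + period + grace:
        return "late"    # overdue, but still within grace time
    return "down"        # grace time is used up: send an alert


last_ping = datetime(2017, 1, 1, 12, 0)
period = timedelta(hours=1)
grace = timedelta(minutes=15)

print(check_status(last_ping, period, grace, datetime(2017, 1, 1, 12, 30)))  # up
print(check_status(last_ping, period, grace, datetime(2017, 1, 1, 13, 10)))  # late
print(check_status(last_ping, period, grace, datetime(2017, 1, 1, 13, 16)))  # down
```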

In the cron mode, you specify a cron expression, a time zone, and grace time:


The cron expression defines a fixed, time-based schedule. It allows for greater flexibility than the simple “period” parameter. For example, you can set up a check that expects a ping at the beginning of every other hour, only on weekdays. Here’s the expression you would use for that: “0 0/2 * * 1-5”.
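
To make the example expression concrete, here is a small, simplified matcher written with the standard library only. It is a sketch for illustration, not the parser the service uses: it handles “*”, single values, ranges, lists and steps, reads “0/2” as “start at 0, then every 2nd value”, and simply requires all five fields to match (real cron treats day-of-month and day-of-week specially when both are restricted).

```python
from datetime import datetime


def expand(field, lo, hi):
    """Expand one cron field into the set of integers it allows."""
    values = set()
    for part in field.split(","):
        rng, _, step = part.partition("/")
        step = int(step) if step else 1
        if rng == "*":
            start, end = lo, hi
        elif "-" in rng:
            a, b = rng.split("-")
            start, end = int(a), int(b)
        else:
            start = int(rng)
            # "0/2" runs from 0 up to the field's maximum; "8" is just 8
            end = hi if "/" in part else start
        values.update(range(start, end + 1, step))
    return values


def matches(expr, dt):
    """Return True if datetime dt falls on the schedule given by expr."""
    minute, hour, dom, month, dow = expr.split()
    cron_dow = (dt.weekday() + 1) % 7  # cron: 0 = Sunday; Python: 0 = Monday
    return (
        dt.minute in expand(minute, 0, 59)
        and dt.hour in expand(hour, 0, 23)
        and dt.day in expand(dom, 1, 31)
        and dt.month in expand(month, 1, 12)
        and cron_dow in expand(dow, 0, 6)
    )


expr = "0 0/2 * * 1-5"
print(matches(expr, datetime(2017, 1, 2, 4, 0)))  # Monday 04:00 -> True
print(matches(expr, datetime(2017, 1, 2, 5, 0)))  # Monday 05:00 -> False
print(matches(expr, datetime(2017, 1, 1, 4, 0)))  # Sunday 04:00 -> False
```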

If your server’s time zone is not UTC, you must also specify its time zone. The time zone field supports auto-complete and lets you select time zones by their IANA names. On Ubuntu systems, you can look up the system’s time zone in the /etc/timezone file.

Finally, the grace time parameter works the same as in the “simple” mode. Set it to a value that comfortably covers the expected run time of your job.

Example

Let’s say you have a server that runs a backup script each morning at 6:08 AM, New York time. The backup script usually takes 1 to 2 minutes to complete and should never exceed 5 minutes. The crontab entry might look something like this:

8 6 * * * /home/user/backup.sh && curl -fsS --retry 3 https://hchk.io/fe33025a-330d-4bf0-93c4-7e433bb474da > /dev/null

For monitoring this cron job, you would set up a check as follows:

Cron expression: 8 6 * * *
Timezone: America/New_York
Grace time: 5 minutes
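
The time zone setting matters because the same local time maps to different UTC times across the year. Here is a short sketch of the arithmetic for this example, using the standard zoneinfo module (Python 3.9+, and it needs time zone data available on the system). The function name is made up for illustration.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

NEW_YORK = ZoneInfo("America/New_York")
GRACE = timedelta(minutes=5)


def alert_deadline_utc(year, month, day):
    """Expected ping time (6:08 AM New York) plus grace time, in UTC."""
    expected = datetime(year, month, day, 6, 8, tzinfo=NEW_YORK)
    return (expected + GRACE).astimezone(timezone.utc)


print(alert_deadline_utc(2017, 1, 16))  # winter: 11:13 UTC
print(alert_deadline_utc(2017, 7, 17))  # summer: 10:13 UTC
```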

Notes for self-hosted installations

If you are self-hosting healthchecks.io code, there are a few things you will want to know.

Database triggers are not used any more. There used to be a management command, ensuretriggers, for creating a database trigger. The trigger would automatically update the api_check.alert_after field whenever a check is saved. This trigger is not needed any more and would interfere with cron-style checks. Remove it with the droptriggers management command:

./manage.py droptriggers

It is also a good idea to make a fresh backup of the database before major upgrades such as this one.

Conclusion

This is the initial release of cron expression support. It works well enough to be useful, but will still require careful testing, especially around daylight saving time handling. It may also see various small user interface refinements. If you use cron-style checks and notice any problems, please file an issue!

Adding cron expression support has been one of the more complex tasks since the start of the project, but it has been worth it. Since soft-launching the feature two weeks ago, 140+ new checks have already been set up to use cron expressions. This has been gratifying to see.

With that, happy monitoring and happy 2017!

Pēteris, 
healthchecks.io

Outage Postmortem — 20 August 2016

On August 20, the healthchecks.io service experienced a 24-hour outage. During this time, it was unable to process any incoming pings. This caused a large number of checks to go into the “down” state. This, in turn, caused a large number of “Your check has gone down” alerts to be sent out.

Summary

The direct cause was an unattended upgrade which restarted the PostgreSQL database process. The database restart alone should have only caused a brief downtime. Unfortunately, one of the database clients was not prepared to deal with unexpectedly closed database connections. It continuously tried to use the closed connection–and failed every time.

Meanwhile, the single maintainer of the healthchecks.io website (which is me, Pēteris) was out and about, unreachable by email, and unaware of any of the incoming monitoring alerts or Twitter messages.

Technical Details

The healthchecks.io service consists of the following components, each running on a separate virtual machine hosted at DigitalOcean, running Ubuntu 16.04:

  • a server running the PostgreSQL database
  • a server running the ping listener service. This is an NGINX web server reverse-proxying a small, custom Go program. Let’s call the program “hchk.go”. The hchk.io address points to this server.
  • a server running the Django website. This is a caddy web server, reverse-proxying a uwsgi process. The healthchecks.io address points to this server.
  • a server running the Django management command for sending out alerts.

Out of the box, an Ubuntu 16.04 server comes with unattended security updates enabled. Late August 19, the database server installed a security update for PostgreSQL. It then proceeded to restart the database process. During the restart, any open database connections are of course closed.

After the database restart, the website and the alerting service continued to work normally. They use the Django framework and the psycopg2 database driver, which together take care of opening new connections as necessary.

The hchk.go program, however, is written in a different programming language (Go), and uses a different database driver (pgx). The program was not tested in the scenario where the database closes a connection. Its error handling for failed SQL operations was effectively a “write the error to a log file and march on”. So, when the database restarted, hchk.go was stuck with a dead connection, and returned HTTP 400 “Bad Request” messages to each and every incoming ping. A simple restart of the hchk.go process fixed the immediate problem.
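To illustrate the general idea of recovering from a dead connection, here is a minimal sketch in Python. It is not the actual hchk.go fix and does not use the pgx API: `ReconnectingClient`, `ConnectionClosed` and the `connect` callable are hypothetical names made up for this example. The point is only that a failed operation should drop the dead connection and open a new one, rather than reuse the dead one forever.

```python
import time


class ConnectionClosed(Exception):
    """Stands in for whatever error a real driver raises on a dead connection."""


class ReconnectingClient:
    """Runs operations on a connection and reconnects when the connection is dead."""

    def __init__(self, connect, max_attempts=3, delay=0.0):
        self._connect = connect          # callable returning a fresh connection
        self._max_attempts = max_attempts
        self._delay = delay              # seconds to wait between attempts
        self._conn = None

    def execute(self, operation):
        """Call operation(conn); on ConnectionClosed, reconnect and try again."""
        last_error = None
        for _ in range(self._max_attempts):
            try:
                if self._conn is None:
                    self._conn = self._connect()
                return operation(self._conn)
            except ConnectionClosed as e:
                # Forget the dead connection so the next attempt opens a new one
                last_error = e
                self._conn = None
                time.sleep(self._delay)
        # All attempts failed: surface the error instead of silently marching on
        raise last_error
```

With a wrapper along these lines, a database restart would cost a few failed attempts rather than leave the process stuck until someone restarts it by hand.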

To make it clear, there is nothing wrong with the Go programming language or the database driver used. They were just used incorrectly by the author of the hchk.go program (me). The reason for writing a custom Go program in the first place is performance. Before implementing the hchk.go program, the server’s CPU usage was starting to get dominated by the incoming pings. CPU was mostly spent running Python code. After the switch, CPU usage dropped significantly, and currently CPU is mostly spent on TLS handshakes.

Monitoring

The healthchecks.io website has a few provisions for monitoring itself:

The monitoring worked as expected: within 5 minutes after the database restart there were alerts sent out to me. Unfortunately no-one was around for the next 24 hours to see the alerts. Had I checked my email right after the issue started, it would still have taken about 300 kilometers and 4+ hours to get to a PC with the necessary SSH keys.

Next Steps

  • Disable unattended upgrades. Done on the database server. Will do on the other servers during the next upgrade cycle.
  • Fix the hchk.go application to handle closed or misbehaving database connections. Done.
  • Keep mobile roaming data enabled next time I go on a trip abroad, and read emails.
  • Set up a small laptop with development and deployment tools, and take it on the longer trips.

In closing: I apologize to all healthchecks.io users for any inconvenience caused. For a monitoring service, any downtime is unacceptable.

Being run on a shoestring budget, healthchecks.io can only offer best-effort availability. I, however, welcome the challenge and will aim to make the best of the resources available!

Pēteris Caune,
healthchecks.io

DIY SSL Certificate Expiry Monitoring

In this post, we will set up simple SSL certificate expiry monitoring, using cron, the ssl-cert-check script and a fail-safe provided by an external service, healthchecks.io.

Let’s say you administer a website, namely example.com. You want to be sure that the SSL certificate of example.com is always renewed in time. Like seldom-used passwords, the annual or biennial certificate expiry date is easy to forget about until it is too late. You want to set up an automated system that will remind you 30 days before the certificate expires. Finally, you want something that will “guard the guards themselves”. If your SSL monitoring setup breaks down, you want to be notified about that as well!

The main building block will be the ssl-cert-check script. On Ubuntu and Debian you can install it with a simple

apt-get install ssl-cert-check

Here are its options:

$ ssl-cert-check 
Usage: /usr/bin/ssl-cert-check [ -e email address ] [ -x days ] [-q] [-a] [-b] [-h] [-i] [-n] [-v] { [ -s common_name ] && [ -p port] } || { [ -f cert_file ] } || { [ -c certificate file ] }

-a : Send a warning message through E-mail
-b : Will not print header
-c cert file : Print the expiration date for the PEM or PKCS12 formatted certificate in cert file
-e E-mail address : E-mail address to send expiration notices
-f cert file : File with a list of FQDNs and ports
-h : Print this screen
-i : Print the issuer of the certificate
-k password : PKCS12 file password
-n : Run as a Nagios plugin
-p port : Port to connect to (interactive mode)
-s commmon name : Server to connect to (interactive mode)
-t type : Specify the certificate type
-q : Don’t print anything on the console
-v : Specify a specific protocol version to use (tls, ssl2, ssl3)
-V : Only print validation data
-x days : Certificate expiration interval (eg. if cert_date < days)

Let’s run ssl-cert-check on example.com:

$ ssl-cert-check -s example.com -p 443

Host                             Status       Expires      Days
-------------------------------- ------------ ------------ ----
example.com:443                  Valid        Nov 28 2018  928

Great, the status field says “Valid”. Now, let’s run it using a high certificate expiration interval (-x parameter):

$ ssl-cert-check -s example.com -p 443 -x 1000

Host                             Status       Expires      Days
-------------------------------- ------------ ------------ ----
example.com:443                  Expiring     Nov 28 2018  928

Perfect, the status is now “Expiring”. Now, the plan is to run this command regularly and get alerted as soon as the status is anything other than “Valid”.

The ssl-cert-check script has a few flags that will be useful inside a cron task: -q suppresses console output and -n sets the exit code, depending on the status.

$ ssl-cert-check -s example.com -p 443 -x 30 -n -q
$ echo $?
# prints 0

$ ssl-cert-check -s example.com -p 443 -x 1000 -n -q
$ echo $?
# prints 1

Next, let’s set up a check on healthchecks.io. Log in, and add a new check. Give it a descriptive name like:

A fresh healthchecks.io account for SSL certificate monitoring

Here’s how this will work: the check’s URL will need to be requested at least daily to keep it in the green “up” state. As soon as it is not requested for more than a day, its state will go to a red “down” and you will receive an email notification. If you prefer to be notified differently, in the “Integrations” section, you can set up Slack, HipChat, PagerDuty, VictorOps and Pushover notifications. You can add more email addresses to be notified, and you can also integrate with your notification systems using webhooks.
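The up/down rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this post, not the actual healthchecks.io code; the function name `check_status` and the `period` parameter are made up for the example.

```python
from datetime import datetime, timedelta


def check_status(last_ping, now, period=timedelta(days=1)):
    """Return "up" if the check was pinged within the period, else "down"."""
    if now - last_ping > period:
        return "down"
    return "up"


now = datetime(2016, 6, 1, 7, 20)
print(check_status(now - timedelta(hours=23), now))
print(check_status(now - timedelta(hours=25), now))
```

The first call reports a check pinged 23 hours ago, which is still within the one-day period. The second call reports a check that has been silent for 25 hours, which is when the alert would be sent.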

Now, with the check’s URL handy, we are ready to put together a cron command:

ssl-cert-check -s example.com -p 443 -x 30 -n -q && curl -fsS --retry 3 https://hchk.io/your-uuid-here > /dev/null

Let’s go over what this command does. First, we execute the ssl-cert-check command. If the certificate is valid for at least 30 more days, it exits with exit code 0. Otherwise, it exits with a non-zero exit code.

Next, we chain the curl call using the && operator. When two commands are delimited with &&, the second command only runs if the first command succeeds. In our case, if the certificate is valid, the curl command will run. If the certificate is expiring or expired, the curl command will not run.

The curl command has a few flags to suppress console output, and to retry transient HTTP failures:

  • -f, --fail Makes curl treat HTTP error responses (status 400 and above) as errors
  • -s, --silent Silent or quiet mode. Don’t show progress meter or error messages
  • -S, --show-error When used with -s it makes curl show error message if it fails.
  • --retry <num> If a transient error is returned when curl tries to perform a transfer, it will retry this several times before giving up. Setting the number to 0 makes curl do no retries (which is the default). Transient error means either: a timeout, an FTP 4xx response code or an HTTP 5xx response code.

Finally, we redirect curl’s output to /dev/null. If cron runs a command and the command outputs anything to the console, cron will email the output, and here we do not want that.

You can now see how this will all work together: if the ssl-cert-check command returns success, the curl command will run and keep the check in the green “up” state. If the ssl-cert-check command returns a failure, the curl command will not be run and the check will go down. And if something happens to the whole machine running cron, the check will also go down. When the check goes down for either reason, healthchecks.io will send you an alert.
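For readers who prefer a script over a shell one-liner, the same “ping only on success” logic can be sketched in Python. This is a rough equivalent under a few assumptions: the function names are made up, the URL is the placeholder used in this post, and the retry loop retries on any network error, which is looser than curl’s definition of a transient error. The `runner` and `opener` parameters exist only so the functions can be exercised without running ssl-cert-check or touching the network.

```python
import subprocess
import urllib.request

PING_URL = "https://hchk.io/your-uuid-here"  # placeholder, as in the cron command


def certificate_ok(host, port=443, days=30, runner=subprocess.call):
    """Return True when ssl-cert-check exits with code 0."""
    cmd = ["ssl-cert-check", "-s", host, "-p", str(port),
           "-x", str(days), "-n", "-q"]
    return runner(cmd) == 0


def ping(url, opener=urllib.request.urlopen, retries=3):
    """Request the check's URL; retry a few times if the request fails."""
    for _ in range(retries + 1):
        try:
            opener(url, timeout=10)
            return True
        except OSError:  # URLError and HTTPError are subclasses of OSError
            continue
    return False


def run_once(host, runner=subprocess.call, opener=urllib.request.urlopen):
    """Same shape as `ssl-cert-check ... && curl ...`."""
    # The ping is only sent when the certificate check succeeds
    if certificate_ok(host, runner=runner):
        return ping(PING_URL, opener=opener)
    return False
```

Calling `run_once("example.com")` from a daily cron job would then behave like the one-liner: no ping is sent when the certificate is expiring, or when the script does not run at all.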

Now it is time to add this command to cron. Pick or launch an Ubuntu or Debian machine you expect to be up and running for a long time. If you have a machine that’s dedicated to doing backups and similar background jobs, that is perfect. Log in as an unprivileged user and use the “crontab -e” command to edit the user’s crontab:

$ crontab -e

In the crontab editor, add this line:

20 7 * * * ssl-cert-check -s example.com -p 443 -x 30 -n -q && curl -fsS --retry 3 https://hchk.io/your-uuid-here > /dev/null

Save the file and you are done. From now on, each day at 7:20 your machine will run an SSL expiry check and then notify healthchecks.io. If the certificate expires in less than 30 days, or if the machine stops working, healthchecks.io will send you a notification.

This is an example of how you can set up simple, reasonably robust monitoring tasks. If this setup seems too hacky for your taste, if you have no appropriate machine to run cron tasks on, or if you are looking for more than just a certificate expiry check, it makes good sense to look into self-hosted or SaaS services for SSL monitoring.

Deploying a Django App with No Downtime

When healthchecks.io started to receive more than 1 request per second, it became clear I could not just go on carelessly restarting web servers after code deploys. For a monitoring service, it would be bad form to miss even a few HTTP requests. And, going forward, if the server gets busier, the problem only becomes bigger.

To give a quick overview of what I’m working with, the app is a relatively straightforward Django app, served by gunicorn behind nginx. Data lives in a PostgreSQL database. The gunicorn process and an additional background job are both managed by supervisor. It’s hosted on a single $20 DigitalOcean droplet.


Aside: With regard to technology choices, the guiding principle I’ve been following is to keep the stack as simple as is feasible for as long as possible. Adding things, like load balancers, database replication, key value store, message queue and so on, would each have certain benefits. Then on the other hand, there would also be more stuff to be managed, monitored, and kept backed up. Also, for someone new to the project, it would take more time to figure out the “ins and outs” of the system and set up everything from scratch. I see it as a nifty challenge to stay with the simple, no-frills setup, while also not compromising performance or features.


The deployment mechanism I’ve used thus far is a Fabric script plus configuration templates for supervisor and nginx. Each time I run “fab deploy” from my workstation, the Fabric script does the following on the remote host:

  • sets up a new directory for the new deployment. Let’s refer to this directory as $TARGET.
  • sets up a python3 virtualenv in $TARGET/venv
  • fetches the latest snapshot of code from GitHub into $TARGET. It is convenient to use GitHub’s Subversion interface for this and run an “svn export” command. It produces just the source files without any version control metadata, which is exactly what’s needed.
  • installs dependencies listed in the requirements file. These get installed into the new virtualenv and don’t affect the live application. Downloading and building the dependencies takes up to a minute.
  • runs Django management commands to collect static files, run database migrations etc.
  • rewrites the supervisor configuration file to run gunicorn from the new virtual environment
  • updates nginx configuration, in case I’ve changed anything in the nginx configuration template
  • runs “supervisorctl reload” and “/etc/init.d/nginx restart”. At this point the web application becomes unavailable and remains unavailable until supervisor starts back up, launches gunicorn process, and the Django code initializes. This usually takes 5 to 10 seconds, and nginx would typically return “502 Bad Gateway” responses during this time.
  • All done!

Here’s how the relevant part of the Fabric script looks. The virtualenv context manager seen below is from the excellent fabtools library.

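# Imports for the Fabric 1.x API and fabtools
import datetime
import os

from fabric.api import cd, put, run, settings, sudo
from fabric.contrib.files import upload_template
from fabtools.python import virtualenv


def deploy():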
    """ Checks out code, prepares venv, runs management commands,
    updates supervisor and nginx configuration. """

    now = datetime.datetime.today()
    now_string = now.strftime("%Y%m%d-%H%M%S")
    project_dir = "/home/hc/webapps/hc-%s" % now_string
    venv_dir = os.path.join(project_dir, "venv")

    svn_url = "https://github.com/healthchecks/healthchecks/trunk"
    run("svn export %s %s" % (svn_url, project_dir))

    with cd(project_dir):
        run("virtualenv --python=python3 --system-site-packages venv")
        # local_settings.py is where things like access keys go
        put("local_settings.py", ".")
        put("newrelic.ini", ".")

        with virtualenv(venv_dir):
            run("pip install -U gunicorn raven newrelic")
            run("pip install -r requirements.txt")
            run("python manage.py collectstatic --noinput")
            run("python manage.py compress")

            with settings(user="hc"):
                run("python manage.py migrate")
                run("python manage.py ensuretriggers")
                run("python manage.py clearsessions")

    switch(project_dir)

def switch(project_dir):
    # Supervisor
    upload_template("supervisor/hc.conf.tmpl",
                    "/etc/supervisor/conf.d/hc.conf",
                    context=locals(),
                    backup=False,
                    use_sudo=True)

    upload_template("supervisor/hc_sendalerts.conf.tmpl",
                    "/etc/supervisor/conf.d/hc_sendalerts.conf",
                    context=locals(),
                    backup=False,
                    use_sudo=True)

    # Nginx
    upload_template("nginx/hc.conf.tmpl",
                    "/etc/nginx/sites-enabled/hc.conf",
                    context=locals(),
                    backup=False,
                    use_sudo=True)

    sudo("supervisorctl reload")
    sudo("/etc/init.d/nginx restart")

Now, how to eliminate the downtime during the last steps of each deploy? Let’s set some constraints: no load balancer (for now anyway). Everything runs off a single box, and even a single non-200 response is undesirable. And, baby steps: I will consider the simple (and common) case when there are no database migrations to be applied or they are backwards-compatible: the old version of the app keeps working acceptably after the migrations are applied.

The first idea I looked into was based on the observation that availability is more important for some parts of the app than others. Specifically, the API part of the app listens for pings from the monitored client systems, and the frontend part serves pages to normal website visitors. While it would be embarrassing to show error pages to human visitors, not missing any pings is actually more important. A missed ping can lead to a false alert being sent sometime later. That’s even more embarrassing!

I considered and prototyped listening to pings using Amazon API Gateway. It would put ping messages in an Amazon SQS queue, which the Django app could consume at its leisure. This would be a relatively simple way to improve availability and scalability by quite a lot at the cost of somewhat increased complexity and a new external dependency. I might look into this again in the future.

Next idea: separate the “listen to pings” functionality from the rest of the Django app. The ping listener logic is very simple and, ultimately, amounts to two SQL operations: one update and one insert. It could be easy enough to rewrite this part, perhaps using one of the Python microframeworks, or maybe using a language other than Python, or maybe even handle it from nginx itself, using the ngx_postgres module. For a little amusement, here’s the nginx configuration fragment which, basically, works as-is (please forgive the funny looking regular expression):

location ~ ^/(\w\w\w\w\w\w\w\w-\w\w\w\w-\w\w\w\w-\w\w\w\w-\w\w\w\w\w\w\w\w\w\w\w\w)/?$ {
    add_header Content-Type text/plain;

    postgres_pass   database;
    postgres_output value;

    postgres_escape $ip $remote_addr;
    postgres_escape $agent =$http_user_agent;
    postgres_escape $body =$request_body;

    postgres_query "
        WITH t AS (
            UPDATE api_check
            SET last_ping=now()
            WHERE code='$1'
            RETURNING id, last_ping
        )
        INSERT INTO api_ping
            (created, remote_addr, method, ua, body, owner_id, scheme)
        SELECT
            last_ping, $ip, '$request_method', $agent, $body, id, '$scheme'
        FROM t
        RETURNING 'OK'
    ";

    postgres_rewrite no_changes 400;
}

Here’s what’s going on: when the client requests a URL of a certain format, the server runs a PostgreSQL query and returns either HTTP code 200 or HTTP code 400. This is also a performance win, because the request doesn’t have to travel through the hoops of gunicorn, Django and psycopg2. As long as the database is available, nginx can handle the ping requests, even if the Django application is not running for any reason.
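To make the “one update and one insert” logic easier to follow, here is a minimal Python sketch of the same flow. It is an illustration only: it uses the standard library’s sqlite3 module as a stand-in for PostgreSQL, the `handle_ping` function is a made-up name, and the two statements run separately inside one transaction instead of as a single query. The table and column names are the ones from the nginx fragment. The regular expression matches the same URLs as the nginx location, written with repetition counts.

```python
import re
import sqlite3

# Same pattern as the nginx location above, in a more compact form
PING_URL_RE = re.compile(r"^/(\w{8}-\w{4}-\w{4}-\w{4}-\w{12})/?$")


def handle_ping(db, path, remote_addr, method, ua, body="", scheme="http"):
    """Return 200 if the ping was recorded, 400 otherwise."""
    match = PING_URL_RE.match(path)
    if match is None:
        # Simplification: nginx would pass such a URL to another location
        return 400
    code = match.group(1)
    with db:  # both statements run in a single transaction
        cur = db.execute(
            "UPDATE api_check SET last_ping = CURRENT_TIMESTAMP "
            "WHERE code = ?", (code,))
        if cur.rowcount == 0:
            # No check with this code: same outcome as "no_changes 400"
            return 400
        db.execute(
            "INSERT INTO api_ping "
            "(created, remote_addr, method, ua, body, owner_id, scheme) "
            "SELECT last_ping, ?, ?, ?, ?, id, ? "
            "FROM api_check WHERE code = ?",
            (remote_addr, method, ua, body, scheme, code))
    return 200
```

The function expects `db` to be a sqlite3 connection with `api_check` and `api_ping` tables already in place. Using bound parameters here plays the role that `postgres_escape` plays in the nginx configuration.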

The not so great thing with this approach is that it’s “tricky” and adds to the number of things that the developer and systems administrator need to know. For example, when the database schema changes, the SQL query above might need to be updated and tested as well. Getting the ngx_postgres extension set up isn’t a simple matter of “apt-get install” either.

Thinking more about it, the main goal of zero downtime can also be achieved by just carefully orchestrating process restarts and reloads.

My deployment script was using “/etc/init.d/nginx restart” because I didn’t know any better. As I learned, it can be replaced with “/etc/init.d/nginx reload” which handles things gracefully:

Run service nginx reload or /etc/init.d/nginx reload

It will do a hot reload of the configuration without downtime. If you have pending requests, then there will be lingering nginx processes that will handle those connections before it dies, so it’s an extremely graceful way to reload configs.

– “Nginx config reload without downtime” on ServerFault

Similarly, my deployment script was using “supervisorctl reload” which stops all managed services, re-reads configuration, and starts all services. Instead, “supervisorctl update” can be used to start, stop and restart only the changed tasks as necessary.

Now, here’s what “fab deploy” can do:

  • set up a new virtual environment as before
  • create a supervisor task with unique name (“hc_timestamp”)
  • start the new gunicorn process alongside the running one. nginx talks to gunicorn processes using UNIX sockets, and each process uses a separate, again timestamped, socket file
  • wait a little, then verify that the new gunicorn process has started up and is serving responses
  • update nginx configuration to point to the new socket file and reload nginx
  • stop the old gunicorn process

Here’s the improved part of the Fabric script which juggles supervisor jobs:

def switch(tag, project_dir):
    # Supervisor
    supervisor_conf_path = "/etc/supervisor/conf.d/hc_%s.conf" % tag
    upload_template("supervisor/hc.conf.tmpl",
                    supervisor_conf_path,
                    context=locals(),
                    backup=False,
                    use_sudo=True)

    upload_template("supervisor/hc_sendalerts.conf.tmpl",
                    "/etc/supervisor/conf.d/hc_sendalerts.conf",
                    context=locals(),
                    backup=False,
                    use_sudo=True)

    # Starts up gunicorn from the new virtualenv
    sudo("supervisorctl update")

    # Give it some time to start up
    time.sleep(5)

    # Let's check the new server is nominally working
    # gunicorn listens on UNIX socket so this is a bit contrived:
    l = ("GET /about/ HTTP/1.0\\r\\n"
         "Host: healthchecks.io\\r\\n"
         "\\r\\n")

    cmd = 'echo -e "%s" | nc -U /tmp/hc-%s.sock' % (l, tag)
    # Look for known string in response. If it's not found, something
    # is wrong with the new deployment and we abort
    assert "Monkey See Monkey Do" in run(cmd, quiet=True)

    # nginx
    upload_template("nginx/hc.conf.tmpl",
                    "/etc/nginx/sites-enabled/hc.conf",
                    context=locals(),
                    backup=False,
                    use_sudo=True)

    sudo("/etc/init.d/nginx reload")

    # should be live now - remove supervisor conf for previous versions
    s = sudo("for i in /etc/supervisor/conf.d/*.conf; do echo $i; done")
    for line in s.split("\n"):
        line = line.strip()
        if line == supervisor_conf_path:
            continue
        if line.startswith("/etc/supervisor/conf.d/hc_2"):
            sudo("rm %s" % line)

    # This stops gunicorn processes
    sudo("supervisorctl update")

With this, nginx is always serving requests, and is talking to a live gunicorn process at all times. To verify this in practice, I wrote a quick script that requests a particular URL again and again in an infinite loop. As soon as it hits a non-200 response, it prints out a hard-to-miss error message. With this banging against my test VM, I did a couple of deploys and saw no missed requests. Success!
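The original test script is not shown in this post, but a script of that kind could look roughly like the sketch below. The function names, the 5 second timeout and the decision to stop at the first bad response are all assumptions made for the example. The `fetch` and `max_requests` parameters are there so the loop can be tried out without a real server.

```python
import urllib.error
import urllib.request


def fetch_status(url):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as e:
        return e.code      # the server answered, but with an error status
    except OSError:
        return None        # connection refused, timeout and similar


def hammer(url, fetch=fetch_status, max_requests=None):
    """Request url in a loop; report and stop at the first non-200 response."""
    count = 0
    while max_requests is None or count < max_requests:
        status = fetch(url)
        count += 1
        if status != 200:
            print("!!!!! request #%d got %r instead of 200 !!!!!" % (count, status))
            return count
    return None
```

Running `hammer("http://your-test-vm/about/")` (a hypothetical address) in one terminal while deploying from another would show whether any request fails during the switch-over.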

Summary

There are many ways to achieve zero downtime during code deploys, and each has its own trade-offs. For example, a reasonable strategy is to extract the critical parts out of the bigger application. Each part can then be updated independently. Later, the parts can also be scaled independently. The downside to this is more code and configuration to maintain.

What I ultimately ended up doing:

  • hot-reload supervisor and nginx configurations instead of just restarting them. Obvious thing to do in retrospect.
  • make sure the new gunicorn process is alive and being used by nginx before stopping the old gunicorn process.
  • and keep the whole setup relatively simple. As the project gets more usage, I will need to look at performance hotspots and figure out how to scale horizontally, but this should do for now!

How to Integrate healthchecks.io with PagerDuty

PagerDuty is a well-known incident management system. It provides alerting, on-call scheduling, escalation policies and incident tracking. If you use or plan on using PagerDuty, you can integrate it with your healthchecks.io account in a few simple steps!

Step 1: Add a new service to your PagerDuty account

Log into your PagerDuty account, go to Configuration > Services, and click on “Add New Service”:

Adding service to PagerDuty account

Give it a descriptive name, and select “Use our API directly”. Click on “Add Service” and take note of its API key:

List of services on PagerDuty

Step 2: Add a notification channel to your Healthchecks account

Log into your Healthchecks account, and go to Channels. In the “Add Notification Channel” section, select “PagerDuty” and paste the API key from the previous step:

Adding a Notification Channel in healthchecks.io

… and done! From now on, when a check goes down, Healthchecks will open a new incident in your PagerDuty account. When the check goes back up again, Healthchecks will resolve the incident. Simple and easy!
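For the curious, the open/resolve behaviour can be pictured as building one of two event messages per status change. The sketch below is an assumption about how such an integration might look, not a description of the actual Healthchecks code: the function name is made up, and the field names (`service_key`, `incident_key`, `event_type`, `description`) follow PagerDuty’s generic events API as far as I know, so check PagerDuty’s documentation before relying on them.

```python
import json


def pagerduty_event(service_key, check_code, check_name, status):
    """Build the JSON body for a PagerDuty event.

    "down" maps to a "trigger" event (opens an incident),
    "up" maps to a "resolve" event (closes it).
    """
    if status not in ("up", "down"):
        raise ValueError("status must be 'up' or 'down'")
    payload = {
        "service_key": service_key,   # the API key noted in Step 1
        "incident_key": check_code,   # same key ties trigger and resolve together
        "event_type": "trigger" if status == "down" else "resolve",
        "description": "%s is %s" % (check_name, status.upper()),
    }
    return json.dumps(payload)
```

Using the check’s code as the incident key is what would let a later “resolve” event close the incident that the earlier “trigger” event opened.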

Intro


I needed a tool to alert me when my cron jobs silently fail. There are already a number of existing services for this, but it seemed like a fun thing to build myself. So I present to you: healthchecks.io

I am using this myself and it has already been useful for me a couple of times. Say, a seemingly benign code change in one service causes my batch job to fail 12 hours later, in the middle of the night. Without any monitoring I might be blissfully unaware for days or months, until I need those backups or whatever, but now I get an email alert and can get it sorted in minutes. Sweet!

I licensed this under the BSD licence, hoping it might be useful for other people too. It’s such a simple service it feels wrong to charge big bucks for it. You can grab the code from GitHub, run it, extend it, add unicorns or raptors, and so on. Or you can use the hosted service which is free. I cannot make guarantees that I’ll keep the hosted service around for ten years, though. The running costs for me currently are: $5/mo DigitalOcean box, two domain names and SSL certificates, a bit of space on S3 for daily backups, and my own time for maintaining this.

On the implementation side, it’s a pretty straightforward Django app with nothing particularly clever going on, which is a good thing. It does make use of a database trigger (which works with both PostgreSQL and MySQL), and it has neat JS horizontal slider widgets for setting duration parameters.

Now, about future plans. After about two months of spare time hacking, I feel healthchecks.io is well into MVP stage. It is already useful for me as-is. There’s a list of features I’m considering, but I also want to keep the code base simple, with few dependencies, and easy to deploy. I may add bits for reliability, integrations with other services, but probably no big new features like active checks (think Pingdom), status pages (statuspage.io), installable monitoring agents (NewRelic and many others), and so on. Keep it simple!