Why Some Monthly Reports Were Empty (Or: Isn’t Coding Fun?)

If you have received a monthly report from healthchecks.io in the past few days, it might have looked like this:

Something is missing here…

The report has a header and a footer, but the actual content of the report is simply missing! For reference, this is how a monthly report is supposed to look:

This is a more useful monthly report.

So, what happened? I took me a while, but I think I have figured it out.

Monthly emails have both HTML and text versions. I noticed early on that both the HTML and the text versions are missing their content sections. The text version is shorter and simpler, so we will look at that. Here it is, simplified some more and reformatted for brevity:

<!-- Template emails/report-body-text.html -->

Hello,
This is a monthly report sent by healthchecks.io.

{% include "emails/summary-text.html %}

Cheers,
The healthchecks.io Team


<!-- Template emails/summary-text.html -->

{% load humanize %}
 Status                | Name             | Last Ping
-----------------------+------------------+------------------------------------
{% for check in checks %}
{{ check.get_status }} | {{ check.name }} | {{ check.last_ping|naturaltime }}
{% endfor %}

The child template prints out a simple ASCII-art table. I keep it separate from the parent template because this way I can reuse it in “Check XYZ is DOWN” notifications too. In the incorrect monthly reports, the child template’s content seems to be completely missing. The reports didn’t even have the “Status | Name | Last Ping” header in them. This was very puzzling to me until I learned how Django templating system handles exceptions in child templates when settings.DEBUG is set to False:

class IncludeNode(Node):
    context_key = '__include_context'

    # ... constructor etc.  ...

    def render(self, context):
        try:
            # ... template gets rendered here ...
        except Exception as e:
            if context.template.engine.debug:
                raise
            template_name = getattr(context, 'template_name', None) or 'unknown'
            warnings.warn(
                "Rendering {%% include '%s' %%} raised %s. In Django 2.1, "
                "this exception will be raised rather than silenced and "
                "rendered as an empty string." %
                (template_name, e.__class__.__name__),
                RemovedInDjango21Warning,
            )
            logger.warning(
                "Exception raised while rendering {%% include %%} for "
                "template '%s'. Empty string rendered instead.",
                template_name,
                exc_info=True,
            )
        return ''

Apparently, something in the child template was throwing an exception and causing it to be completely omitted. What could it be? A dead database connection? An SQL error? An exception in Check.get_status() method? An exception in the “naturaltime” filter?

Looking For Patterns

All emails from healthchecks.io are sent through Amazon SES. SES sends me a delivery failure notification, when there is, well, a delivery failure. Conveniently, these SES notifications also contain the original message. I have dug through the recent delivery failure notifications and found a few clues:

  • There were quite a few empty reports, but there were also many reports that looked fine
  • healthchecks.io is currently run from three app servers. Cross-checking with server logs, I found that one specific app server sent all the problematic reports. All of the good reports were sent from the other two.

The problematic app server has a special role in healthchecks.io infrastructure. With not being in the load balancer’s rotation, it is not serving web traffic. Instead, I use it for testing code changes against the production database. Let’s call this server “Canary”. The update procedure for Canary is different from that of the other two app servers. The normal update procedure is as follows:

  • Put the server in maintenance mode. It then starts signalling “I’m down” to the load balancer
  • Stop “sendalerts” and “sendreports” background tasks
  • Wait 60 seconds for the load balancer to redirect traffic from this server
  • Check out a fresh copy of the Django code, install dependencies, prepare CSS and JS bundles, copy static files, restart uwsgi, reload nginx configuration
  • Wait 10 seconds for restarts to complete
  • Take the server out of maintenance mode. Load balancer detects the server as being “up” again
  • Start “sendalerts” and “sendreports” background tasks
  • Sleep for 60 seconds so the load balancer can start sending traffic again. This step is not strictly needed. It makes sure that I wait enough time before updating the next server

Canary is not serving regular web traffic, so the update can be simple and quick:

  • Check out a fresh copy of the Django code, install dependencies, prepare CSS and JS bundles, copy static files, restart uwsgi, reload nginx configuration

Note that this shorter version does not stop or start the “sendalerts” or “sendreports” tasks.

So, here is what happened. On 18 March, I committed and deployed a backwards-incompatible database schema change, which removes the “last_ping_body” field from the Check model. I updated all three app servers to reflect this change. But the “sendreports” task on Canary did not get restarted and so kept running the previous version of the code. Whenever Canary was sending a monthly report, it would render its template. From the template, it would run an SQL query that references the now-missing “last_ping_body” field. The Django ORM would throw an exception, which the Django templating system would swallow, rendering an empty string (“”) instead. And, just like that, the end user would receive an empty report.

There were three app servers all running the “sendreports” background command. They would each wake up once a minute and check if any reports are due to be sent. One of the three was sending empty reports so, from 18 March until 22 March, about 33% of the sent monthly reports were empty.

Conclusion

I am taking several steps to make sure that a similar problem does not happen again in the future. Most importantly, Canary was not supposed to be continually running the “sendalerts” and “sendreports” tasks. I have updated the deployment scripts so that these tasks always get stopped before updating the Django app. And, for Canary, they have to be started manually.

When serious errors like bad SQL queries happen, they should be visible and loud. I will look into configuring the Django app so that it does not silently ignore exceptions from included templates. I have already made a few small changes to move database operations out of the template rendering stage. I will also be making sure that the “sendalerts” and “sendreports” tasks deliver Sentry notifications when they crash.

Apologies for the empty reports, they should not happen again!

Happy monitoring,
— Pēteris Caune, healthchecks.io