Deploying a Django App with No Downtime

When healthchecks.io started to receive more than 1 request per second, it became clear I could not just go on carelessly restarting web servers after code deploys. For a monitoring service, it would be bad form to miss even a few HTTP requests. And, going forward, if the server gets busier, the problem only becomes bigger.

To give a quick overview of what I’m working with, the app is a relatively straightforward Django app, served by gunicorn behind nginx. Data lives in a PostgreSQL database. The gunicorn process and an additional background job are both managed by supervisor. It’s hosted on a single $20 DigitalOcean droplet.

Aside: With regard to technology choices, the guiding principle I’ve been following is to keep the stack as simple as is feasible for as long as possible. Adding things, like load balancers, database replication, key value store, message queue and so on, would each have certain benefits. Then on the other hand, there would also be more stuff to be managed, monitored, and kept backed up. Also, for someone new to the project, it would take more time to figure out the “ins and outs” of the system and set up everything from scratch. I see it as a nifty challenge to stay with the simple, no-frills setup, while also not compromising performance or features.

The deployment mechanism I’ve used thus far is a Fabric script plus configuration templates for supervisor and nginx. Each time I run “fab deploy” from my workstation, the Fabric script does the following on the remote host:

sets up a new directory for the new deployment. Let’s refer to this directory as $TARGET.
sets up a python3 virtualenv in $TARGET/venv
fetches the latest snapshot of code from GitHub into $TARGET. It is convenient to use GitHub’s Subversion interface for this and run a “svn export” command. It produces just the source files without any version control metadata–exactly what’s needed.
installs dependencies listed in requirements file. These get installed into the new virtualenv and don’t affect the live application. Downloading and building the dependencies take up to a minute.
runs Django management commands to collect static files, run database migrations etc.
rewrites the supervisor configuration file to run gunicorn from the new virtual environment
updates nginx configuration, in case I’ve changed anything in the nginx configuration template
runs “supervisorctl reload” and “/etc/init.d/nginx restart”. At this point the web application becomes unavailable and remains unavailable until supervisor starts back up, launches gunicorn process, and the Django code initializes. This usually takes 5 to 10 seconds, and nginx would typically return “502 Bad Gateway” responses during this time.
All done!

Here’s how the relevant part of the Fabric script looks. The virtualenv context manager seen below is from the excellent fabtools library.

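# Imports assumed for Fabric 1.x and fabtools
import datetime
import os

from fabric.api import cd, put, run, settings, sudo
from fabric.contrib.files import upload_template
from fabtools.python import virtualenv


def deploy():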
    """ Checks out code, prepares venv, runs management commands,
    updates supervisor and nginx configuration. """

    now = datetime.datetime.today()
    now_string = now.strftime("%Y%m%d-%H%M%S")
    project_dir = "/home/hc/webapps/hc-%s" % now_string
    venv_dir = os.path.join(project_dir, "venv")

    svn_url = "https://github.com/healthchecks/healthchecks/trunk"
    run("svn export %s %s" % (svn_url, project_dir))

    with cd(project_dir):
        run("virtualenv --python=python3 --system-site-packages venv")
        # local_settings.py is where things like access keys go
        put("local_settings.py", ".")
        put("newrelic.ini", ".")

        with virtualenv(venv_dir):
            run("pip install -U gunicorn raven newrelic")
            run("pip install -r requirements.txt")
            run("python manage.py collectstatic --noinput")
            run("python manage.py compress")

            with settings(user="hc"):
                run("python manage.py migrate")
                run("python manage.py ensuretriggers")
                run("python manage.py clearsessions")

    switch(project_dir)

def switch(project_dir):
    # Supervisor
    upload_template("supervisor/hc.conf.tmpl",
                    "/etc/supervisor/conf.d/hc.conf",
                    context=locals(),
                    backup=False,
                    use_sudo=True)

    upload_template("supervisor/hc_sendalerts.conf.tmpl",
                    "/etc/supervisor/conf.d/hc_sendalerts.conf",
                    context=locals(),
                    backup=False,
                    use_sudo=True)

    # Nginx
    upload_template("nginx/hc.conf.tmpl",
                    "/etc/nginx/sites-enabled/hc.conf",
                    context=locals(),
                    backup=False,
                    use_sudo=True)

    sudo("supervisorctl reload")
    sudo("/etc/init.d/nginx restart")

Now, how to eliminate the downtime during the last steps of each deploy? Let’s set some constraints: no load balancer (for now anyway). Everything runs off a single box, and even a single non-200 response is undesirable. And, baby steps: I will consider the simple (and common) case when there are no database migrations to be applied or they are backwards-compatible: the old version of the app keeps working acceptably after the migrations are applied.

The first idea I looked into was based on the observation that availability is more important for some parts of the app than others. Specifically, the API part of the app listens for pings from the monitored client systems, and the frontend part serves pages to normal website visitors. While it would be embarrassing to show error pages to human visitors, not missing any pings is actually more important. A missed ping can lead to a false alert being sent sometime later. That’s even more embarrassing!

I considered and prototyped listening to pings using Amazon API Gateway. It would put ping messages in an Amazon SQS queue, which the Django app could consume at its leisure. This would be a relatively simple way to improve availability and scalability by quite a lot at the cost of somewhat increased complexity and a new external dependency. I might look into this again in the future.

Next idea: separate the “listen to pings” functionality from the rest of the Django app. The ping listener logic is very simple and, ultimately, amounts to two SQL operations: one update and one insert. It could be easy enough to rewrite this part, perhaps using one of the python microframeworks, or maybe using a language other than Python, or maybe even handle it from nginx itself, using ngx_postgres module. For a little amusement, here’s the nginx configuration fragment which, basically, works as-is (please forgive the funny looking regular expression):

location ~ ^/(\w\w\w\w\w\w\w\w-\w\w\w\w-\w\w\w\w-\w\w\w\w-\w\w\w\w\w\w\w\w\w\w\w\w)/?$ {
    add_header Content-Type text/plain;

    postgres_pass   database;
    postgres_output value;

    postgres_escape $ip $remote_addr;
    postgres_escape $agent =$http_user_agent;
    postgres_escape $body =$request_body;

    postgres_query "
        WITH t AS (
            UPDATE api_check
            SET last_ping=now()
            WHERE code='$1'
            RETURNING id, last_ping
        )
        INSERT INTO api_ping
            (created, remote_addr, method, ua, body, owner_id, scheme)
        SELECT
            last_ping, $ip, '$request_method', $agent, $body, id, '$scheme'
        FROM t
        RETURNING 'OK'
    ";

    postgres_rewrite no_changes 400;
}

Here’s what’s going on: when the client requests a URL of a certain format, the server runs a PostgreSQL query and returns either HTTP code 200 or HTTP code 400. This is also a performance win, because the request doesn’t have to travel through the hoops of gunicorn, Django and psycopg2. As long as the database is available, nginx can handle the ping requests, even if the Django application is not running for any reason.
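
For comparison, here is a minimal Python sketch of the same "one update, one insert" logic. It is an illustration only: it uses the standard library's sqlite3 as a stand-in so it runs without a PostgreSQL server, the table and column names are taken from the query above, and the function names (make_demo_db, record_ping) are hypothetical, not part of the healthchecks codebase.

```python
import sqlite3
from datetime import datetime, timezone


def make_demo_db():
    """Create an in-memory database with just the columns the query touches."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE api_check (
            id INTEGER PRIMARY KEY, code TEXT UNIQUE, last_ping TEXT);
        CREATE TABLE api_ping (
            id INTEGER PRIMARY KEY, created TEXT, remote_addr TEXT,
            method TEXT, ua TEXT, body TEXT, owner_id INTEGER, scheme TEXT);
    """)
    return conn


def record_ping(conn, code, remote_addr, method, ua, body, scheme):
    """Return 200 if the check exists and the ping was stored, else 400."""
    now = datetime.now(timezone.utc).isoformat()
    with conn:  # run both statements in a single transaction
        cur = conn.execute(
            "UPDATE api_check SET last_ping = ? WHERE code = ?", (now, code))
        if cur.rowcount == 0:
            # Unknown code: mirrors "postgres_rewrite no_changes 400"
            return 400
        (check_id,) = conn.execute(
            "SELECT id FROM api_check WHERE code = ?", (code,)).fetchone()
        conn.execute(
            "INSERT INTO api_ping "
            "(created, remote_addr, method, ua, body, owner_id, scheme) "
            "VALUES (?, ?, ?, ?, ?, ?, ?)",
            (now, remote_addr, method, ua, body, check_id, scheme))
    return 200
```

Unlike the nginx version, which does everything in one statement thanks to PostgreSQL's WITH ... RETURNING, this sketch needs an extra SELECT to find the check's id.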

The not so great thing with this approach is that it’s “tricky” and adds to the number of things that the developer and systems administrator need to know. For example, when the database schema changes, the SQL query above might need to be updated and tested as well. Getting the ngx_postgres extension set up isn’t a simple matter of “apt-get install” either.

Thinking more about it, the main goal of zero downtime can also be achieved by just carefully orchestrating process restarts and reloads.

My deployment script was using “/etc/init.d/nginx restart” because I didn’t know any better. As I learned, it can be replaced with “/etc/init.d/nginx reload” which handles things gracefully:

Run service nginx reload or /etc/init.d/nginx reload
It will do a hot reload of the configuration without downtime. If you have pending requests, then there will be lingering nginx processes that will handle those connections before it dies, so it’s an extremely graceful way to reload configs.
– “Nginx config reload without downtime” on ServerFault

Similarly, my deployment script was using “supervisorctl reload”, which stops all managed services, re-reads configuration, and starts all services. Instead, “supervisorctl update” can be used to start, stop and restart only the changed tasks as necessary.
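
To make the difference concrete, here is a simplified model of what an update-style reload has to decide. This is not supervisor's actual code, just a sketch of the idea: compare the running configuration with the one on disk and touch only what differs. The function name plan_update, the program names and the command strings are all made up for illustration.

```python
def plan_update(running, desired):
    """Compare two {program_name: config} mappings and return which
    programs would need to be started, stopped or restarted."""
    running_names = set(running)
    desired_names = set(desired)
    return {
        "start": sorted(desired_names - running_names),
        "stop": sorted(running_names - desired_names),
        "restart": sorted(
            name for name in running_names & desired_names
            if running[name] != desired[name]),
    }


# Hypothetical example: a new timestamped gunicorn task is added next to
# the old one, and the background job now points at the new directory.
running = {
    "hc_20160101-120000": "gunicorn --bind unix:/tmp/hc-20160101-120000.sock",
    "hc_sendalerts": "/home/hc/webapps/hc-20160101-120000/venv/bin/python manage.py sendalerts",
}
desired = {
    "hc_20160101-120000": "gunicorn --bind unix:/tmp/hc-20160101-120000.sock",
    "hc_20160102-090000": "gunicorn --bind unix:/tmp/hc-20160102-090000.sock",
    "hc_sendalerts": "/home/hc/webapps/hc-20160102-090000/venv/bin/python manage.py sendalerts",
}
plan = plan_update(running, desired)
```

In this model the old gunicorn task is left alone, which is exactly the property the zero-downtime deploy relies on.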

Now, here’s what “fab deploy” can do:

set up a new virtual environment as before
create a supervisor task with unique name (“hc_timestamp”)
start the new gunicorn process alongside the running one. nginx talks to gunicorn processes using UNIX sockets, and each process uses a separate, again timestamped, socket file
wait a little–then verify that the new gunicorn process has started up and is serving responses
update nginx configuration to point to the new socket file and reload nginx
stop the old gunicorn process
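
The "wait a little, then verify" step could also be done by polling rather than sleeping for a fixed time. Below is a rough sketch of that idea, assuming the check runs on the same host as gunicorn (the Fabric script instead shells out to nc on the remote host). The function names probe_unix_socket and wait_until_serving, and the default attempts and delays, are illustrative assumptions.

```python
import socket
import time


def probe_unix_socket(sock_path, host, path="/about/", timeout=2.0):
    """Send a bare HTTP/1.0 GET over a UNIX socket, return the raw response."""
    request = "GET %s HTTP/1.0\r\nHost: %s\r\n\r\n" % (path, host)
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        s.connect(sock_path)
        s.sendall(request.encode("ascii"))
        chunks = []
        while True:
            data = s.recv(4096)
            if not data:  # server closed the connection: response complete
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def wait_until_serving(sock_path, host, needle, attempts=10, delay=1.0):
    """Poll the socket until a response contains needle; False if it never does."""
    for _ in range(attempts):
        try:
            if needle in probe_unix_socket(sock_path, host):
                return True
        except OSError:
            pass  # socket file missing, connection refused, or timed out
        time.sleep(delay)
    return False
```

A deploy would abort when wait_until_serving returns False, leaving the old gunicorn process and the nginx configuration untouched.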

Here’s the improved part of the Fabric script which juggles supervisor jobs:

def switch(tag, project_dir):
    # Supervisor
    supervisor_conf_path = "/etc/supervisor/conf.d/hc_%s.conf" % tag
    upload_template("supervisor/hc.conf.tmpl",
                    supervisor_conf_path,
                    context=locals(),
                    backup=False,
                    use_sudo=True)

    upload_template("supervisor/hc_sendalerts.conf.tmpl",
                    "/etc/supervisor/conf.d/hc_sendalerts.conf",
                    context=locals(),
                    backup=False,
                    use_sudo=True)

    # Starts up gunicorn from the new virtualenv
    sudo("supervisorctl update")

    # Give it some time to start up
    # (time.sleep needs "import time" at the top of the fabfile)
    time.sleep(5)

    # Let's check the new server is nominally working
    # gunicorn listens on UNIX socket so this is a bit contrived:
    request = ("GET /about/ HTTP/1.0\\r\\n"
               "Host: healthchecks.io\\r\\n"
               "\\r\\n")

    cmd = 'echo -e "%s" | nc -U /tmp/hc-%s.sock' % (request, tag)
    # Look for known string in response. If it's not found, something
    # is wrong with the new deployment and we abort
    assert "Monkey See Monkey Do" in run(cmd, quiet=True)

    # nginx
    upload_template("nginx/hc.conf.tmpl",
                    "/etc/nginx/sites-enabled/hc.conf",
                    context=locals(),
                    backup=False,
                    use_sudo=True)

    sudo("/etc/init.d/nginx reload")

    # should be live now - remove supervisor conf for previous versions
    s = sudo("for i in /etc/supervisor/conf.d/*.conf; do echo $i; done")
    for line in s.split("\n"):
        line = line.strip()
        if line == supervisor_conf_path:
            continue
        if line.startswith("/etc/supervisor/conf.d/hc_2"):
            sudo("rm %s" % line)

    # This stops gunicorn processes
    sudo("supervisorctl update")
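
The cleanup loop near the end of switch() is the easiest part to get wrong, since it deletes files. One way to make it testable without a server is to pull the selection logic into a small pure function. This is a sketch only; stale_confs is a hypothetical helper name, and the default prefix is the one used in the script above.

```python
def stale_confs(listing, keep, prefix="/etc/supervisor/conf.d/hc_2"):
    """Given the output of the remote file listing, return the timestamped
    supervisor config files that belong to previous deployments."""
    paths = [line.strip() for line in listing.splitlines()]
    return [p for p in paths if p.startswith(prefix) and p != keep]


# Example listing, as it might come back from the remote host
listing = (
    "/etc/supervisor/conf.d/hc_20160101-120000.conf\r\n"
    "/etc/supervisor/conf.d/hc_20160102-090000.conf\r\n"
    "/etc/supervisor/conf.d/hc_sendalerts.conf\r\n"
)
to_remove = stale_confs(
    listing, keep="/etc/supervisor/conf.d/hc_20160102-090000.conf")
```

Because the prefix ends in "hc_2", the hc_sendalerts.conf file is never selected, and the freshly uploaded config is skipped explicitly.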

With this, nginx is always serving requests, and is talking to a live gunicorn process at all times. To verify this in practice, I wrote a quick script that requests a particular URL again and again in an infinite loop. As soon as it hits a non-200 response, it prints out a hard-to-miss error message. With this banging against my test VM, I did a couple of deploys and saw no missed requests. Success!
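
A checker along those lines might look like the sketch below. It is not the script I actually used, just a minimal reconstruction using the standard library; the function names, the URL in the usage note and the max_requests parameter (handy for testing, None means loop forever) are assumptions.

```python
import time
import urllib.error
import urllib.request


def fetch_status(url, timeout=5):
    """Return the HTTP status code for url, or None if the request failed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as e:
        return e.code  # e.g. 502 while the backend is down
    except OSError:
        return None  # connection refused, timeout, DNS failure, ...


def hammer(url, fetch=fetch_status, max_requests=None, delay=0.0):
    """Request url repeatedly and return the number of non-200 responses."""
    failures = 0
    sent = 0
    while max_requests is None or sent < max_requests:
        status = fetch(url)
        sent += 1
        if status != 200:
            failures += 1
            print("!!!!! MISSED REQUEST #%d: status=%s !!!!!" % (sent, status))
        if delay:
            time.sleep(delay)
    return failures
```

Calling hammer("http://my-test-vm/about/") in one terminal and running “fab deploy” in another should print nothing if the deploy is truly seamless (the hostname here is a placeholder).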

Summary

There are many ways to achieve zero downtime during code deploys, and each has its own trade-offs. For example, a reasonable strategy is to extract the critical parts out of the bigger application. Each part can then be updated independently. Later, the parts can also be scaled independently. The downside to this is more code and configuration to maintain.

What I ultimately ended up doing:

hot-reload supervisor and nginx configurations instead of just restarting them. Obvious thing to do in retrospect.
make sure the new gunicorn process is alive and being used by nginx before stopping the old gunicorn process.
and keep the whole setup relatively simple. As the project gets more usage, I will need to look at performance hotspots and figure out how to scale horizontally, but this should do for now!