Pēteris Caune

Notice: RU IP Block Starting January 1, 2023

Starting from January 1, 2023, our ping endpoints (hc-ping.com, hchk.io) will reject requests from Russian Federation IP addresses. The web dashboard at https://healthchecks.io will remain accessible and functional. We’ve sent email notifications to the affected user accounts.

The rejected ping requests will receive HTTP 403 responses. We will be using the GeoIP2 database to geo-locate IP addresses. To test a particular IP address, you can use the GeoIP2 demo tool. If your address gets geolocated incorrectly, please let us know.

Using Run IDs to Track Run Times Of Overlapping Jobs

Healthchecks.io recently got a new feature: run IDs.

RID=`uuidgen`
# Send a "start" signal, specify ?rid query parameter
curl https://hc-ping.com/your-uuid-here/start?rid=$RID
# ... do some work here ...
# Send a "success" signal, specify the same ?rid value
curl https://hc-ping.com/your-uuid-here?rid=$RID

Run IDs are client-chosen UUID values that the client can optionally add as a “rid” query parameter to any ping URL (success, /start, /fail, /log, /{exitcode}).
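Here is the same pattern as a rough Python sketch (the check UUID is a placeholder, and do_some_work is a stand-in for your actual job):

import uuid

import requests

PING_URL = "https://hc-ping.com/your-uuid-here"
rid = str(uuid.uuid4())  # client-chosen run ID, one per run

requests.get(PING_URL + "/start", params={"rid": rid})
try:
    do_some_work()  # hypothetical work function
    requests.get(PING_URL, params={"rid": rid})
except Exception:
    # Report a failure for this particular run, using the same run ID
    requests.get(PING_URL + "/fail", params={"rid": rid})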

What are run IDs useful for? Healthchecks.io uses them to group events from a single “run”, and calculate correct run durations. This is most important in cases where multiple instances of the same job can run simultaneously, and partially or fully overlap. Consider the following sequence of events:

  1. 12:00 start
  2. 12:01 start
  3. 12:05 success
  4. 12:06 success

Without run IDs, we cannot tell if the second success event corresponds to the first or the second start event. But with run IDs we can:

  1. 12:00 start, rid=05f4aa48…
  2. 12:01 start, rid=7671e111…
  3. 12:05 success, rid=05f4aa48…
  4. 12:06 success, rid=7671e111…

The usage of run IDs is completely optional. You don’t need them if your jobs always run sequentially. If you do use run IDs, make sure that:

  • the rid values are valid UUIDs in the canonical text form (with no curly braces, with no uppercase letters)
  • you use the same rid value for matching start and success/fail events

Healthchecks.io will show the run IDs in a shortened form in the “Events” section:

In the above image, note how the execution times are available for both “success” events. If the run IDs were not used, event #4 would not show an execution time since it is not directly preceded by a “start” event.

Alerting Logic When Using Run IDs

Healthchecks.io monitors the time gaps between “start” and “success” signals: if a job sends a “start” signal, but then does not send a “success” signal within its configured grace time, Healthchecks.io will assume the job has failed and notify you. However, if multiple instances of the same job run concurrently, Healthchecks.io will only monitor the run time of the most recently started run and alert you when it exceeds the grace time. Under the hood, each check tracks a single “last started” value, which gets overwritten with every received “start” event.

To illustrate, let’s assume a grace time of 1 minute, and look at the above screenshot again. Event #4 ran for 6 minutes 39 seconds and so overshot the time budget of 1 minute. But Healthchecks.io generated no alerts because the most recently started run finished within the time limit (it took 37 seconds, which is less than 1 minute).

Happy monitoring!
–Pēteris

Schedule Cronjob for the First Monday of Every Month, the Funky Way

The crontab man page (“man 5 crontab” or read online) contains this bit:

Note: The day of a command’s execution can be specified by two fields — day of month, and day of week. If both fields are restricted (i.e., don’t start with *), the command will be run when either field matches the current time. For example, 30 4 1,15 * 5 would cause a command to be run at 4:30 am on the 1st and 15th of each month, plus every Friday.

What does it mean precisely? If you specify both the day of month and the day of week field, then cron will run the command when either of the fields matches. In other words, there’s a logical OR relationship between the two fields. Let’s look at an example:

0 0 1 * MON

This expression translates to “Run at midnight if it’s the first day of the month OR Monday”. An expression like this could be handy for a job that sends weekly and monthly reports, as it probably needs to run at the start of every week and at the start of every month. However, if either field is unrestricted, the logical relationship between the fields changes to “AND”. For example:

0 0 * * MON

Here, the day of month field is unrestricted. Cron will run this command when both the day of month field AND the day of week fields match. Since * matches any day of month, this expression effectively translates to “Run at midnight every Monday”.

So far so good! The sometimes-OR relationship between the two fields is a relatively well-known cron gotcha. But let’s look closer at what values cron considers “unrestricted”. Star (*) is of course unrestricted, but, according to the man page, any value that starts with the star character is also unrestricted. For example, */2 is unrestricted too. Can we think of any useful schedules that exploit this fact? Yes, we can:

0 0 1-7 * */7

Here, the day of month is restricted to dates 1 to 7. Cron will interpret */7 in the day of week field as “every 7 days starting from 0 (Sunday)”, so, effectively, “Every Sunday”. Since the day of week field starts with *, cron will run the command on dates 1 to 7 which are also Sunday. In other words, this will match midnight of the first Sunday of every month.

In the above example, */7 is a trick to say “every Sunday” in a way that starts with the star. Unfortunately, this trick only works for Sunday. Can we make an expression that runs on, say, the first Monday of every month? Yes, we can!

0 0 */100,1-7 * MON

The day of month field here is */100,1-7, meaning “every 100 days starting from date 1, and also on dates 1-7”. Since there are no months with 100+ days, this again is a trick to say “on dates 1 to 7” but with a leading star. Because of the star, cron will run the command on dates 1 to 7 that are also Monday.
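Put together as a complete crontab entry (the script path is just an example), the first-Monday schedule looks like this:

# At midnight on the first Monday of every month
0 0 */100,1-7 * MON /opt/scripts/monthly-report.sh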

OK, but does any of this work? Is the man page accurate? Yes: you can check the cron source here and see how it initializes the DOW_STAR and DOM_STAR flags by testing just the first character of the fields. I’ve also tested both expressions empirically by setting up dummy cron jobs and monitoring when they run. I ran them in a VM with an accelerated clock, which I’ve used for experiments before.

An important caveat before you use these tricks for scheduling your tasks: there are many systems that support cron-like syntax for scheduling tasks. It’s a safe bet not all of them implement all the quirks of the classic cron. Always check if your scheduler supports the syntax and logic you are planning to use. And always monitor if your scheduled tasks do run at the expected times (wink wink)!

Healthchecks.io Hosting, Questions and Answers

The article Healthchecks.io Hosting Setup, 2022 Edition was recently on Hacker News. I was answering questions in the comments section. Here’s a recap of some of the questions and my answers. I’ve edited some of the questions for clarity.

Q: Can you share more details about what the 4 HAProxy servers are doing?

  • The traffic from the monitored systems comes with spikes. Looking at netdata graphs, currently the baseline is 600 requests/s, but there is a 2000 requests/s spike every minute, and 4000 requests/s spike every 10 minutes.
  • Want to maintain redundancy and capacity even when a load balancer is removed from DNS rotation (due to network problems, or for upgrade).

There are spare resources on the servers, especially RAM, and I could pack more things on fewer hosts. But, with Hetzner prices, why bother? 🙂

Q: Why Braintree, not Stripe?

When I started, Stripe was not yet available in my country, Latvia (it now is).

Personally, I’ve had a good experience with Braintree. In particular, their support has been impressively good: they take time to respond, but you can tell the support agents have deep knowledge of their system, they have access to tools to troubleshoot problems, and they don’t hesitate to escalate to engineering.

Q: I’d like to hear more on your usage of SSLMate and SOPs.

SSLMate is a certificate reseller with a convenient (for me) interface: a CLI program. It’s no fun copy-pasting certificates from email attachments.

I’m using both RSA and ECDSA certificates (RSA for compatibility with old clients, ECDSA for efficiency). I’m not sure, but it looks like ECDSA certificates are not yet generally available from Let’s Encrypt.

On sops: the secrets (passwords, API keys, access tokens) are sitting in an encrypted file (“vault”). When a Fabric task needs secrets to fill in a configuration file template, it calls sops to decrypt the vault. My Yubikey starts flashing, I tap the key, the Fabric task receives the secrets and can continue.
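For reference, decrypting the vault from the command line looks roughly like this (the vault filename is made up):

$ sops --decrypt secrets/production.yaml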

Q: I would love to hear more detail how WireGuard is set up.

I use vanilla WireGuard (the wg command and the wg-quick service). I set up new hosts and update peer configuration using Fabric tasks. It may sound messy, but it works fine in practice. For example, to set up WireGuard on a new host:

  • On the new host, I run a Fabric task which generates a key pair and spits out the public key. The private key never leaves the server.
  • I paste the public key in a peer configuration template.
  • On every host that must be able to contact the new host, I run another Fabric task which updates the peer configuration from the template (wg syncconf).
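For illustration, a minimal wg-quick configuration on one of the existing hosts might look roughly like this (the keys, addresses, and interface name are made up):

# /etc/wireguard/wg0.conf
[Interface]
Address = 10.10.0.2/24
ListenPort = 51820
PrivateKey = <this host's private key, generated locally>

[Peer]
# the new host
PublicKey = <public key pasted from the new host>
AllowedIPs = 10.10.0.7/32
Endpoint = 203.0.113.7:51820

To apply updated peer configuration without restarting the interface, the usual incantation is:

$ wg syncconf wg0 <(wg-quick strip wg0)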

One thing to watch out for is services that bind to the WireGuard network interface. I had to make sure that, on reboot, they start after wg-quick.

Q: I am curious how sites like this handle scheduled tasks that have to run at high frequencies? Cron on one machine? Celery beat?

Healthchecks runs a loop of

10 send any due notifications
20 SLEEP 2
30 GOTO 10

The actual loop is of course a little more complicated, and is being run concurrently on several machines.
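In Python terms, and with a made-up function name, the loop is roughly:

import time

while True:
    send_due_notifications()  # hypothetical: find and send any notifications that are due
    time.sleep(2)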

Q: How did you go about implementing the integrations (email, Signal, Discord….)?

Started with just the email integration, and added other integration types over time, one by one. A few were contributed as GitHub PRs.

The Signal one took by far the most effort to get going. But, for ideological reasons, I really wanted to have it 🙂 Unlike most other services, Signal doesn’t have a public HTTP API for sending messages. Instead, you have to run your own local Signal client and send messages through it. Healthchecks is using signal-cli.

Q: What volume of data are you storing in PostgreSQL? Any reason not to use a hosted PostgreSQL provider?

Around 200 write tx/s as a baseline. Spikes to 2000 write tx/s at the start of every minute, and 4000 write tx/s every 10 minutes.

Not using a hosted PostgreSQL provider for several reasons:

  • Cost
  • Schrems II
  • From what I remember, both Google Cloud SQL and AWS RDS used to have mandatory maintenance windows. The fail-over was not instant, so there was some unavoidable downtime every month. This was a while ago – maybe it is different now.

Q: Is the decision not to use Patroni for HA PostgreSQL in this case, so that you don’t add more complexity?

Yes. Plus, from reading database outage postmortems, I was not comfortable making the “do we fail-over now?” decision automatic. Think about brownouts, where the primary is still up, but slow. Or it experiences intermittent packet loss.

I’ve automated the mechanics of the fail-over, but it still must be initiated manually.

Q: I’m getting the impression, the bus factor at Healthchecks.io seems to be 1. If I’d run a one man show type of business, I’d love to have some kind of plan B in case I’d be incapacitated for more than half a day.

Yes, the bus factor is 1, and it’s bugging me too. I think any realistic plan B involves expanding the team.

Q: How much does it all cost?

I don’t have a precise number, but somewhere in the €800/mo region.

Q: How do you think open-sourcing the self-hosted version of your product impacted your sales? Positively, negatively?

I can’t say definitively, but my gut feeling is: positively.

What if another operator takes the source code, and starts a competing commercial service? I’ve seen very few (I think 1 or 2) instances of somebody attempting a commercial product based on Healthchecks open-source code. I think that’s because it’s just a lot of work to run the service professionally, and then even more work to find users and get people to pay for it.

What if a potential customer decides to self-host instead? I do see a good amount of enthusiasts and companies self-hosting their private Healthchecks instance. I’m fine with that. For one thing, the self-hosting users are all potential future clients of the hosted service. They are already familiar and happy with the product; I just need to sell the “as a service” part.

We Moved Some Data to S3

When clients make HTTP POST requests to ping URLs, Healthchecks captures and stores request body data. You can use this feature to log a command’s output and have it available for inspection later:

$ cowsay hello | curl --data-binary @- https://hc-ping.com/some-uuid-here

Same thing, using runitor instead of curl:

$ runitor -uuid some-uuid-here -- cowsay hello

You can view the request body data in the web UI:

Healthchecks also captures and stores email messages, when pinging by email:

There is a limit to how much data gets stored. The limit used to be 10KB. For example, if a client sends 50KB in an HTTP POST request body, Healthchecks would store the first 10KB, and ignore the remaining 40KB. I recently bumped up the size limit to 100KB. Users can now attach 10x more log information to every HTTP POST request, and HTML-heavy email messages are now less likely to get chopped off in the middle.

In theory, the limit change could have been as simple as adding one zero to a configuration parameter, but in practice, there was a little bit more to it!

Database vs S3

Healthchecks used to store request body data in its primary and only data store, a PostgreSQL database. Bumping up the limit and throwing more data in the database would work in the short term, but would create problems in the long run. Backup sizes and processing times would grow at a quicker rate. Network I/O to the database server would also increase, and sooner become a bottleneck.

Now, how about outsourcing ping body storage to AWS S3? This would allow bumping up the size limit without ballooning the database size (yay!). On the other hand, this would add a new moving part to the system, and increase code and operational complexity (oh no!). But perhaps still worth it?

Healthchecks would be doing lots of small S3 PUT requests, and AWS S3 has per-request fees. Quick napkin math: AWS charges $0.005 per 1000 PUT requests. Let’s say we’re uploading 20 objects to S3 per second. That’s 20 * 60 * 60 * 24 * 30 = 52M PUT requests per month, or $260 added to the AWS bill. AWS also charges for bandwidth and storage. And what about Schrems II? There could be personal data in ping bodies, so we would need to encrypt them before handing them off to AWS.

Luckily there are alternate, S3-compatible object storage providers, some of them based in the EU, and some of them charge no per-request fees! Scaleway and OVH looked like two promising candidates.

Sync vs Async Uploads

OK, let’s dive into implementation decisions. When Healthchecks receives a ping, should it upload request body data to S3 right away, in the HTTP request-response cycle? Or should it stash the request body data somewhere and have a background process deal with the uploads?

The synchronous approach is simple operationally (no background processing to worry about), but the S3 upload operations can slow down the request-response cycle.

The async approach is more fiddly to set up. The background worker process can throw an exception or grow a backlog of jobs, so it needs to be monitored. On the upside, any S3 API hiccups or slowdowns would not affect the ping handler’s throughput.

Easy solution–I implemented both methods! The open-source Healthchecks project uploads ping bodies synchronously. But on the hosted service (healthchecks.io), the ping handler stores received pings on the filesystem, and a separate worker process picks them up and uploads them to S3.

Homegrown API requests vs S3 Client Library

Moving forward, how does one upload an object to an S3 bucket? I’ve used boto3 in the past, but how hard could it possibly be to send the right payload to the right API endpoint?

Well, by the time I got a small request signing experiment to work, I decided to use a library after all! I picked minio-py as the S3 client library. It is smaller and has fewer dependencies than boto3.
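For reference, uploading an object with minio-py looks roughly like this (the endpoint, credentials, and bucket name are placeholders):

import io

from minio import Minio

client = Minio("s3.some-provider.example", access_key="...", secret_key="...")

body = b"command output captured from the ping request"
client.put_object(
    "ping-bodies",  # bucket name
    "504eb741-1966-49fe-a6e7-4d3133d2b2bd/1",  # object key: <check uuid>/<ping number>
    io.BytesIO(body),
    length=len(body),
)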

Upgrade From Storing Strings to Storing Bytes

If the ping body is just a few bytes in size, does it still make sense to offload its storage to S3? Probably not. There should be some threshold value (say, 100 bytes), below which ping bodies still get stored in the database.

Any data that we put into or retrieve from object storage, we will treat as binary. But the “body” field in the Healthchecks database has historically been a text field, only appropriate for storing Unicode strings.

To avoid the inconsistency of storing short ping bodies as Unicode strings, and longer ping bodies as binary data, I added a new “body_raw” binary field in the database and updated the application code to use it by default.
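Sketched as a Django model change, and simplifying heavily, the idea is roughly this (the actual Healthchecks model has more fields):

from django.db import models

class Ping(models.Model):
    # ... other fields ...
    body = models.TextField(blank=True, null=True)  # legacy: short bodies stored as text
    body_raw = models.BinaryField(null=True)        # new: bodies stored as bytes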

Object Key Naming Scheme

What naming scheme to use for keys in the S3 bucket? The most straightforward naming scheme would be /<uuid>/<n>:

  • /504eb741-1966-49fe-a6e7-4d3133d2b2bd/1
  • /504eb741-1966-49fe-a6e7-4d3133d2b2bd/2
  • /504eb741-1966-49fe-a6e7-4d3133d2b2bd/3
  • /504eb741-1966-49fe-a6e7-4d3133d2b2bd/100
  • /504eb741-1966-49fe-a6e7-4d3133d2b2bd/101

Here “uuid” is the unique UUID of a check, and “n” is the serial number of the received ping: “1” for the first received ping, “2” for the second received ping, and so on.

Now, let’s say we are cleaning up old objects and want to delete all objects with uuid=504eb741-1966-49fe-a6e7-4d3133d2b2bd and n<50. How to do that? With the above naming scheme, we could:

  1. Retrieve a list of all objects with the prefix /504eb741-1966-49fe-a6e7-4d3133d2b2bd/.
  2. Filter the list, keeping only the entries with n<50
  3. Then run the DeleteObjects API call and pass the filtered list to it.

I noticed the list_objects call has an optional start_after argument; perhaps it can be used to avoid the client-side filtering (step 2)?

Yes, it can – if we add specially crafted sorting prefixes to the object keys:

  • /504eb741-1966-49fe-a6e7-4d3133d2b2bd/zi-1
  • /504eb741-1966-49fe-a6e7-4d3133d2b2bd/zh-2
  • /504eb741-1966-49fe-a6e7-4d3133d2b2bd/zg-3
  • /504eb741-1966-49fe-a6e7-4d3133d2b2bd/xijj-100
  • /504eb741-1966-49fe-a6e7-4d3133d2b2bd/xiji-101

If we want all keys with n<50, we can now do:

list_objects(prefix="504eb741-...-4d3133d2b2bd/", start_after="yej")

Exercise time: looking at just the above examples, can you work out how the zi, zh etc. prefixes are generated, and why this works?

If you are interested, here is the function that generates the sorting prefix.
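Spoiler warning – here is one possible implementation, reconstructed from the examples above (the actual function in the Healthchecks source may differ in details):

def sorting_prefix(n: int) -> str:
    # Encode n so that larger n sorts lexicographically *earlier*.
    digits = str(n)
    # The first character encodes the digit count: 1 digit -> "z", 2 -> "y", 3 -> "x", ...
    result = chr(ord("z") - len(digits) + 1)
    # Each decimal digit d maps to chr(ord("j") - d), so larger digits sort earlier.
    for d in digits:
        result += chr(ord("j") - int(d))
    return result

assert sorting_prefix(1) == "zi"
assert sorting_prefix(50) == "yej"
assert sorting_prefix(100) == "xijj"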

Boolean Serialization Issue

I ran into an issue when using minio-py’s remove_objects call: when generating a request XML, it was serializing boolean values as True and False, instead of true and false. When testing, this was accepted by the AWS S3 API, but both Scaleway and OVH were rejecting these requests as invalid.

  • I filed an issue with minio-py, and they fixed the code to serialize boolean values to lowercase strings.
  • I reported the issue to Scaleway and OVH; both fixed their S3 implementations to accept capitalized boolean values.

Object Storage Cleanup

Let’s say a user is closing their Healthchecks account, and we want to delete their data. With Django and relational databases, it is remarkably easy to do:

user.delete()  # that's it

Django will delete the user record from the auth_user table, and will also take care of deleting all dependent objects: projects, checks, channels, pings, notifications, etc. All of that, with one line of code!

For the S3 object storage, though, we will need to take care of data cleanup ourselves. I wrote a pruneobjects management command which iterates through the S3 bucket and removes all objects referencing checks that no longer exist in the database.
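A simplified sketch of such a cleanup command, using minio-py (the bucket name and the set of existing check UUIDs are placeholders):

from minio import Minio
from minio.deleteobjects import DeleteObject

client = Minio("s3.some-provider.example", access_key="...", secret_key="...")
existing_check_uuids = set()  # in reality, loaded from the database

to_delete = []
for obj in client.list_objects("ping-bodies", recursive=True):
    check_uuid = obj.object_name.split("/")[0]
    if check_uuid not in existing_check_uuids:
        to_delete.append(DeleteObject(obj.object_name))

# remove_objects returns a lazy iterator of errors; it must be consumed
for error in client.remove_objects("ping-bodies", to_delete):
    print("Deletion error:", error)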

Testing Object Storage Providers

I initially planned to use Scaleway Object Storage. I contacted their support and got a confirmation that my planned use case is reasonable. As I was using Scaleway to test my work-in-progress code, I saw their DeleteObjects API calls were rather slow. They would often take seconds, and sometimes tens of seconds to complete. Around that time Scaleway object storage also happened to have a multi-hour outage. API calls were returning “InternalError” responses, the dashboard was not working.

I switched my focus to OVH. Same as with Scaleway, I contacted OVH support and described my use case and planned usage patterns. I explicitly asked about API request rate limits; they said there were none. I set up the account and got busy testing. The API operations seemed significantly quicker. DeleteObjects would typically complete in under a second.

I did run into several (hopefully teething) troubles with OVH too. The API would sometimes return “ServiceUnavailable, Please reduce your request rate.” OVH would acknowledge the issue with this masterpiece of an explanation:

The problem you have encountered is due to occasional operations that have taken place on the platform.

When the number of objects in the bucket went above 500,000, the OVH dashboard couldn’t display the bucket’s contents anymore. The page would take a long time to load and eventually display “Internal server error”. This issue has not been resolved yet. But the API works.

“Ping Body Not Yet Available” Special Case

If ping bodies are being uploaded asynchronously, we can run into a situation where we want to show the ping body to the user, but it is still sitting in a queue, waiting to be uploaded to S3. Here’s an example scenario:

  • Client sends a “fail” event with data in the request body.
  • Ping handler registers the ping and adds the body data to the upload queue.
  • Milliseconds later, the “sendalerts” process sees the failure and prepares an email notification. It needs the ping body, which is not present in the S3 bucket yet.

Note that the ping handler and sendalerts may be running on different machines, so sendalerts cannot peek into the upload queue either.

My “good enough” solution for this was to add a conditional delay to the email sending logic:

  • Fetch the request body from S3.
  • If not found, wait 5 seconds, then fetch it again.
  • If still nothing, use a “The request body data is being processed” fallback message in the email.

The idea here is that request bodies usually upload quickly. Assuming normal operation and no significant backlog, 5 seconds should be plenty. But if the request body is still not available after the 5 seconds, we don’t want to delay the email notification too much, and use the fallback message.
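In code, the conditional delay is roughly this (the function and constant names are made up):

import time

FALLBACK = "The request body data is being processed"

def get_body_for_notification(ping):
    body = fetch_body_from_s3(ping)  # hypothetical S3 lookup
    if body is None:
        time.sleep(5)  # give the uploader a moment to catch up
        body = fetch_body_from_s3(ping)
    return body if body is not None else FALLBACK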

S3 Backup

In theory, OVH claims a 100% resilience rate for their object storage service. But we know entire data centers can and sometimes do burn down, and ultimately it is our responsibility to be able to recover the data. My S3 backup solution is a cron job on a dedicated VPS, doing the following:

  • Download entire contents of the bucket using “aws s3 sync”.
  • Pack the files together using tar, encrypt them with gpg, and upload the resulting file to a different bucket at a different provider.
#!/bin/bash

SRC_ENDPOINT=https://s3.sbg.perf.cloud.ovh.net
BUCKET=***
TIMESTAMP=`date +"%Y%m%d-%H%M%S"`
DST_PATH=s3://***/objects-$TIMESTAMP.tar.gpg

set -e

runitor -uuid *** \
    -- aws --profile src --endpoint-url $SRC_ENDPOINT s3 sync s3://$BUCKET $BUCKET --delete

tar -cf - $BUCKET | gpg --encrypt --recipient 832DDD6E | aws --profile dst s3 cp - $DST_PATH

The “aws” command is provided by the awscli tool. s3cmd also has a “sync” command, but in my testing, it could not handle a bucket with hundreds of thousands of objects.

The “n % 50 == 0” Bug

As I was working on implementing the S3 backup, I noticed that the bucket contained more data than I was expecting. Some checks had 8000 or more ping bodies stored. How?

The cleanup logic for asynchronous uploads is:

  • Pick a ping body from the queue, upload it to S3.
  • If the ping’s serial number is divisible by 50, run a cleanup routine.

The idea is to run the cleanup routine every 50 pings. Now, what happens if the client sends alternating “start” events as HTTP GET requests, and “success” events as HTTP POST with a request body? We can have a situation where every POST has an odd serial number, and so our cleanup routine never runs! My “good enough” fix here was to change the constant “50” to a non-even number.

The 10KB to 100KB Limit Increase

With the above in place, I added OVH to the list of sub-processors in the Privacy Policy, increased the ping body limit to 100KB and gradually rolled out the changes to production servers. After several days of testing to see if everything is coping well, I announced the limit increase on Twitter.

Here are graphs from Netdata showing the object uploads per second, and the backlog size, aggregated across all web servers:

And that’s how “we” moved some data to S3. Thanks for reading!
–Pēteris.

Using OpenSMTPD as a Local Relay-Only MTA

I recently made a change to how Healthchecks sends transactional email. Before:

The Healthchecks Django app connects directly to a 3rd-party SMTP relay (think AWS SES, SendGrid, Mailgun; in our specific case it is Elastic Email), and sends SMTP commands over a TLS-encrypted connection. If the send operation fails, the Django app retries a couple of times, then gives up, and the email is lost.

After:

A local OpenSMTPD instance runs on the same machine as the Django app. It accepts connections from local clients only, and relays all received messages to the external SMTP relay operated by, in our case, Elastic Email.

In this setup, the Django app can quickly hand off the outgoing emails to OpenSMTPD, and OpenSMTPD retries failed sends for minutes, hours or even days. If the 3rd-party SMTP relay has an outage, emails are not lost, just delayed. At least that’s the theory – we shall see how well this works in practice.
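On the Django side, the only change needed is to point the email settings at the local relay. Roughly, for the stock SMTP backend:

# Django settings: hand outgoing mail to the local OpenSMTPD instance
EMAIL_HOST = "127.0.0.1"
EMAIL_PORT = 25
EMAIL_USE_TLS = False  # the connection never leaves the machine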

The OpenSMTPD configuration for this use case is surprisingly compact:

table secrets file:/etc/mail/secrets
listen on lo inet4 port 25
action "relay" relay host smtp+tls://smtp@external.smtp.host:587 auth <secrets>
match for any action "relay"

And /etc/mail/secrets contains:

smtp external-smtp-username:external-smtp-password

I also experimented with Postfix (as recommended here), and it gets the job done too. I also considered more lightweight relay-only MTAs: dma and nullmailer. Neither supports listening on port 25; instead, you enqueue emails by piping data to /usr/sbin/sendmail. This complicates integration with the Django app somewhat. I ultimately went with OpenSMTPD because it seemed to have the right balance of features and simplicity.

Healthchecks.io Hosting Setup, 2022 Edition

Here’s the summary of the hardware and the software that powers Healthchecks.io.

Hardware

Since 2017, Healthchecks.io runs on dedicated servers at Hetzner. The current lineup is:

  • HAProxy servers: 4x AX41-NVMe servers (Ryzen 3600, 6 cores)
  • Web servers: 3x AX41-NVMe servers (Ryzen 3600, 6 cores)
  • PostgreSQL servers: 2x AX101 servers (Ryzen 5950X, 16 cores)

All servers are located in the Falkenstein data center park, scattered across the FSN-DCx data centers so they are not all behind the same core switch. The monthly Hetzner bill is €484.

Software

  • Ubuntu 20.04 on all machines.
  • Systemd manages services that need to run continuously (haproxy, nginx, postgresql, etc.)
  • WireGuard for private networking between the servers. Tiered topology: HAProxy servers cannot talk to PostgreSQL servers.
  • Netdata agent for monitoring the machines and the services running on them. Connected to Netdata Cloud for easy overview of all servers.
  • HAProxy 2.2 for terminating TLS connections, and load balancing between app servers. Enables easy rolling updates of application servers.
  • PostgreSQL 13, streaming replication from primary to standby. No automatic failover: I can trigger failover with a single command, but the decision is manual.

On app servers:

  • uWSGI runs the Healthchecks Python application (web frontend, management API).
  • hchk, a small application written in Go, handles ping API (hc-ping.com) and inbound email.
  • NGINX handles rate limiting, static file serving, and reverse proxying to uWSGI and hchk.

SaaS Tools

  • AWS S3 for storing encrypted database backups.
  • Braintree for accepting payments and managing subscriptions.
  • Cloudflare for hosting DNS records.
  • Elastic Email for sending transactional email.
  • Fastmail for sending and receiving support email.
  • GitHub for version control and tracking issues, and GitHub Actions for running tests on every commit.
  • Hardypress for blog.healthchecks.io (static WordPress blog as-a-service).
  • HetrixTools for uptime monitoring.
  • IcoMoon for authoring icon fonts.
  • pgDash for monitoring PostgreSQL servers. Here’s a blog post about setting it up.
  • PingPong for powering status.healthchecks.io (service status, incidents, planned downtimes, performance metrics).
  • SSLMate for provisioning certificates from command-line.
  • Syften for getting notifications when Healthchecks is mentioned on HN, Twitter, Reddit and elsewhere.
  • Twilio for sending SMS, WhatsApp and phone call notifications.

Cron Jobs

Healthchecks.io, the cron job monitoring service, uses cron jobs itself for the following periodic tasks:

  • Once a day, make a full database backup, encrypt it with gpg, and upload it to AWS S3.
  • Once a day, send “Your account is inactive and is about to be deleted” notifications to inactive users.
  • Once a day, send “Your subscription will renew on …” for annual subscriptions that are due in 1 month.

Bonus – Development and Deployment Setup

  • My main dev machine is a desktop PC with a single 27″ 1440p display.
  • Ubuntu 20.04, GNOME Shell.
  • Sublime Text for editing source code. A combination of meld, Sublime Merge and command-line git for working with git.
  • Yubikeys for signing git commits and logging into servers.
  • Fabric scripts for deploying code and running maintenance tasks on servers.
  • sops for storing secrets.
  • A dedicated laptop inside a dedicated backpack, for dealing with emergencies while away from the main PC.

Comments, questions, ideas? Let me know via email or on Twitter!

How to Send Email From Cron Jobs

Let’s say you are writing a shell script for a systems housekeeping task. Perhaps the script is uploading backups to a remote server, or it is cleaning up old data, or it is making a measurement and submitting it to somebody else’s HTTP API. What is the least cumbersome way for the script to contact you in case of problems? Let’s say you want to be contacted via email – how to make it work?

Cron and MAILTO

If you run the script from cron, you can look into using cron’s MAILTO= option. You put a MAILTO=you@example.org line in your crontab, and, when a job produces output (which usually means something went wrong), cron will mail that output to the specified address using the system’s MTA.
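For example, a crontab using this option might look like the following (the address and the script path are just examples):

MAILTO=you@example.org
# Any output the job produces (stdout or stderr) is mailed to the address above
0 3 * * * /opt/scripts/backup.sh

OK, then, what MTA to use?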

sSMTP

sSMTP is a send-only MTA which hands off messages to an external SMTP server that you configure in /etc/ssmtp/ssmtp.conf. For example, you can create a dedicated Gmail address for sending notifications, and use it in ssmtp.conf like so:

Root=username@gmail.com
Mailhub=smtp.gmail.com:465
AuthUser=username@gmail.com
AuthPass=google-app-password-here
UseTLS=YES

Gmail-specific note: for the Gmail SMTP service to accept your credentials, you will need to set up and use an app password. To use app passwords in your Google account, you will also need to set up 2-step verification. The app password must be guarded with the same care as your account’s main password, so putting it in ssmtp.conf is not ideal. This is why I would strongly recommend using a separate Gmail account, not your main account, for this.

With sSMTP installed and configured, sSMTP will pass messages on to Gmail, and Gmail will deliver them to your inbox. This will make cron’s MAILTO option just work, and you can then also send messages from shell scripts using the mailx program:

echo "Hello world" | mailx -s "Subject goes here" you@example.org

Similar sSMTP alternatives are nullmailer and msmtp.

I used Gmail as example, but if you use a different email provider, it likely provides an SMTP interface as well. Transactional email services also typically provide an SMTP interface. For example, I’ve used sSMTP with Mailgun, and it works great. In short, sSMTP needs working SMTP credentials – it does not matter if they are from Fastmail, Zoho, AWS SES, Sendgrid, Mailjet or something else.

HTTP API

Transactional email services usually provide an HTTP API. Depending on the provider, the API can be simple enough to use with a single curl call. Here’s an example with the already mentioned Mailgun:

curl --user "api:your-api-key-here" \
     https://api.mailgun.net/v3/yourdomain.com/messages \
     -F from='sender@yourdomain.com' \
     -F to='you@example.org' \
     -F subject='Subject goes here' \
     -F text='Message body goes here'

The upside of this approach is you don’t need to install and configure anything on your server beforehand (assuming curl is preinstalled). One downside is you cannot use this method with cron’s MAILTO, as curl is not an MTA. But you can use this method just fine from scripts.

The HTTP API is also easy to use from a Python script:

import os
import requests

def email(subject, text):
    url = "https://api.mailgun.net/v3/yourdomain.com/messages"
    auth = ("api", os.getenv("MAILGUN_KEY"))
    data = {
        "from": "sender@yourdomain.com",
        "to": "you@example.org",
        "subject": subject,
        "text": text,
    }

    requests.post(url, auth=auth, data=data)

Hosted Log Management Systems

Here, I’m thinking of systems like LogDNA and Papertrail. You configure your system’s syslog to ship system logs to your chosen log management system. In that system, you set up alerting rules like “alert me if this specific keyword appears in logs this many times in this long time window”. And that is all!

Logging to syslog from scripts is easy using the logger command:

logger Hello World!

As an example, here’s a notification I received from Papertrail a few days ago. It runs a saved search once per hour and sends a notification if the search produces any results:

Cron Monitoring Services

And here I am of course thinking of Healthchecks.io, but there are good alternatives too. The monitoring service provides a unique URL that the housekeeping script must request regularly. When an HTTP request from the script does not arrive on time, Healthchecks.io detects that and alerts you. This takes care of scenarios where, for example, the server has been shut down and so is unable to contact you by itself. Healthchecks.io offers a lot more features, but this is the basic idea.
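A typical setup wraps the real command and pings Healthchecks.io only if it succeeded (the UUID and script path are placeholders; -o /dev/null keeps curl’s output from generating cron mail):

0 3 * * * /opt/scripts/backup.sh && curl -fsS -m 10 --retry 5 -o /dev/null https://hc-ping.com/your-uuid-here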

Here’s an example notification I received from Healthchecks.io. My home router is pinging Healthchecks.io. When the home connection goes down, I get an alert:

In summary:

  • sSMTP is a good choice for use in combination with cron’s MAILTO option
  • HTTP APIs are handy for sending emails from scripts, without needing to install or configure any additional software
  • Log management systems and cron job monitoring systems can send email notifications, but they specialize in specific tasks. Log management systems can notify you about patterns in log files, cron job monitoring systems can detect irregularities in the heartbeat signals sent by your background tasks.

How Debian Cron Handles DST Transitions

When daylight saving time starts, and the computer’s clock jumps forward, what does cron do? If the clock jumps from 1AM to 2AM, and there is a job scheduled for 1:30AM, will cron run this job? If yes, when? Likewise, when daylight saving time ends, and the clock jumps backward, will cron possibly run the same scheduled job twice?

Let’s look at what “man cron” says. On my Debian-based system, under the “Notes” section:

Special considerations exist when the clock is changed by less than 3 hours, for example at the beginning and end of daylight savings time. If the time has moved forwards, those jobs which would have run in the time that was skipped will be run soon after the change. Conversely, if the time has moved backwards by less than 3 hours, those jobs that fall into the repeated time will not be re-run.

Only jobs that run at a particular time (not specified as @hourly, nor with ‘*’ in the hour or minute specifier) are affected. Jobs which are specified with wildcards are run based on the new time immediately.

Clock changes of more than 3 hours are considered to be corrections to the clock, and the new time is used immediately.

After a fair bit of experimenting, I can say the above is mostly accurate. But it takes some explaining, at least it did for me. Debian cron distinguishes between “fixed-time jobs” and “wildcard jobs”, and handles them differently when the clock jumps forward or backward.

Wildcard jobs. Here’s a specific example: */10 * * * *, or, in human words, “every 10 minutes”. Debian cron will try to maintain even 10-minute intervals between each execution.

Fixed-time jobs. Now consider 30 1 * * *, or “run at 1:30AM every day”. Here, the special DST handling logic will kick in. If the clock jumps an hour forward from 1AM to 2AM, Debian cron will execute the job at 2AM. And, if the clock jumps from 2AM to 1AM, Debian cron will not run the job again at the second 1:30AM.

What are the precise rules for distinguishing between wildcard jobs and fixed-time jobs? Let’s look at the source code!

Source

Debian cron is based on Vixie cron, but it adds Debian-specific feature and bugfix patches on top. The special DST handling logic is one such patch. I found Debian cron source code at salsa.debian.org/debian/cron/. Here is the DST patch: Better-timeskip-handling.patch.

Unless you are already familiar with cron source, to understand the patch, you would want to see it in context. We can apply Debian patches in the correct order using the quilt tool:

$ git clone https://salsa.debian.org/debian/cron.git
$ cd cron
$ QUILT_PATCHES=debian/patches quilt push -a

Now we can read through entry.c and cron.c and learn how they work. My C skills are somewhere at the FizzBuzz level so this is a little tricky. Anyway, it looks like cron parses the expression one character at a time. At every step, it knows how far into the expression it is, whether it is parsing a number, a range, a symbolic month reference, and so on. If the first character of the minute or the hour specifier is the wildcard, it sets the MIN_STAR or HR_STAR flags. It later uses these flags to decide whether to use the special DST handling logic.

Here’s what this means for specific examples:

  • * 1 * * * (every minute from 1:00 to 1:59) is a wildcard expression because the minute specifier is “*”.
  • 15 * * * * (at 15 minutes past every hour) is a wildcard expression because the hour specifier is “*”.
  • 15 */2 * * * (at 0:15, 2:15, 4:15, …) is also a wildcard expression because the hour specifier starts with “*”.
  • 0-59 1 * * * (every minute from 1:00 to 1:59) is not a wildcard expression because neither the minute specifier nor the hour specifier starts with “*”.

Quite interesting! But I am not a C compiler (gasp!), and my interpretation may very well be wrong. Let’s test this experimentally, by actually running Debian cron. And, since we are impatient, let’s speed up time using QEMU magic.

QEMU Magic

I followed these instructions to install Debian in QEMU. I then launched QEMU with the following parameters:

$ qemu-system-x86_64 -nographic -m 8G -hda test.img -icount shift=0,align=off,sleep=off -rtc base=2021-03-27,clock=vm

The -icount (instruction counter) parameter is the main hero here. By setting align=off,sleep=off we allow the emulated system’s clock to run faster than real time – as fast as the host CPU can manage. We can also tweak the shift parameter for even faster time travel (read the QEMU man page for more on this).

Inside the emulated system, I set the system timezone to “Europe/Dublin”, and added my test entries in root’s crontab. I tested many different expressions, but let’s look at the following two – the first one is a wildcard job, and the second one is a fixed-time job right in the middle of the DST transition window for Europe/Dublin:

$ crontab -l
30 * * * * logger -t experiment1 `date -Iseconds`
30 1 * * * logger -t experiment2 `date -Iseconds`

For the “Europe/Dublin” timezone, in the year 2021, daylight saving time started on March 28 at 1AM. The clock moved 1 hour forward. Let’s see how Debian cron handles it:

$ journalctl --since "2021-03-27" -t experiment1  
[...]
Mar 27 23:30:01 debian experiment1[1016]: 2021-03-27T23:30:01+00:00
Mar 28 00:30:01 debian experiment1[3456]: 2021-03-28T00:30:01+00:00
Mar 28 02:30:01 debian experiment1[3866]: 2021-03-28T02:30:01+01:00
Mar 28 03:30:01 debian experiment1[3887]: 2021-03-28T03:30:01+01:00
[...]

We can see the wildcard job ran 30 minutes past every hour, but the entry for 1:30 is missing. This is because this time “doesn’t exist”: the local time skipped from 00:59 directly to 02:00. Now let’s look at the fixed-time job:

$ journalctl --since "2021-03-27" -t experiment2
Mar 27 01:30:01 debian experiment2[366]: 2021-03-27T01:30:01+00:00
Mar 28 02:00:01 debian experiment2[3849]: 2021-03-28T02:00:01+01:00
Mar 29 01:30:01 debian experiment2[4551]: 2021-03-29T01:30:01+01:00    
[...]

On March 28, the job was scheduled to run at 01:30, but instead, it was run at 02:00. This is Debian cron’s special DST handling in action: “If the time has moved forwards, those jobs which would have run in the time that was skipped will be run soon after the change.”

Now let’s look at October 2021. For the “Europe/Dublin” timezone, daylight saving time ended on October 31 at 2AM. The clock moved 1 hour back.

$ journalctl --since "2021-10-30" -t experiment1
[...]
Oct 31 00:30:01 debian experiment1[1166]: 2021-10-31T00:30:01+01:00
Oct 31 01:30:01 debian experiment1[1191]: 2021-10-31T01:30:01+01:00
Oct 31 01:30:01 debian experiment1[1212]: 2021-10-31T01:30:01+00:00
Oct 31 02:30:01 debian experiment1[1233]: 2021-10-31T02:30:01+00:00
[...]

In this one, it appears as if the wildcard job ran twice at 1:30. But, if you look closely at the ISO8601 timestamp, you can see the timezone offsets are different. The first run was before the DST transition, then the clock moved 1 hour back, and the second run happened an hour later. Debian cron maintains a regular cadence for wildcard jobs (60 minutes for this job). Now, the fixed-time job:

$ journalctl --since "2021-10-30" -t experiment2
Oct 30 01:30:01 debian experiment2[444]: 2021-10-30T01:30:01+01:00
Oct 31 01:30:01 debian experiment2[1192]: 2021-10-31T01:30:01+01:00
Nov 01 01:30:01 debian experiment2[1950]: 2021-11-01T01:30:01+00:00    
[...]

The fixed-time job was executed once at 01:30 but was not run again an hour later. This is again thanks to the special DST handling: “if the time has moved backwards by less than 3 hours, those jobs that fall into the repeated time will not be re-run”.

Let’s also check if Debian cron treats 0-59 1 * * * as a wildcard or a fixed-time job.

$ crontab -l
0-59 1 * * * logger -t experiment3 `date -Iseconds`

$ journalctl --since "2021-03-27" -t experiment3
[...]
Mar 27 01:57:01 debian experiment3[598]: 2021-03-27T01:57:01+00:00
Mar 27 01:58:01 debian experiment3[602]: 2021-03-27T01:58:01+00:00
Mar 27 01:59:01 debian experiment3[606]: 2021-03-27T01:59:01+00:00
Mar 28 02:00:01 debian experiment3[1218]: 2021-03-28T02:00:01+01:00
Mar 28 02:00:01 debian experiment3[1222]: 2021-03-28T02:00:01+01:00
Mar 28 02:00:01 debian experiment3[1226]: 2021-03-28T02:00:01+01:00
[...]

On March 27, the job ran at minute intervals, but on March 28 the runs are all bunched up at 02:00. In other words, Debian cron treated this as a fixed-time job and applied the special handling.

I’ve found the QEMU setup to be a handy tool for checking assumptions and hypotheses about cron’s behavior. Thanks to the accelerated clock, experiments take minutes or hours, not days or weeks.

Who Cares, and Closing Notes

Who cares about all this? Well – I do! Healthchecks.io is a cron job monitoring service, its cron handling logic needs to be as robust and correct as possible.

Like many other Python projects, Healthchecks used croniter for handling cron expressions. It did not seem viable to fix DST handling bugs in croniter, so I started a separate library, cronsim. It is smaller, quicker, and tested against Debian cron with 5000+ different cron expressions.
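cronsim usage looks roughly like this (I am sketching the API here; check the project’s README for the exact details):

from datetime import datetime
from zoneinfo import ZoneInfo

from cronsim import CronSim

# Iterate the run times of a fixed-time job across the Europe/Dublin
# spring-forward transition. The results should mirror Debian cron:
# the skipped 01:30 on March 28 runs at 02:00 instead.
it = CronSim("30 1 * * *", datetime(2021, 3, 27, 12, 0, tzinfo=ZoneInfo("Europe/Dublin")))
for _ in range(3):
    print(next(it))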

Ah, but why target Debian cron and not some other cron implementation? To be honest, primarily because I happen to use Ubuntu (a Debian derivative) on all my systems. I also suspect Debian and its derivatives together have a large if not the largest server OS market share, so it is a reasonable target.

One final note: there is a simple alternative to dealing with the DST complexity. Use UTC on your servers!

That’s all for now, thanks for reading!

New Feature: Slug URLs

The Healthchecks.io pinging API has always been based on UUIDs. Each Check in the system has its own unique and immutable UUID. To send a success signal (“ping”), clients make a request to https://hc-ping.com/ with the check’s UUID added at the end:

https://hc-ping.com/d1665499-e827-441f-bf13-8f15e8f4c0b9

To signal a start, a failure, or a particular exit status, clients can add more bits after the UUID:

https://hc-ping.com/d1665499-e827-441f-bf13-8f15e8f4c0b9/start
https://hc-ping.com/d1665499-e827-441f-bf13-8f15e8f4c0b9/fail
https://hc-ping.com/d1665499-e827-441f-bf13-8f15e8f4c0b9/123

This is conceptually simple and has worked quite well. It requires no additional authentication. The UUID value is the authentication, and the UUID “address space” is so vast nobody is going to find valid ping URLs by random guessing any time soon.

Still, UUID-based ping URLs have downsides too.

UUIDs are not particularly human-friendly. Unless you are good at memorizing UUIDs, it is not easy to associate a ping URL with a check just by looking at it. And it is easy to make mistakes when copy/pasting UUIDs around.

Each UUID is a secret. Therefore, if you have many things to monitor, you must keep many secrets. Let’s consider a specific example: a web application that does various housekeeping tasks on a schedule. Each housekeeping task has a corresponding Check in Healthchecks.io and a ping URL. The web app stores its configuration, including ping URLs, in environment variables: FOO_TASK_URL, BAR_TASK_URL, and so on. This is all well and good. But, as the web app grows and adds new types of housekeeping tasks, the number of environment variables can get out of hand. In one specific project I’m working on, there are already 15 environment variables for storing ping URLs, and there will likely be more. Wouldn’t it be nice if there was a way to store just a single secret, and derive all ping URLs from it?

Introducing: Slug URLs

In slug URLs, we replace the UUID with two components, a ping key and a slug:

https://hc-ping.com/<ping-key>/<slug>

Here’s a concrete example:

https://hc-ping.com/fqOOd6-F4MMNuCEnzTU01w/db-backups

Slug URLs support start and failure signals the same way as UUID URLs do:

https://hc-ping.com/fqOOd6-F4MMNuCEnzTU01w/db-backups/start
https://hc-ping.com/fqOOd6-F4MMNuCEnzTU01w/db-backups/fail
https://hc-ping.com/fqOOd6-F4MMNuCEnzTU01w/db-backups/123

All checks in a single project share the same ping key. You can look up or generate the ping key in your project’s Settings screen, right next to your project’s API keys:

Healthchecks.io derives the slug from the check’s name using Django’s slugify function. The slugify function applies the following transformations:

  • Converts to ASCII.
  • Converts to lowercase.
  • Removes characters that aren’t alphanumerics, underscores, hyphens, or whitespace.
  • Replaces any whitespace or repeated hyphens with single hyphens.
  • Removes leading and trailing whitespace, hyphens, and underscores.

Here are a few specific examples of check names and the resulting slugs:

Name                     Slug
DB Backup                db-backup
Backup /opt/some/path    backup-optsomepath
server1 -> server2       server1-server2
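You can reproduce these with Django’s slugify directly:

from django.utils.text import slugify

slugify("DB Backup")               # "db-backup"
slugify("Backup /opt/some/path")   # "backup-optsomepath"
slugify("server1 -> server2")      # "server1-server2"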

Going back to the example of the web app with housekeeping tasks: with slug URLs, the web app would need to store just one secret – the pinging key – and would be able to construct all ping URLs from it. Here is a rough example in Python:

# foo_task.py
import os
import requests

# ... do some work here ...

requests.get("https://hc-ping.com/%s/foo_task" % os.getenv("PING_KEY"))

The ping URLs are also more human-friendly. The slug part helps you tell them apart.

Q&A

Q: How can I use slug URLs in my Healthchecks.io project?
A: First, generate the ping key in your project’s Settings page. Next, click on “slug” on the Checks page:

Q: What if a check has no name?
A: The check will have no corresponding slug URL then:

Q: What if multiple checks in the project have the same name?
A: They will also have the same slug. When you try to ping them, you will get an HTTP 409 response with the text “ambiguous slug” in the response body.

Q: Can I use UUID and slug URLs at the same time?
A: Yes, you can use UUID URLs and slug URLs interchangeably.

Q: Then what does the uuid / slug selector do, exactly?
A: It selects which URL format is used for display in the list and details views on Healthchecks.io.

Current Status

Slug URLs are implemented and ready for use on Healthchecks.io. This is a brand new feature and it will likely receive refinements over time. If you notice problems when using slug URLs, or want to suggest improvements, please send your feedback to contact@healthchecks.io. I will appreciate it!

–Pēteris