Pēteris Caune

OnCalendar schedules: Monitor Systemd Timers with Healthchecks.io

Healthchecks now supports OnCalendar schedules, used for scheduling tasks with systemd timers. Here’s what’s new: when creating a check, you can now switch between “Simple”, “Cron” and “OnCalendar” schedules:

You can also edit schedules (and switch schedule types) for existing checks:

The UI control for entering the schedule is a multi-line textbox, and yes, you can specify multiple schedules there – Healthchecks will expect a ping when any schedule matches:

Note: the schedule field is currently limited to 100 characters. You will be able to enter 2-3 schedules, but probably not 10+ schedules.

systemd allows you to specify a timezone inside the OnCalendar expression. So does Healthchecks:
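For example, a timer unit that fires on weekday mornings in a specific timezone could look like this (a sketch; the unit name and schedule are made up):

```ini
# illustrative-backup.timer (hypothetical unit name)
[Unit]
Description=Run the nightly backup on weekday mornings

[Timer]
# Fire at 08:00 on weekdays, evaluated in the Europe/Riga timezone
OnCalendar=Mon..Fri 08:00 Europe/Riga

[Install]
WantedBy=timers.target
```

To verify how systemd interprets an expression, you can run systemd-analyze calendar "Mon..Fri 08:00 Europe/Riga", which prints the normalized form and the next elapse time.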

The API now supports OnCalendar schedules as well. You can pass either a cron schedule or OnCalendar expression(s) in the “schedule” field for the Create a new check and Update an existing check calls, and Healthchecks will detect the schedule type automatically:

$ curl -s https://healthchecks.io/api/v3/checks/ \
    --header "X-Api-Key: fdYYw32ftDvYQoCe4C1JUgp7SlPbOYTI" \
    --data '{"name": "Runs at 8AM", "schedule": "8:00"}' | jq .
{
  "name": "Runs at 8AM",
  "slug": "",
  "tags": "",
  "desc": "",
  "grace": 3600,
  "n_pings": 0,
  "status": "new",
  "started": false,
  "last_ping": null,
  "next_ping": null,
  "manual_resume": false,
  "methods": "",
  "subject": "",
  "subject_fail": "",
  "start_kw": "",
  "success_kw": "",
  "failure_kw": "",
  "filter_subject": false,
  "filter_body": false,
  "ping_url": "https://hc-ping.com/97f70e1c-bf2b-4244-ba44-de413c93fab4",
  "update_url": "https://healthchecks.io/api/v3/checks/97f70e1c-bf2b-4244-ba44-de413c93fab4",
  "pause_url": "https://healthchecks.io/api/v3/checks/97f70e1c-bf2b-4244-ba44-de413c93fab4/pause",
  "resume_url": "https://healthchecks.io/api/v3/checks/97f70e1c-bf2b-4244-ba44-de413c93fab4/resume",
  "channels": "",
  "schedule": "8:00",
  "tz": "UTC"
}

Under the hood, the OnCalendar schedule parsing logic is implemented in a separate “oncalendar” library. Feel free to use it in your own Python projects as well!

The OnCalendar schedule support is live on https://healthchecks.io and available to all accounts. Happy monitoring!

–Pēteris

Comparison of Cron Monitoring Services (November 2023)

In this post I’m comparing the cron monitoring features of four services: Cronitor, Healthchecks.io, Uptime Robot, and Sentry.

How I picked the services for comparison: I searched for “cron monitoring” on Google and picked the top results in their order of appearance.

Disclaimer: I run Healthchecks.io, so I’m a biased source. I’ve tried to get the facts right, but choosing what features to compare, and what differences to highlight, is of course subjective. When in doubt, do your own research!

Business Stats

Cronitor launched in 2014, is registered in the United States and runs on AWS. Cronitor is a bootstrapped company, and is operated by three friendly humans. Cronitor started as a cron monitoring service, but has expanded to website uptime monitoring, real user monitoring, and hosted status pages. Cronitor is a proprietary product and uses the SaaS business model.

Healthchecks.io launched in 2015, is registered in Latvia and runs on Hetzner (Germany). Healthchecks.io is a bootstrapped company, run by a solo founder. Healthchecks.io focuses on doing one thing and doing it well: alerting when something does not happen on time. Healthchecks.io is open source (source on GitHub), users can use the hosted service, or run a self-hosted instance.

Uptime Robot launched in 2010, is registered in Malta, and runs on Limestone Networks, AWS, and DigitalOcean. Uptime Robot started as a free website uptime monitoring service and added cron monitoring and hosted status page support in 2019. After getting acquired in late 2019, Uptime Robot accelerated development and reorganized its pricing structure. Uptime Robot is a proprietary product and uses the SaaS business model.

Sentry launched in 2012, is registered in the United States and runs on AWS and Google Cloud. Sentry is a VC-funded company and has 200+ employees. Sentry started as an error tracking service, grew into APM, and launched cron monitoring support in public beta in January 2023. Sentry uses the SaaS business model, but its source code is available under the FSL license. Sentry is a complex product with many moving parts. Self-hosting is possible but is not trivial.

Pricing

Each reviewed service except Healthchecks.io bundles several products under one account:

  • Cronitor: cron monitoring, website uptime monitoring, RUM, status pages.
  • Uptime Robot: website uptime monitoring, cron monitoring, status pages.
  • Sentry: error tracking, APM, code coverage.

The total set of functionality you get from a paid account on each service is vastly different, so their pricing is not directly comparable. With that in mind, here is the pricing summary for each service, as of November 2023, for monitoring cron jobs specifically.

Cronitor

  • Free plan: monitor up to 5 jobs.
  • Business plan: $2/mo for 1 job.

Monitoring 100 jobs with Cronitor would cost $200/mo.

Healthchecks.io

  • Free plan: monitor up to 20 jobs.
  • Business plan: $20/mo for 100 jobs.
  • Business Plus plan: $80/mo for 1000 jobs.

Monitoring 100 jobs with Healthchecks.io would cost $20/mo. Healthchecks.io offers sponsored accounts for non-profits and open-source projects (details).

Uptime Robot

  • Solo plan: $8/mo for 10 jobs or $19/mo for 50 jobs.
  • Team plan: $34/mo for 100 jobs.
  • Enterprise plan: $64/mo for 200 jobs.

Monitoring 100 jobs with Uptime Robot would cost $34/mo. Uptime Robot offers sponsored accounts for charities and other non-profits (details).

Sentry

Sentry’s Cron Monitoring feature is currently in open beta, and the limits for the different pricing plans are not known yet. Update: Sentry announced general availability and pricing on January 16, 2024:

  • Free: monitor 1 cron job for free.
  • Paid: $0.78/mo for 1 job.

Monitoring 100 jobs with Sentry would cost $77/mo. Sentry offers sponsored accounts for non-profits, open-source, and students (details).

Timeout-based Schedules

When using timeout-based schedules, the user specifies a period (for example, one hour). The monitored system is expected to “check in” (send an HTTP request to a unique address) at least once per period. When a check-in is missed, the monitoring system declares an outage and notifies you.

This monitoring technique is also sometimes called Heartbeat Monitoring. All four reviewed services support timeout-based schedules.
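For example, a cron job can check in by requesting its unique URL after the task succeeds (the UUID in the URL below is a placeholder):

```
# m h dom mon dow  command
# Nightly backup at 03:00; the check-in fires only if backup.sh succeeds.
0 3 * * * /usr/local/bin/backup.sh && curl -fsS -m 10 --retry 5 https://hc-ping.com/your-check-uuid
```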

Cron Expression Schedules

The user specifies a cron expression (for example, “0/5 * * * *”) and a timezone. The monitoring system calculates expected “check in” deadlines based on the cron expression.

Supported by: Cronitor, Healthchecks.io, Sentry.

Not supported by: Uptime Robot.

Cronitor and Sentry use the croniter library to evaluate cron expressions. Healthchecks.io uses the cronsim library.

Start and Fail Signals

In addition to basic “I’m alive!” check-in messages, monitoring services typically support additional signal types:

  • “job started” signal: allows the measurement of job durations, and alerting when a job takes too long.
  • “job failed” signal: allows the job to explicitly declare itself as failed.

Supported by: Cronitor (docs), Healthchecks.io (docs), Sentry (docs).

Not supported by: Uptime Robot.
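With Healthchecks, for instance, these signals are sent by appending /start or /fail to the check’s ping URL. A wrapper script could look like this (the ping URL and task path are placeholders):

```shell
#!/bin/sh
# Placeholder ping URL; replace with your check's actual URL.
URL=https://hc-ping.com/your-check-uuid

# Send the "job started" signal.
curl -fsS -m 10 --retry 5 "$URL/start" > /dev/null

if /usr/local/bin/nightly-task.sh; then
    # Success: plain check-in.
    curl -fsS -m 10 --retry 5 "$URL" > /dev/null
else
    # Explicit "job failed" signal.
    curl -fsS -m 10 --retry 5 "$URL/fail" > /dev/null
fi
```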

Check-in Via Email

With this feature, clients can “check in” by sending an email message to a job-specific email address. This comes in handy when integrating with services that only support status reports via emails, or when working in restrictive environments where only email is allowed through.

Supported by: Cronitor (docs), Healthchecks.io (docs).

Not supported by: Uptime Robot, Sentry.

Auto-Provisioning

With auto-provisioning clients can perform check-ins for jobs that the monitoring system does not yet know about, and the monitoring service registers the new jobs on the fly. Auto-provisioning is handy in dynamic environments where the set of monitored jobs changes frequently.

Supported by: Cronitor (docs), Healthchecks.io (docs), Sentry (docs).

Not supported by: Uptime Robot.

Client SDKs and API

Cronitor provides a first-party command-line client and SDKs for Java, JavaScript, Kubernetes, PHP, Python, Ruby, and Sidekiq. There are also third-party SDKs for Terraform and .NET.

Healthchecks.io does not provide first-party client SDKs. There are a number of third-party client libraries.

Sentry provides a first-party command-line client and SDKs for Celery, Go, Java, JavaScript, Laravel, Node, PHP, Python, Quartz, Rails, Ruby, and Spring.

Uptime Robot does not provide first-party client SDKs.

All four services provide an HTTP API: Cronitor API docs, Healthchecks.io API docs, Uptime Robot API docs, Sentry API docs.

Notification Methods

Each reviewed service supports a number of different ways to deliver downtime notifications:

Cronitor

  • Free: email, webhooks (only GET requests), MS Teams, Slack, Telegram.
  • Paid: Opsgenie, PagerDuty, SMS, Splunk On-Call.

Healthchecks.io

  • Free: email, webhooks, Discord, LINE Notify, Matrix, Mattermost, MS Teams, Opsgenie, PagerDuty, PagerTree, Pushbullet, Pushover, Rocket.Chat, Signal, Slack, Spike.sh, Telegram, Trello, Splunk On-Call, Zulip.
  • Paid: SMS, voice calls, WhatsApp.

Uptime Robot

  • Free: Android and iOS app, email, Google Chat, Discord, Pushbullet, Pushover, Splunk On-Call.
  • Paid: webhooks, MS Teams, PagerDuty, Slack, SMS, Telegram, voice calls, Zapier.

Sentry

  • Free: email, webhooks.
  • Paid: Amixr, Discord, MS Teams, Opsgenie, PagerDuty, Pushover, Rocket.Chat, Rootly, Slack, Spike.sh, SMS, TaskCall, Threads, Splunk On-Call.

Project Management, User and Team Management, Authentication

Cronitor

Cronitor supports organizing jobs into Environments. Within each environment, jobs can be grouped into groups. Jobs can be annotated with tags.

Cronitor supports multiple team members ($5/mo for each additional user). Team members can have “admin”, “user”, “readonly” roles.

Cronitor supports SAML2 SSO, which costs an extra $5/mo for every team member. Cronitor does not support two-factor authentication.

Healthchecks.io

Healthchecks.io supports organizing jobs into Projects. Jobs can be annotated with tags.

Healthchecks.io supports multiple team members with “owner”, “manager”, “user”, and “read-only” roles.

Healthchecks.io does not support any form of SSO. Healthchecks.io supports two-factor authentication using WebAuthn and using one-time codes (TOTP).

Uptime Robot

Uptime Robot does not support grouping or tagging jobs.

Uptime Robot’s higher-priced plans support multiple team members with “admin”, “read”, and “write” roles.

Uptime Robot does not support any form of SSO. Uptime Robot supports two-factor authentication using one-time codes (TOTP).

Sentry

Sentry supports organizing jobs into Projects and Environments.

Sentry supports multiple team members with “billing”, “member”, “admin”, “manager”, and “owner” roles.

Sentry offers many options for SSO: Google, GitHub, Okta, SAML2, and others. All options except Google and GitHub require the Business ($80/mo) billing plan. Sentry supports two-factor authentication using U2F, one-time codes (TOTP), and recovery codes.

Feature Matrix

(✔ = supported, 💰 = paid feature, — = not supported)

                           | Cronitor | Healthchecks.io | Uptime Robot | Sentry
---------------------------|----------|-----------------|--------------|-------
Business registered in     | 🇺🇸       | 🇱🇻              | 🇲🇹           | 🇺🇸
Servers hosted in          | 🇺🇸       | 🇩🇪              | 🇺🇸           | 🇺🇸
Team size                  | 3        | 1               | 10+          | 200+
Founded in                 | 2014     | 2015            | 2010         | 2012
Jobs in the free plan      | 5        | 20              | 0            | ?
Price/mo for 100 jobs      | $200     | $20             | $34          | ?
Self-hosting possible      | —        | ✔               | —            | ✔
Timeout-based schedules    | ✔        | ✔               | ✔            | ✔
Cron expressions           | ✔        | ✔               | —            | ✔
“start” and “fail” signals | ✔        | ✔               | —            | ✔
Check-in via email         | ✔        | ✔               | —            | —
Auto-provisioning          | ✔        | ✔               | —            | ✔
Client SDKs                | ✔        | —               | —            | ✔
API                        | ✔        | ✔               | ✔            | ✔
Projects                   | ✔        | ✔               | —            | ✔
Team access                | 💰       | ✔               | 💰           | ✔
Single sign-on             | 💰       | —               | —            | 💰
Two-factor authentication  | —        | ✔               | ✔            | ✔
Notify via email           | ✔        | ✔               | ✔            | ✔
Notify via webhooks        | ✔        | ✔               | 💰           | ✔
Notify via Slack           | ✔        | ✔               | 💰           | 💰
Notify via Telegram        | ✔        | ✔               | 💰           | —
Notify via SMS             | 💰       | 💰              | 💰           | 💰

In Closing

If you notice any factual errors, please let me know (contacts), and I will get them fixed ASAP!

There are many more things to compare. If you are shopping for a cron monitoring service, you will have to decide what is important for you, and likely do some additional research.

Happy monitoring,
– Pēteris

Notes on Self-hosted Transactional Email

For a little over two months now, Healthchecks.io has been sending its transactional email (~300,000 emails per month) through its own SMTP server. Here are my notes on setting it up.

The Before

Before going self-hosted, Healthchecks sent email using 3rd-party SMTP relays: AWS SES and later Elastic Email.

The reason for switching from AWS to Elastic Email was GDPR compliance: at the time, the United States did not have an EU adequacy decision, but Canada (the registration country of Elastic Email Inc.) did.

The primary reason I kept looking for alternatives to Elastic Email was also GDPR compliance: a country with an EU adequacy decision is good, but being based in the EU is even better. Another reason was their poor communication during service outages: some outages were not acknowledged on their status page, there were no timely updates via support chat or otherwise, and no post-mortems were published after outages. To their credit, Elastic Email did fix the outages reasonably quickly, and I was overall happy with the service in terms of functionality and pricing.

The EU-based SMTP Relay Options

There are few EU-based SMTP relay services. None of the big names (AWS SES, Sendgrid, Mailgun, Mailchimp, Postmark) are EU-based. I tested a few options:

  • EmailLabs: OK in terms of functionality and pricing. Judging by the mix of Polish and English in the user interface and documentation, it seemed geared primarily toward the Polish market.
  • SMTPeter: OK in terms of functionality and pricing. It was probably just bad timing, but it had a major outage while I was testing it. A small shop.
  • Brevo (formerly Sendinblue): the most prominent EU SMTP relay service. Has open and click tracking enabled by default, and refused to turn it off before seeing live production traffic, so a non-starter for me.

None of the options seemed like an upgrade over what I already had, and I kept circling back to the idea of self-hosting. The common wisdom is that self-hosting email means endless deliverability problems, but maybe-maybe?

The Self-Hosting Options

In May 2023, I spent several weeks researching and trialing self-hosted SMTP servers: mox, Postal, Haraka, Zone MTA, OpenSMTPD, and maddy. My brain was getting fried from jumping between documentation sites, trying to make sense of the feature sets, and the pros and cons of each project. One thing that helped immensely was reading Email explained from first principles – it filled many gaps in my knowledge of email delivery.

Maddy

After experimenting with and strongly considering OpenSMTPD, I ultimately picked maddy. I iterated on a test configuration until I got it to do the required things:

  • Accept email on port 465 from authenticated users.
  • Rewrite its envelope sender from “@healthchecks.io” to “@mail.healthchecks.io” (required for routing bounce messages back to the maddy server).
  • Sign it using the DKIM protocol.
  • Put outgoing messages in a queue, attempt to deliver them, and retry with exponential backoff.
  • Deliver messages to remote MTAs from a single, specific IP.
  • When delivery fails, send a webhook notification to a designated webhook handler. For permanent failures, the handler can take appropriate action – unsubscribe a specific user from email reports, or mark a specific email integration as disabled.
  • Listen for incoming email on port 25.
  • When a remote MTA sends a DSN (delivery status notification, “bounce message”), deliver it to the same webhook.
  • Use an automatically provisioned LetsEncrypt certificate for TLS encryption on port 465 and port 25.

I wrote provisioning scripts for deploying maddy and its configuration to a server. I added and updated the required DNS entries for SPF and DKIM. I implemented, tested, and deployed the webhook handler that would receive bounce notifications from maddy.

IP Warm-Up

I spent several weeks gradually switching outgoing email traffic from Elastic Email to the self-hosted maddy server. IP warm-up serves two purposes:

  • It slowly builds up the reputation of the sending IP address. Switching the entire sending volume to a new IP address all at once risks getting blocked by the receiving servers.
  • It lets me test email delivery in the production environment and fix any potential problems with fewer negative consequences.

The Failover IP Oopsie

One issue I discovered during the IP warm-up phase was that the brand-new mail server (a Hetzner AX41 dedicated server) experienced minute-long network hiccups a couple of times per day. The cause could be a faulty NIC, a faulty switch, or a noisy neighbor, and the easiest fix is ordering another server and hoping for better luck. In anticipation of such a scenario, I had ordered a failover IP so I could keep using the already warmed-up IP with the new server.

I set up a new server, switched the failover IP to it, and after a few days of testing, no more network hiccups! So I went ahead and canceled the original machine. Then, a few days later, around 2 AM local time, my monitoring notifications went off: email delivery was broken. I had assumed that “failover IP” is more or less what other providers call “floating IP.” Dazed and confused in front of a blue screen in the middle of the night, I realized my misunderstanding and mistake: the failover IP is owned by a specific server. Canceling the server also cancels the failover IP with all its sender reputation.

To fix the immediate problem, I temporarily switched the web servers back to using Elastic Email as the SMTP relay. I asked Hetzner support if there was any way I could get the released IP back. Minutes later, I got a reply stating in perfect German calmness that my request would need to be handled by a different department during business hours. The next morning, Hetzner added the lost failover IP back to my account. Phew!

Local Relay-Only MTAs for Reliability

The internet-facing SMTP server (mail.healthchecks.io) runs on a single machine. Each app server also runs a local maddy instance which accepts outgoing email messages from local clients only, and hands them off to mail.healthchecks.io. If mail.healthchecks.io is unavailable (for example, during server restart), the local maddy instances queue the messages and retry them later.

Summary, Pros, and Cons

The self-hosted maddy server has been handling all Healthchecks.io transactional email for over two months. I am keeping an eye on bounce notifications, the outbound email queue size, and blocklists. So far, there have been no significant deliverability issues. Fingers crossed!

Cons of going self-hosted:

  • A self-hosted SMTP server is another service to maintain. It uses up the limited time and mental bandwidth I have.
  • The inevitable deliverability problems will be my problems.
  • In the case of maddy, there is no pre-built graphical management and monitoring dashboard.

And pros:

  • Complete control of subprocessors with access to customer data (just Hetzner in my case).
  • Complete control over server configuration.
  • Fixed direct costs (as long as a single server can keep up with the sending volume).
  • I learned a bunch of new stuff!

Thanks for reading,
Pēteris

New Feature: Check Auto-Provisioning

Healthchecks recently gained a new feature: check auto-provisioning. When you send a ping request to a slug URL, and a check with the specified slug does not exist, Healthchecks can now automatically create the missing check. This feature requires opt-in: to use it, add a ?create=1 query parameter to the ping URL.

Here’s check auto-provisioning in action (the -I parameter tells curl to send HTTP HEAD requests so that we can see HTTP response status codes easily):

$ curl -I https://hc-ping.com/fixme-ping-key/does-not-exist
HTTP/2 404
[...]

$ curl -I https://hc-ping.com/fixme-ping-key/does-not-exist?create=1
HTTP/2 201
[...]

$ curl -I https://hc-ping.com/fixme-ping-key/does-not-exist?create=1
HTTP/2 200 
[...]
  • The first request returns HTTP 404 (“Not Found”) because a check with the slug does-not-exist does not, in fact, exist yet.
  • The second request has a “?create=1” added to the URL to enable auto-provisioning. The server creates a new check and returns HTTP 201 (“Created”).
  • The third request is the same as the second, but a matching check now exists. The server accepts the ping and returns HTTP 200 (“OK”).

When is this useful? Whenever you are working with dynamic infrastructure and want your monitoring clients to be able to register with Healthchecks.io automatically. If you distribute the Ping Key to monitoring clients, each client can pick its own slug (for example, derived from the server’s hostname), construct a ping URL (https://hc-ping.com/<ping-key>/<slug-chosen-by-client>?create=1), and Healthchecks.io will auto-create a new check on the first ping.
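A sketch of what this can look like on the client side (the Ping Key value and the slug naming convention here are made up):

```shell
#!/bin/sh
# Placeholder Ping Key; use your project's actual Ping Key.
PING_KEY=your-ping-key-here

# Derive the slug from this machine's hostname, e.g. "web1-nightly-backup".
SLUG="$(hostname)-nightly-backup"

# The first request auto-creates the check thanks to ?create=1.
curl -fsS -m 10 --retry 5 "https://hc-ping.com/$PING_KEY/$SLUG?create=1"
```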

Auto-Provisioned Checks Use Default Configuration

With the current auto-provisioning implementation, clients can create new checks on the fly, but they cannot yet specify the period, the grace time, the enabled integrations, or any other parameters. New checks are created with default parameters (period = 1 day, grace time = 1 hour, all integrations enabled). If you need to change any parameters, you will need to do so either manually from the web dashboard, or from a script that calls the Management API.
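For example, a script could tighten an auto-created check’s timing via the Update an existing check API call (the API key and UUID below are placeholders; timeout and grace are in seconds):

```shell
curl -s https://healthchecks.io/api/v3/checks/your-check-uuid \
    --header "X-Api-Key: your-api-key" \
    --data '{"timeout": 3600, "grace": 900}'
```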

Auto-Provisioning and Account Limits

Each account has a limit on how many checks it is allowed to create: 20 checks for free accounts, 100 or 1000 checks for paid accounts. To reduce friction and the risk of silent failures, auto-provisioning is allowed to temporarily exceed the account’s check limit, up to two times the limit. That is, if your account is already maxed out, auto-provisioning will still be able to create new checks until you hit twice the limit. If your account goes over the limit, you will start to see warnings in the dashboard and in email:

As soon as you get the number of checks in your account below the limit (either by upgrading to higher limits, or by removing unneeded checks), the warning will go away. If you do not resolve the warning for more than a month, you will start seeing an “Account marked for deletion” notice in the dashboard. After another month of inaction, the account will be deleted.

Slugs and Names Are Now Separate

In the initial slug implementation, check slugs were tied to check names. Changing a check’s name also updated its slug. With the introduction of auto-provisioning, check names and slugs are now decoupled. You can hand-pick a custom slug for each check. You can also rename a check but keep its existing slug.

The “Name and Tags” dialog has gained a new, editable “Slug” field:

Similarly, the Create a Check and Update an Existing Check API calls now support a new slug field.

Happy monitoring,
–Pēteris

Walk-through: Set Up Self-Hosted Healthchecks Instance on a VPS

In this guide, I will deploy a Healthchecks instance on a VPS. Here’s the plan:

  • Use the official Docker image and run it using Docker Compose.
  • Store data in a managed PostgreSQL database.
  • Use LetsEncrypt certificates initially, and load-balancer-managed certificates later for an HA setup.
  • Use an external SMTP relay for sending email.

Prerequisites:

  • A domain name (and access to its DNS settings).
  • A payment card (for setting up a hosting account).
  • Working SMTP credentials for sending emails.

Hosting Setup

For this exercise, I’m using UpCloud as the hosting provider. I’m choosing UpCloud because they are a European cloud hosting provider that I have not used before, and they offer managed databases.

I registered for an account, deposited €10, and launched the cheapest server they offer (1 core, 1GB RAM, €7/mo) with Ubuntu 22.04 as the OS. On the new server, I:

  • Installed OS updates (apt update && apt upgrade).
  • Disabled SSH password authentication.
  • Installed Docker by following the official instructions.
  • Created a non-root user, set up SSH authentication for it, and added it to the “docker” group.

Basic docker-compose.yml

On the server, logged in as the non-root user, I created a docker-compose.yml file with the following contents:

version: "3"

services:
  web:
    image: healthchecks/healthchecks:v2.8.1
    restart: unless-stopped
    environment:
      - DB_NAME=/tmp/hc.sqlite

I then ran docker compose up. The Healthchecks container started up, but I could not access it from the browser yet: it does not expose any ports, it has no domain name, and there is no TLS terminating proxy yet.

Add DNS records, Add caddy, Add ALLOWED_HOSTS, SITE_ROOT

I own a domain name “monkeyseemonkeydo.lv”, and for this Healthchecks instance I used the subdomain “hc.monkeyseemonkeydo.lv”. I created two new DNS records:

hc.monkeyseemonkeydo.lv A 94.237.80.66
hc.monkeyseemonkeydo.lv AAAA 2a04:3542:1000:910:80a5:5cff:fe7f:0a17

(These are of course the IPv4 and IPv6 addresses of the UpCloud server).

In docker-compose.yml I added a new “caddy” service to act as a TLS terminating reverse proxy, and I added ALLOWED_HOSTS and SITE_ROOT environment variables in the “web” service:

version: "3"

services:
  caddy:
    image: caddy:2.6.4
    restart: unless-stopped
    command: caddy reverse-proxy --from https://hc.monkeyseemonkeydo.lv:443 --to http://web:8000
    ports:
      - 80:80
      - 443:443
    volumes:
      - caddy:/data
    depends_on:
      - web

  web:
    image: healthchecks/healthchecks:v2.8.1
    restart: unless-stopped
    environment:
      - ALLOWED_HOSTS=hc.monkeyseemonkeydo.lv
      - DB_NAME=/tmp/hc.sqlite
      - SITE_ROOT=https://hc.monkeyseemonkeydo.lv
volumes:
  caddy:

Note: Caddy needs a persistent “/data” volume for storing TLS certificates, private keys, OCSP staples, and other information.

After running docker compose up again, the site loads in the browser:

Add DEBUG=False and SECRET_KEY

Next, I added DEBUG and SECRET_KEY environment variables. DEBUG=False turns off the debug mode, which should always be off on public-facing sites. SECRET_KEY is used for cryptographic signing and should be set to a unique, secret value. Do not copy the value I used!

environment:
  [...]
  - DEBUG=False
  - SECRET_KEY=b553f395-2aa1-421a-bcf5-d1c1456776d7
  [...]
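One way to generate a suitably random value (a sketch; any long random string works) is with Python’s secrets module:

```shell
# Prints a 43-character URL-safe random string, suitable for SECRET_KEY.
python3 -c "import secrets; print(secrets.token_urlsafe(32))"
```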

Launch PostgreSQL Database, Add Database Credentials

I created a managed PostgreSQL database in the UpCloud account. I selected PostgreSQL 15.1, and the lowest available spec (1 node, 1 core, 2GB RAM, €30/mo). I made sure to select the same datacenter that the web server is in.

After the database server started up, I took note of the connection parameters: host, port, username, password, and database name. Since I was planning to use this database server for the Healthchecks instance and nothing else, I used the default database user (“upadmin”) and the default database (“defaultdb”). Here is the database configuration:

environment:
  [...]
  - DB=postgres
  - DB_HOST=postgres-************.db.upclouddatabases.com
  - DB_PORT=11550
  - DB_NAME=defaultdb
  - DB_USER=upadmin
  - DB_PASSWORD=AVNS_*******************
  [...]

After another docker compose up, I created a superuser account:

docker compose run web /opt/healthchecks/manage.py createsuperuser

I tested the setup by signing in as the superuser:

Configure Outgoing Email

The Healthchecks instance needs valid SMTP credentials for sending email.

For a production site, I would sign up for an SMTP relay service. Since I’m setting this instance up only for demonstration purposes, and the volume of sent emails will be very low, I used the SMTP credentials of my personal mailbox (hosted by Fastmail).

Here are the new environment variables:

environment:
  [...]
  - ADMINS=meow@monkeyseemonkeydo.lv
  - DEFAULT_FROM_EMAIL=meow@monkeyseemonkeydo.lv
  - EMAIL_HOST=smtp.fastmail.com
  - EMAIL_HOST_USER=meow@monkeyseemonkeydo.lv
  - EMAIL_HOST_PASSWORD=****************
  [...]

The ADMINS setting specifies the email addresses that will receive error notifications. DEFAULT_FROM_EMAIL sets the “From:” address for emails sent by this Healthchecks instance.

Disable New User Signups

The new Healthchecks instance currently allows any visitor to sign up for an account. This will be a private instance, so I disabled new user registration via the REGISTRATION_OPEN environment variable:

environment:
  [...]
  - REGISTRATION_OPEN=False
  [...]

Add Pinging by Email

Healthchecks supports pinging (sending heartbeat messages from clients) via HTTP and also via email. To enable pinging via email, I set the PING_EMAIL_DOMAIN and SMTPD_PORT environment variables, and exposed port 25:

environment:
  [...]
  - PING_EMAIL_DOMAIN=hc.monkeyseemonkeydo.lv
  - SMTPD_PORT=25
  [...]
ports:
  - 25:25        

After another docker compose up, I sent a test email and verified its arrival:

Add Logo and Site Name

The default logo image is located at /opt/healthchecks/static-collected/img/logo.png inside the “web” container. To use a custom logo, one can either set the SITE_LOGO_URL environment variable or mount a custom logo over the default one. I used the latter method.

I used an image from the Noto Emoji font as the logo, placed it next to docker-compose.yml on the server, and picked a site name:

environment:
  [...]
  - SITE_NAME=MeowOps
  [...]
volumes:
  - $PWD/logo.png:/opt/healthchecks/static-collected/img/logo.png

The result:

The Complete docker-compose.yml

Putting it all together, here is the complete docker-compose.yml:

version: "3"

services:
  caddy:
    image: caddy:2.6.4
    restart: unless-stopped
    command: caddy reverse-proxy --from https://hc.monkeyseemonkeydo.lv:443 --to http://web:8000
    ports:
      - 80:80
      - 443:443
    volumes:
      - caddy:/data
    depends_on:
      - web

  web:
    image: healthchecks/healthchecks:v2.8.1
    restart: unless-stopped
    environment:
      - ADMINS=meow@monkeyseemonkeydo.lv
      - DEBUG=False
      - ALLOWED_HOSTS=hc.monkeyseemonkeydo.lv
      - DB=postgres
      - DB_HOST=postgres-************.db.upclouddatabases.com
      - DB_PORT=11550
      - DB_NAME=defaultdb
      - DB_USER=upadmin
      - DB_PASSWORD=AVNS_*******************
      - DEFAULT_FROM_EMAIL=meow@monkeyseemonkeydo.lv
      - EMAIL_HOST=smtp.fastmail.com
      - EMAIL_HOST_USER=meow@monkeyseemonkeydo.lv
      - EMAIL_HOST_PASSWORD=****************
      - PING_EMAIL_DOMAIN=hc.monkeyseemonkeydo.lv
      - REGISTRATION_OPEN=False
      - SECRET_KEY=b553f395-2aa1-421a-bcf5-d1c1456776d7
      - SITE_NAME=MeowOps
      - SITE_ROOT=https://hc.monkeyseemonkeydo.lv
      - SMTPD_PORT=25
    ports:
      - 25:25
    volumes:
      - $PWD/logo.png:/opt/healthchecks/static-collected/img/logo.png

volumes:
  caddy:

HA

With the current setup, the web server and the database are both single points of failure. For a production setup, it would be desirable to have as few single points of failure as possible.

The database part is easy, as UpCloud-managed databases support HA configurations. I changed the database plan from 1 node to 2 HA nodes (2 cores, 4GB RAM, €100/mo) and that was that. I did not even need to restart the web container.

The web server part is more complicated: launch a second web server, put a managed load balancer in front of both web servers, and move TLS termination to the load balancer. I updated docker-compose.yml yet again:

version: "3"

services:
  web:
    image: healthchecks/healthchecks:v2.8.1
    restart: unless-stopped
    environment:
      - ADMINS=meow@monkeyseemonkeydo.lv
      - DEBUG=False
      - DB=postgres                         
      - DB_HOST=postgres-************.db.upclouddatabases.com
      - DB_PORT=11550
      - DB_NAME=defaultdb
      - DB_USER=upadmin                         
      - DB_PASSWORD=AVNS_*******************
      - DEFAULT_FROM_EMAIL=meow@monkeyseemonkeydo.lv
      - EMAIL_HOST=smtp.fastmail.com
      - EMAIL_HOST_USER=meow@monkeyseemonkeydo.lv                            
      - EMAIL_HOST_PASSWORD=****************
      - PING_EMAIL_DOMAIN=hc.monkeyseemonkeydo.lv
      - REGISTRATION_OPEN=False
      - SECRET_KEY=b553f395-2aa1-421a-bcf5-d1c1456776d7
      - SITE_NAME=MeowOps
      - SITE_ROOT=https://hc.monkeyseemonkeydo.lv
      - SMTPD_PORT=25
    ports:
      - 10.0.0.2:8000:8000
      - 10.0.0.2:25:25
    volumes:
      - $PWD/logo.png:/opt/healthchecks/static-collected/img/logo.png
  • I removed the “caddy” service since the load balancer will now be terminating TLS.
  • I removed the ALLOWED_HOSTS setting. This was necessary to get the load balancer health checks to work (UpCloud’s load balancer does not send the Host request header).
  • I exposed port 8000 of the “web” service on a private IP that the load balancer will connect to.
  • I updated the port 25 entry to bind only to the private IP.

The following steps are UpCloud-specific, not Healthchecks-specific, so I will only summarize them:

  • I launched a second web server and set it up identically to the existing one.
  • I created a managed load balancer (2 HA nodes, €30/mo).
  • I replaced the “A” and “AAAA” DNS records for hc.monkeyseemonkeydo.lv with a CNAME record that points to the load balancer’s hostname.
  • I configured the load balancer to terminate TLS traffic on port 443, add X-Forwarded-For request headers, and proxy the HTTP requests to the web servers.
  • I configured the load balancer to proxy TCP connections on port 25 to port 25 on the web servers.

Costs

For the single-node setup:

  • Web server: €7/mo.
  • Database: €30/mo.
  • Total: €37/mo.

For the HA setup:

  • Web servers: 2 × €7/mo.
  • Database: €100/mo.
  • Load balancer: €30/mo.
  • Total: €144/mo.

Monitoring, Automation, Documentation

At this point, the Healthchecks instance is up and running and the walk-through is complete. For real-world deployment, also consider the following tasks:

  • Set up uptime monitoring using your preferred uptime monitoring service.
  • Set up CPU / RAM / disk / network monitoring using your preferred infrastructure monitoring service.
  • Set up monitoring for notification delivery.
  • Move secret values out of docker-compose.yml, and store docker-compose.yml under version control.
  • Document the web server setup and update procedures.
  • Automate the setup and update tasks if and where it makes sense.
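For example, the secrets could move into an env file that stays out of version control. A minimal sketch, based on Compose’s env_file option (the file name is illustrative):

```yaml
# docker-compose.yml (sketch): reference secrets from an env file that is
# excluded from version control instead of hardcoding them here
services:
  web:
    image: healthchecks/healthchecks:v2.8.1
    env_file:
      - ./healthchecks.env   # contains DB_PASSWORD=..., SECRET_KEY=..., etc.
```

healthchecks.env would hold the sensitive KEY=value lines and be listed in .gitignore, while the sanitized docker-compose.yml goes into version control.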

Thanks for reading, and good luck in your self-hosting adventures,
–Pēteris

Monitor Disk Space on Servers Without Installing Monitoring Agents

Let’s say you want to get an email notification when the free disk space on your server drops below some threshold level. There are many ways to go about this, but here is one that does not require you to install anything new on the system and is easy to audit (it’s a 4-line shell script).

The df Utility

df is a command-line program that reports file system disk space usage, and is usually preinstalled on Unix-like systems. Let’s run it:

$ df -h /
Filesystem                         Size  Used Avail Use% Mounted on
/dev/mapper/ubuntu--vg-ubuntu--lv   75G   23G   51G  32% /

The “-h” argument tells df to print sizes in human-readable units. The “/” argument tells df to only output stats about the root filesystem. The “Use%” field in the output indicates the root filesystem is 32% full. If we wanted to extract just the percentage, df has a handy “--output” argument:

$ df --output=pcent /
Use%
 32%

We can use tail to drop the first line, and tr to delete the space and percent-sign characters, leaving just the numeric value:

$ df --output=pcent / | tail -n 1 | tr -d '% '
32

The Disk Space Monitoring Script

Here is a shell script that looks up the used disk space on the root filesystem, compares it to a defined threshold value (75 in this example), then performs some action depending on the result:

pct=$(df --output=pcent / | tail -n 1 | tr -d '% ')
if [ $pct -gt 75 ];
then
    # FIXME: the command to run when above the threshold
else
    # FIXME: the command to run when below the threshold
fi

We can save this as a shell script and run it from cron at regular intervals. The script does not yet handle the alerting part, though. That is the missing piece:

Healthchecks.io

Healthchecks.io, a cron job monitoring service, can help with the alerting part:

  • You can send monitoring signals to Healthchecks.io via HTTP requests using curl or wget.
  • Healthchecks.io handles the email delivery (as well as Slack, Telegram, Pushover, and many other options).
  • Healthchecks.io sends notifications only on state changes – when something breaks or recovers. It will not spam you with ongoing reminders unless you tell it to.
  • It will also detect when your monitoring script goes AWOL. For example, when the whole system crashes or loses the network connection.

In your Healthchecks.io account, create a new Check, give it a descriptive name, set its Period to “10 minutes”, and copy its Ping URL.

The monitoring API is super-simple. To signal success (disk usage is below threshold), send an HTTP request to the Ping URL directly:

curl https://hc-ping.com/your-uuid-here

And, to signal failure, append “/fail” at the end of the Ping URL:

curl https://hc-ping.com/your-uuid-here/fail

Let’s integrate this into our monitoring script:

url=https://hc-ping.com/your-uuid-here
pct=$(df --output=pcent / | tail -n 1 | tr -d '% ')
if [ $pct -gt 75 ]; then url=$url/fail; fi
curl -fsS -m 10 --retry 5 -o /dev/null --data-raw "Used space on /: $pct%" $url

The curl call here has a few extra arguments:

  • “-fsS” tells curl to suppress output except for error messages
  • “-m 10” sets a 10-second timeout for HTTP requests
  • “--retry 5” tells curl to retry failed requests up to 5 times
  • “-o /dev/null” sends the server’s response to /dev/null
  • “--data-raw …” specifies a log message to include in an HTTP POST request body

Save this script in a convenient location, for example, in /opt/check-disk-space.sh, and make it executable. Then edit crontab (crontab -e) and add a new cron job:

*/10 * * * * /opt/check-disk-space.sh

Cron will run the script every 10 minutes. On every run, the script will check the used disk space, and signal success (disk usage below or at threshold) or failure (disk usage above threshold) to Healthchecks.io. Whenever the status value flips, Healthchecks.io will send you a notification:

You will also see a log of the monitoring script’s check-ins in the Healthchecks.io web interface:

Closing Comments

If your use case involves handling millions of small files, the filesystem can also run out of inodes (at least on ext4 filesystems). Run df -i to see how many inodes are in use and how many are available. If inode use is a potential concern, you could update the check-disk-space.sh script to track it too.
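For example, the script could check both percentages and signal failure if either crosses the threshold. A sketch, assuming GNU df (ipcent is its inode-usage column) and the same arbitrary threshold of 75:

```shell
# Sketch: check both block and inode usage on the root filesystem.
pct=$(df --output=pcent / | tail -n 1 | tr -d '% ')    # blocks used, %
ipct=$(df --output=ipcent / | tail -n 1 | tr -d '% ')  # inodes used, %
status=ok
if [ "$pct" -gt 75 ] || [ "$ipct" -gt 75 ]; then
    status=fail
fi
echo "Used space on /: $pct%, inodes: $ipct% ($status)"
```

In the real script, you would replace the echo with the curl call from earlier, appending /fail to the Ping URL when status is fail.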

The shell script + Healthchecks.io pattern would work for monitoring other system metrics too. For example, you could have a script that checks the system’s 15-minute load average, the number of files in a specific directory, or a temperature sensor’s reading.
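The load average variant needs one extra trick, since plain sh cannot compare floating-point numbers. A sketch, with 4.0 as an arbitrary threshold:

```shell
# Sketch: the same check-and-signal pattern for the 15-minute load average.
load15=$(cut -d ' ' -f 3 /proc/loadavg)
status=ok
# awk does the floating-point comparison that [ ] cannot:
if awk -v l="$load15" 'BEGIN { exit !(l > 4.0) }'; then
    status=fail
fi
echo "15-minute load average: $load15 ($status)"
```

As before, the real script would append /fail to the Ping URL when status is fail, and send the message as the request body.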

If you are looking to monitor more than a couple of system metrics though, look into purpose-built system monitoring tools such as netdata. The shell script + Healthchecks.io approach works best when you have a few specific metrics you care about, and you want to avoid installing full-blown monitoring agents in your system.

Thanks for reading and happy monitoring,
–Pēteris.

Making HTTP requests with Arduino and ESP8266

A Healthchecks user sent me a code snippet for sending HTTP pings from Arduino. This prompted me to do some Arduino experimenting on my own. I ordered an Arduino Nano 33 IoT board:

Arduino Nano 33 IoT

I picked this board because I wanted an easy entry into Arduino development. As first-party Arduino hardware, it should be easy to get working with the Arduino IDE. It has an on-board WiFi chip, so I would not need to hook up additional WiFi or Ethernet hardware.

The Nano 33 IoT has a micro USB port. After I connected it to my PC running Ubuntu, the board’s power LED lit up, and a /dev/ttyACM0 device appeared on the computer side. Arduino IDE detected the connected board, but my initial attempt to upload a sketch failed. This turned out to be a permissions issue. After I added my OS user to the dialout group, I could upload a “Hello World” sketch to the board:

Sending a Raw HTTP Request

Arduino Nano 33 IoT has an on-board WiFi module. To use it, Arduino provides the WiFiNINA library. The library comes with example code snippets. One of the examples shows how to connect to a WiFi network and make an HTTP request. I adapted it to make an HTTPS request to hc-ping.com:

#include <WiFiNINA.h>
#include "arduino_secrets.h"

char ssid[] = SECRET_SSID;
char pass[] = SECRET_PASS; 
int status = WL_IDLE_STATUS;             
WiFiSSLClient client;

void setup() {
  pinMode(LED_BUILTIN, OUTPUT);
  Serial.begin(9600);
  while (!Serial);

  Serial.print("Connecting ...");
  WiFi.begin(ssid, pass);
  while (WiFi.status() != WL_CONNECTED) {
    delay(500);
    Serial.print(".");
  }

  Serial.print("\nConnected, IP address: ");
  Serial.println(WiFi.localIP());  
}

void ping() {
  Serial.println("Pinging hc-ping.com...");
  if (client.connect("hc-ping.com", 443)) {
    Serial.println("Connected to server.");
    client.println("GET /da840100-3f58-405e-a5ee-e7e6e4303e82 HTTP/1.0");
    client.println("Host: hc-ping.com");
    client.println("Connection: close");
    client.println();
    Serial.println("Request sent.");
  }

  while (client.connected()) {
    while (client.available()) {
      char c = client.read();
      Serial.write(c);
    }
  }
  Serial.println("\nClosing connection.");
  client.stop();
}

void loop() {
  ping();

  // Blink LED for 10 seconds:
  Serial.print("Waiting 10s: ");
  for (int i=0; i<10; i++) {
    Serial.print(".");
    digitalWrite(LED_BUILTIN, HIGH);  
    delay(500);                      
    digitalWrite(LED_BUILTIN, LOW);  
    delay(500);  
  } 
  Serial.println();
}

After uploading this sketch to Arduino, here’s the output on serial console:

Connecting ...
Connected, IP address: 192.168.1.77
Pinging hc-ping.com...
Connected to server.
Request sent.
HTTP/1.1 200 OK
server: nginx
date: Thu, 30 Mar 2023 12:33:25 GMT
content-type: text/plain; charset=utf-8
content-length: 2
access-control-allow-origin: *
ping-body-limit: 100000
connection: close

OK
Closing connection.
Waiting 10s: ..........
Pinging hc-ping.com...
Connected to server.
Request sent.
HTTP/1.1 200 OK
server: nginx
date: Thu, 30 Mar 2023 12:33:41 GMT
content-type: text/plain; charset=utf-8
content-length: 2
access-control-allow-origin: *
ping-body-limit: 100000
connection: close

OK
Closing connection.
Waiting 10s: .......
[...]

Quite impressively, this works over HTTPS out of the box – the WiFiNINA library and the chip take care of performing the TLS handshake and verifying the certificates. All I had to do was specify port 443, and the rest was handled automagically.

ArduinoHttpClient

After getting the minimal example working, I found the ArduinoHttpClient library. It offers a higher-level interface for making GET and POST requests, and for parsing server responses. It works with several different network libraries, including WiFiNINA.

#include <ArduinoHttpClient.h>
#include <WiFiNINA.h>
#include "arduino_secrets.h"

char ssid[] = SECRET_SSID;
char pass[] = SECRET_PASS; 
int status = WL_IDLE_STATUS;             
WiFiSSLClient wifi;
char host[] = "hc-ping.com";
char uuid[] = UUID;
HttpClient client = HttpClient(wifi, host, 443);

void setup() {
  pinMode(LED_BUILTIN, OUTPUT);
  Serial.begin(9600);
  while (!Serial);

  Serial.print("Connecting ...");
  WiFi.begin(ssid, pass);
  while (WiFi.status() != WL_CONNECTED) {
    delay(500);
    Serial.print(".");
  }

  Serial.print("\nConnected, IP address: ");
  Serial.println(WiFi.localIP());  
}

void loop() {
  client.get("/" + String(uuid));
  Serial.print("Status code: ");
  Serial.println(client.responseStatusCode());
  Serial.print("Response: ");
  Serial.println(client.responseBody());

  // Blink LED for 10 seconds:
  Serial.print("Waiting 10s: ");
  for (int i=0; i<10; i++) {
    Serial.print(".");
    digitalWrite(LED_BUILTIN, HIGH);  
    delay(500);                      
    digitalWrite(LED_BUILTIN, LOW);  
    delay(500);  
  } 
  Serial.println();
}

Output in the serial console:

Connecting ...
Connected, IP address: 192.168.1.77
Status code: 200
Response: OK
Waiting 10s: ..........
Status code: 200
Response: OK
Waiting 10s: ..........
[...]

ESP8266

After having good results with Arduino Nano 33 IoT, I wanted to try the same on an ESP8266 board I had lying around:

ESP8266 on a carrier board

This board from AliExpress has a few goodies in addition to the ESP8266 chip: a relay, and multiple powering options (220V AC, 7-12V DC, 5V DC). It has a USB port, but this port is for supplying power only; there is no USB-UART interface on board. There are clearly labeled GND, 5V, RX, TX pins that I can hook a USB-UART converter (also from AliExpress) to:

ESP8266 with a USB-serial converter hooked up

The yellow jumper connects GPIO 0 to ground, which puts the ESP8266 in programming mode. At this point I can plug the USB-UART converter into the PC and check for signs of life using esptool:

$ apt-get install esptool
$ esptool chip_id
esptool.py v2.8
Found 2 serial ports
Serial port /dev/ttyUSB0
Connecting...
Detecting chip type... ESP8266
Chip is ESP8266EX
Features: WiFi
Crystal is 26MHz
MAC: a8:48:fa:ff:15:45
Enabling default SPI flash mode...
Chip ID: 0x00ff1545
Hard resetting via RTS pin...

Arduino IDE does not support ESP8266 chips out of the box, but there is the esp8266/Arduino project, which adds support for various flavors of ESP boards.

esp8266 library in Arduino IDE’s Board Manager view

The esp8266/Arduino project also comes with a WiFi library, which provides an interface to the WiFi functionality on the chip. For simple use cases, the esp8266wifi library is a drop-in replacement for the WiFiNINA library:

#include <ArduinoHttpClient.h>
#include <ESP8266WiFi.h>
#include "arduino_secrets.h"

char ssid[] = SECRET_SSID;
char pass[] = SECRET_PASS; 
WiFiClient wifi;
char host[] = "hc-ping.com";
char uuid[] = UUID;
HttpClient client = HttpClient(wifi, host, 80);

void setup() {  
  pinMode(LED_BUILTIN, OUTPUT);

  Serial.begin(115200);
  while (!Serial);  
  Serial.println();

  WiFi.begin(ssid, pass);

  Serial.print("Connecting ...");
  while (WiFi.status() != WL_CONNECTED) {
    delay(500);
    Serial.print(".");
  }

  Serial.print("\nConnected, IP address: ");
  Serial.println(WiFi.localIP());
}

void loop() {
  client.get("/" + String(uuid));
  Serial.print("Status code: ");
  Serial.println(client.responseStatusCode());
  Serial.print("Response: ");
  Serial.println(client.responseBody());

  // Blink LED for 10 seconds:
  Serial.print("Waiting 10s: ");
  for (int i=0; i<10; i++) {
    Serial.print(".");
    digitalWrite(LED_BUILTIN, HIGH);  
    delay(500);                      
    digitalWrite(LED_BUILTIN, LOW);  
    delay(500);  
  } 
  Serial.println();  
}

Although the esp8266wifi library does support TLS, the documentation also mentions significant CPU and memory requirements. To keep things simple and quick, I went with port 80 and unencrypted HTTP for this experiment.

I uploaded the sketch, removed the yellow jumper, reset the board, and got this output on the serial console:

Connecting ..........
Connected, IP address: 192.168.1.78
Status code: 200
Response: OK
Waiting 10s: ..........
Status code: 200
Response: OK
Waiting 10s: ..........
[...]

Success!

In summary, my first steps in Arduino development left me with positive impressions. The network libraries provide an easy-to-use, high-level interface for working with network hardware. They have uniform interfaces, so they can be used in sketches interchangeably, with minimal code changes. After the initial hump of getting a board recognized by Arduino IDE, and getting the first sketch to upload and run, the development went smoothly. To be fair, the “development” in my case was mostly copying and tweaking code samples. But it was still good!

Happy tinkering,
–Pēteris

I Found a Vulnerability in Croniter

(This was in 2021, and the maintainer fixed it the next day.)

croniter is a Python library for evaluating cron expressions. Given a cron expression and a datetime, it can calculate the nearest next datetime that matches the given cron expression.

I was fuzzing the croniter library with atheris, and found an expression that caused the Python process to eat CPU for about 10 seconds and then crash:

>>> it = croniter("0-1000000000 * * * *", datetime.now())
Killed

When I made the number bigger, I got a MemoryError:

>>> it = croniter("0-1000000000000 * * * *", datetime.now())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/venvs/croniter-fuzzing/lib/python3.8/site-packages/croniter/croniter.py", line 115, in __init__
    self.expanded, self.nth_weekday_of_month = self.expand(expr_format)
  File "/home/user/venvs/croniter-fuzzing/lib/python3.8/site-packages/croniter/croniter.py", line 634, in expand
    return cls._expand(expr_format)
  File "/home/user/venvs/croniter-fuzzing/lib/python3.8/site-packages/croniter/croniter.py", line 584, in _expand
    e_list += (["{0}#{1}".format(item, nth) for item in rng]
MemoryError

And when I made it even bigger, I got an OverflowError:

>>> it = croniter("0-1000000000000000000000000 * * * *", datetime.now())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/venvs/croniter-fuzzing/lib/python3.8/site-packages/croniter/croniter.py", line 115, in __init__
    self.expanded, self.nth_weekday_of_month = self.expand(expr_format)
  File "/home/user/venvs/croniter-fuzzing/lib/python3.8/site-packages/croniter/croniter.py", line 634, in expand
    return cls._expand(expr_format)
  File "/home/user/venvs/croniter-fuzzing/lib/python3.8/site-packages/croniter/croniter.py", line 584, in _expand
    e_list += (["{0}#{1}".format(item, nth) for item in rng]
OverflowError: Python int too large to convert to C ssize_t

Clearly, this version of croniter is missing a range check somewhere, and is attempting to do an O(n) operation with a user-supplied n. How bad is it? It depends on where and how the library is used:

  • If this can be triggered in a web application by a web request, an attacker can mount an easy DoS attack.
  • If a bad expression slips deeper into a scheduling or monitoring system, it could cause crashes or even crash loops in its internal processes.

I reported the issue privately to the maintainer, they acknowledged the issue the same day, and had a fixed version (v1.0.5) out the next day.

I upgraded Healthchecks.io to use the fixed version, and also contacted some of the larger croniter users I could find (Apache Airflow, Sentry.io, Cronitor.io) to let them know about the issue and the fix.

Since then I’ve written a separate, smaller cron expression handling library, cronsim, and switched Healthchecks.io to using it. Some of the other bugs I found by fuzzing croniter were hard to fix due to how the code had evolved. At one point I realized a clean rewrite made more sense.

PS. One extra thing before I go: crontab.guru in Chrome does not like this expression either 🙂

Happy bug hunting,
Pēteris

Using Logs to Troubleshoot Failing Cron Jobs

Let’s say you have a script that works when run in an interactive session, but does not produce expected results when run from cron. What could be the problem? Some potential culprits include:

  • The cron daemon does not see the new job. You added the job definition to a file that the cron daemon does not read.
  • The schedule is wrong. The job will run eventually, but not at your intended schedule.
  • cron’s PATH environment variable is different from the one in your interactive shell, so the script cannot find a binary.
  • cron uses /bin/sh, which may be /usr/bin/dash instead of /usr/bin/bash, and your script relies on a bash-only feature.
  • The script uses relative filesystem paths and is intended to be run from a specific directory. When run by cron, the filesystem paths are all wrong.

Or it could be something else. How to troubleshoot this then, and where to start? Instead of trying fixes at random, I prefer to start by looking at logs:

  • Look at the system logs to see if cron ran the script at all.
  • Inspect the script’s stdout and stderr output for error messages and other clues.

System Logs

To check system logs on modern Linux systems using systemd, use the “journalctl” command:

journalctl -t CRON --since "today"

The “-t CRON” argument tells journalctl to show log entries with the “CRON” tag only. The “--since” parameter accepts timestamps and time durations in various formats. A couple of examples:

journalctl -t CRON --since "30 minutes ago"
journalctl -t CRON --since "2023-01-25"

If cron did run the job, you will find log entries that look like this:

jan 25 15:27:01 foo CRON[511824]: (user) CMD (/home/user/make-backup.sh)
jan 25 15:27:01 foo CRON[511823]: (CRON) info (No MTA installed, discarding output)

The first line shows the command line cron tried to run. The second line shows that the command generated some output, and cron discarded it. If the command completes without producing any output, or if your system has an MTA such as Postfix or sSMTP installed, the second line will be absent.

Checking Script’s Output

If a cron job produces output, cron will attempt to email the command’s output to the email address specified in the MAILTO= line in crontab. For this to work, the system needs to have a configured message transfer agent (MTA). See How to Send Email From Cron Jobs for instructions on how to configure sSMTP as an MTA.
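With an MTA in place, the recipient is set via a MAILTO line at the top of the crontab. A sketch (the address and script path are illustrative):

```
MAILTO=admin@example.com
0 4 * * * /home/user/make-backup.sh
```

cron will then email the job’s output (stdout and stderr) to that address whenever the job produces any.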

Logging to System Logs

If you want to avoid the hassle of setting up a working MTA, a simpler option is to pipe the script’s output to the system log:

0 4 * * * /home/user/make-backup.sh 2>&1 | logger -t backups

Let’s analyze the above line:

  • The cron expression 0 4 * * * means “run this job at 4:00 every day”.
  • /home/user/make-backup.sh is the script we want cron to run.
  • 2>&1 redirects stderr to stdout.
  • |, the pipe character, pipes output (both stdout and stderr, thanks to the redirection) from the previous command into the following command.
  • logger -t backups reads data from stdin and writes it to systemd logs, tagged as “backups”.

After the cron job has run, you can inspect its output with the journalctl utility:

journalctl -t backups --since "today"

To view live logs in the follow mode:

journalctl -t backups -f

Logging to Files

An even simpler option for an ad-hoc debugging session is to write the script’s output to a file. In this example, the log file gets overwritten each time the job runs, so it will only contain the output of the most recent run:

0 4 * * * /home/user/make-backup.sh > /tmp/backups.log 2>&1
  • > (right angle bracket) redirects output to a file.
  • 2>&1 redirects stderr to stdout.

Note: bash has a shorthand for redirecting stdout and stderr to a file: &> /path/to/file. dash does not support it, so it is safer to use the longer form: > /path/to/file 2>&1.

In this example, logs from each run get appended to a single file, and each line is prefixed with a timestamp:

0 4 * * * /home/user/make-backup.sh 2>&1 | ts >> /tmp/backups.log
  • 2>&1 redirects stderr to stdout.
  • The combined output is piped to ts, which prefixes each line with a timestamp.
  • Finally, >> redirects output to a file in append mode.

Note: The “ts” utility may not be installed by default. For Debian-based systems, it is available in the “moreutils” package.

Logging to Healthchecks.io

And yet another option is to forward the logs to Healthchecks.io and view them in the web browser. With this option, you also get external monitoring and notifications when your job does not complete on schedule or exits with non-zero exit status:

0 4 * * * m=$(/home/user/make-backup.sh 2>&1); curl --data-raw "$m" https://hc-ping.com/<uuid>/$?
  • The m=$(...) syntax runs the enclosed command and assigns its output to the variable $m
  • The semicolon separates two consecutive commands (without piping one’s output to the other)
  • curl’s --data-raw "$m" parameter sends the contents of $m in the HTTP request body
  • $? is the exit status of the previous command. We tack it onto the URL so Healthchecks knows if the command finished successfully (exit status 0) or unsuccessfully (exit status > 0).

If the script produces a lot of output, you may get an error:

/usr/bin/curl: Argument list too long

In that case, one workaround is to save the output to a temporary file, then tell curl to read request body from the file:

0 4 * * * /home/user/make-backup.sh > /tmp/backups.log 2>&1; curl --data-binary @/tmp/backups.log https://hc-ping.com/<uuid>/$?

As the command starts to get a little unwieldy, you may also consider replacing curl with runitor, a command runner with Healthchecks.io integration:

0 4 * * * runitor -uuid <uuid> -- /home/user/make-backup.sh 

That is all for now,
Happy scripting,
Pēteris

How Healthchecks Sends Signal Notifications

When a cron job does not run on time, Healthchecks can notify you using various methods. One of the supported methods is Signal messages. Signal is an end-to-end encrypted messenger app run by the non-profit Signal Foundation. Signal’s mobile client, desktop client, and server are free and open-source software (with some exceptions – read on!).

No Incoming Webhooks

Unlike most other instant messaging services, Signal does not provide incoming webhooks for posting messages in chats. If you want to send messages on the Signal network, you must run a full client, and follow all the same cryptographic protocols that normal end-user clients follow. This is inconvenient for the integration developer but makes sense: the main feature of Signal is strong cryptography and sharing as little information as possible with the Signal servers. The servers pass around messages, and help with peer discovery, but (as far as I know) cannot send their own messages on the user’s behalf. Official incoming webhooks would conflict with the overall architecture of the system.

signal-cli

signal-cli is a third-party open-source Signal client. It uses the same Signal client library that the official clients use but offers a programmatic interface for sending and receiving messages. signal-cli supports command-line, DBUS, and JSON-RPC interfaces.

Signal’s official position on the signal-cli client seems to be that they do not support it, but they also have not explicitly banned it. When I asked Signal Support about their stance regarding signal-cli (and also for advice regarding the rate-limit issues discussed below), I got just this short response back:

Due to our limitations as a non-profit organization, we can only provide support for the product we provide. Signal-cli is not provided or maintained by us, therefore we cannot provide any support for it.

Using signal-cli in Healthchecks

I coded the initial Signal integration in January 2021. To send messages, it ran a signal-cli send -m 'text goes here' command for every message. Each send took a minimum of one second, as every signal-cli invocation was initializing the JVM and network connections just to do one small send operation. A more efficient approach was to run signal-cli in daemon mode and talk to it via DBUS or JSON-RPC.

Also in January 2021, I upgraded the integration to talk to signal-cli over DBUS. This took some tinkering to figure out the DBUS interface configuration and to get Python code to talk to it. But it worked, and message delivery was now much quicker.

In December 2021, signal-cli added the JSON-RPC interface, and I switched the Healthchecks integration to it. Again, it took a fair bit of tinkering and support from the signal-cli author until I figured out how it all hangs together, how to read and write messages over a UNIX socket, and how to interpret them. There were two important improvements over the previous DBUS code:

  • Simpler operations: I did not need the DBUS service with its associated configuration files anymore.
  • The Healthchecks project did not need the “dbus-python” dependency anymore.
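For illustration, a JSON-RPC send request to a signal-cli daemon is a single JSON object written to the daemon’s UNIX socket. This sketch only builds and validates the payload; the method and parameter names follow signal-cli’s JSON-RPC interface as I understand it, and the phone number is a placeholder:

```shell
# Sketch: the rough shape of a signal-cli JSON-RPC "send" request.
req='{"jsonrpc": "2.0", "method": "send", "params": {"recipient": ["+371xxxxxxxx"], "message": "test"}, "id": 1}'
# Validate and pretty-print the payload:
echo "$req" | python3 -m json.tool
# To actually send it, the line would be written to the daemon's socket, e.g.:
#   echo "$req" | nc -U /path/to/signal-cli.socket
```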

Rate limiting and CAPTCHAs

Around April 2022 I started to notice that some send operations were failing with an error message asking to solve a CAPTCHA challenge. These errors were infrequent at first and seemed to only affect the very first messages to new recipients. I added code to email me the CAPTCHA challenges, and I added a crude command-line utility to submit the CAPTCHA solutions. As the CAPTCHA challenges came in, I manually solved and submitted them. Signal was using Google reCAPTCHA, and I got plenty of opportunities to demonstrate my intelligence by expertly clicking on fire hydrants, crosswalks, and traffic lights. Sometimes at odd hours, sometimes roadside over a mobile hotspot.

As the frequency of CAPTCHAs gradually increased, I tried to make solving them less annoying:

  • I figured out that being logged in to gmail.com helps a lot with CAPTCHA solving. Usually just a single click, no fire hydrants.
  • I set up my computer to automatically put the CAPTCHA solution in the clipboard.
  • I made a web form for submitting CAPTCHA solutions. No need to fire up the terminal, just click a link in the email, and paste the solution.

Now solving a CAPTCHA challenge took just a few clicks, but the end-user experience was still not great. For some users, Signal notifications would not work until I showed up and solved yet another CAPTCHA. I did some spelunking in the Signal-Server code base. There is a class listing various rate limiters and their parameters. For any rate limiter, I could trace back where and how it was used. But I still could not pinpoint the piece of code that triggers the specific rate-limit errors I was seeing. Signal-Server has an “abusive-message-filter” module, which is private code; perhaps the logic lives there.

It seemed only the initial messages to new recipients were triggering rate-limit errors. After a single message got through, the following messages would work with no issues. So my next idea was to change the Signal integration onboarding flow:

  • After the user has entered the phone number of the Signal recipient, ask them to send a test message
  • If the test message generates a rate-limit error, ask the user to initiate a conversation with us from their side, then try again:

My working theory is that users initiating the conversation with Healthchecks will look less abusive to Signal’s abusive message filter, and will help avoid hitting rate limits. But, if the theory fails and we still get rate-limit errors, at least the users will not create dysfunctional integrations (the “Save Integration” button becomes available only after successfully sending a test message).

In Summary

In summary, Healthchecks uses signal-cli to send Signal messages. It talks to signal-cli over JSON-RPC. To avoid rate limits, it asks the user to send the first message from their end. Building and maintaining the Signal integration has taken more effort than any other integration. But that is fine and, aside from the manual CAPTCHA solving, time well spent. I’m glad Healthchecks supports it, and I’m happy to see that the Signal integration is popular among Healthchecks.io users.

Happy monitoring and messaging,
Pēteris