How Healthchecks.io Sends Webhook Notifications
Webhooks are a powerful way to notify external systems about checks changing state in Healthchecks.io. Webhook notifications are available to all user accounts, paid and free.
Webhooks were the second notification method supported by Healthchecks (the first one was email). The webhook delivery code started as a simple requests.get(user_supplied_url) and evolved from there. Today, the webhook integration in Healthchecks supports:
- HTTP GET, POST, and PUT requests with user-defined request bodies.
- User-defined request headers.
- Placeholder values like $NAME and $STATUS that can be used in the URL, the headers, or the request body.
- Separate webhook configurations for “check goes up” and “check goes down” events.
- Retries when requests time out or return a non-2xx status code.
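As a rough illustration of the placeholder feature from the list above, here is a minimal sketch of how values like $NAME and $STATUS could be substituted into a URL or a request body. The function name replace_placeholders, the context dictionary, and the urlencode flag are assumptions made for this example, not the actual Healthchecks code.

```python
import re
from urllib.parse import quote


def replace_placeholders(template, context, urlencode=False):
    """Replace placeholders such as $NAME in a single pass.

    `context` maps placeholder strings (including the leading "$")
    to their values. Hypothetical helper, for illustration only.
    """
    if not context:
        return template

    # Try longer keys first so "$NAME_JSON" would win over "$NAME".
    keys = sorted(context, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))

    def substitute(match):
        value = str(context[match.group(0)])
        # Percent-encode values that end up inside a URL.
        return quote(value, safe="") if urlencode else value

    return pattern.sub(substitute, template)
```

Doing the substitution in a single pass means a value that itself looks like a placeholder is inserted literally and is not expanded a second time.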
In terms of implementation, none of the above is super complicated. When the user sets up a webhook integration, we collect the webhook configuration. When it is time to send a notification, we assemble the URL, the headers, and the request body, and pass them to our HTTP client library of choice. But two security-related aspects are a little more interesting:
- We want to prevent webhook requests from accessing private IP addresses (10.x.x.x, 192.168.x.x, …).
- Webhook targets can sometimes take a long time to respond. One user’s slow notifications should not block or delay another user’s normal notifications.
Private IP Addresses
Malicious users can set up webhook URLs to tamper with resources in the Healthchecks.io internal network. They can also set up DNS records that resolve to private IP addresses. So it is not enough to inspect webhook URLs for private IP ranges using e.g. regular expressions: the check has to look at the IP address the hostname actually resolves to.
I switched Healthchecks to using pycurl for making outbound HTTP requests. pycurl is a Python wrapper for libcurl, and libcurl lets you specify a CURLOPT_OPENSOCKETFUNCTION callback function. This function receives an IP address after DNS resolution, and can decide whether to connect to it or not.
Healthchecks has a site-wide configuration setting for enabling/disabling webhook requests to private IP addresses. This setting is disabled on the hosted service at Healthchecks.io. Operators of self-hosted Healthchecks instances, on the other hand, sometimes specifically need webhooks to access services running inside their internal network, and they can enable it.
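To make the check concrete, here is a minimal sketch of the decision an open-socket callback could make once it has the resolved IP address. It uses only Python's standard ipaddress module. The function name is_connection_allowed and the allow_private parameter (standing in for the site-wide setting) are assumptions for this example, and the pycurl wiring itself is left out.

```python
import ipaddress


def is_connection_allowed(resolved_ip, allow_private=False):
    """Decide whether to connect to an already-resolved IP address.

    `resolved_ip` is the address as a string, after DNS resolution.
    `allow_private` is a hypothetical stand-in for the site-wide setting.
    """
    if allow_private:
        return True

    address = ipaddress.ip_address(resolved_ip)
    # is_private covers ranges such as 10.0.0.0/8 and 192.168.0.0/16.
    return not address.is_private
```

Because the function looks at the resolved address and not at the URL, a hostname whose DNS record points at 10.x.x.x is rejected the same way as a URL with a literal private IP.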
When migrating Healthchecks from requests to pycurl, I wrote a wrapper for pycurl that mimics the requests API, and thus could be used as a drop-in replacement. It does not cover the full functionality of requests, but it does cover the functionality that Healthchecks uses.
Slow Webhook Targets
Users can set up webhooks to targets that take a long time to respond, and then generate frequent notifications to these targets. Doing so would keep the notification-sending process busy and delay notifications for all other users. Users could do this maliciously, but this could also happen (and has happened) unintentionally.
The first obvious mitigation was to implement a time budget for each webhook delivery: if a webhook delivery (including retries) takes too long, we abort it.
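A time budget of this kind could look roughly like the sketch below. The function name deliver_with_budget, the send callable, and the injectable clock are assumptions for illustration; the post does not show the actual code or the budget value.

```python
import time


def deliver_with_budget(send, budget_seconds, max_retries=2, clock=time.monotonic):
    """Try a delivery until it succeeds, retries run out, or time runs out.

    `send(timeout=...)` is a hypothetical callable that returns an HTTP
    status code or raises TimeoutError. Returns (success, attempts_made).
    """
    deadline = clock() + budget_seconds
    attempts = 0

    while attempts <= max_retries:
        remaining = deadline - clock()
        if remaining <= 0:
            break  # budget exhausted: abort the delivery

        attempts += 1
        try:
            # Never wait longer than what is left of the budget.
            status = send(timeout=remaining)
        except TimeoutError:
            continue

        if 200 <= status < 300:
            return True, attempts

    return False, attempts
```

The budget covers the whole delivery, retries included, so a target that times out repeatedly cannot occupy the sender for longer than the budget allows.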
Another mitigation was to prioritize notifications to integrations with lower historic send times. If we have multiple deliveries lined up, start with the quick ones, and do the slow ones last.
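The ordering idea can be sketched in a few lines. The Delivery class and its avg_send_seconds field are hypothetical names for this example; how Healthchecks actually tracked historic send times is not described here.

```python
from dataclasses import dataclass


@dataclass
class Delivery:
    integration: str
    avg_send_seconds: float  # hypothetical historic average for this integration


def order_by_speed(deliveries):
    """Return pending deliveries with historically fast integrations first."""
    return sorted(deliveries, key=lambda d: d.avg_send_seconds)
```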
The notification sender is implemented as a Django management command (“manage.py sendalerts”). A simple way to increase sending capacity would be to run multiple “sendalerts” processes concurrently. This works, but each process needs at least one database connection. I am not running PgBouncer (and want to delay introducing new infrastructure pieces for as long as possible), so I cannot go too crazy with many concurrent “sendalerts” processes.
A few weeks ago I completed work on another idea to increase the sending capacity. The “sendalerts” process now uses multiple worker threads to send notifications. The worker threads share database connections using a psycopg3 connection pool, which Django recently added support for. There can be more worker threads than database connections available in the pool, but the worker threads are programmed to return DB connections to the pool before potentially long network IO operations, allowing other threads to advance. With an appropriately set worker count, this allows hundreds of in-progress webhook requests while using only a few DB connections.
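The pattern of returning a connection before slow network IO can be sketched with the standard library alone. This is a simplified model under stated assumptions: FakePool stands in for a real psycopg3 pool, and deliver, run, and slow_request are hypothetical names, not the Healthchecks implementation.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager


class FakePool:
    """Stand-in for a DB connection pool; tracks peak concurrent use."""

    def __init__(self, size):
        self._slots = threading.BoundedSemaphore(size)
        self._lock = threading.Lock()
        self.in_use = 0
        self.peak = 0

    @contextmanager
    def connection(self):
        self._slots.acquire()  # blocks while all connections are checked out
        with self._lock:
            self.in_use += 1
            self.peak = max(self.peak, self.in_use)
        try:
            yield object()  # placeholder for a real connection
        finally:
            with self._lock:
                self.in_use -= 1
            self._slots.release()


def deliver(pool, job, slow_request):
    with pool.connection():
        payload = f"payload for {job}"  # e.g. load webhook config from the DB
    # The connection is back in the pool before the slow network call.
    status = slow_request(payload)
    with pool.connection():
        pass  # e.g. record the delivery result in the DB
    return status


def run(jobs, workers, pool_size, slow_request):
    pool = FakePool(pool_size)
    with ThreadPoolExecutor(max_workers=workers) as executor:
        results = list(executor.map(lambda j: deliver(pool, j, slow_request), jobs))
    return results, pool.peak
```

With, say, 8 workers and a pool of 2, all 8 requests can be in flight at once while no more than 2 connections are ever checked out.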
After implementing the worker threads, I removed the prioritization by historic send time. I also increased the timeout value for outbound HTTP requests as now I could afford to! The timeout is currently set to 30 seconds, and Healthchecks retries failed requests up to 2 times. So a single delivery can take up to 3 * 30 = 90 seconds.
Closing Notes
Healthchecks.io now uses the threaded notification sender for delivering all notification types, not just webhooks. Other integration types are sometimes slow too: for example, Signal and MS Teams notifications sometimes take multiple seconds to complete, so the above changes benefit them as well. Webhooks, however, are the riskiest, as they can be fully configured by users.
Thanks for reading,
–Pēteris