Codeship Security Incident’s Impact on Healthchecks.io

Recently, Codeship reported a security incident: a database containing their production data had been exposed for over a year.

I have a Codeship account. I also have a Bitbucket account, and my Codeship account was authorized to access it. In the Bitbucket account, I have a private repository with various secrets (API keys, access tokens) that the Healthchecks.io production environment uses to talk to the database and external services.

I was storing the secrets in Bitbucket in unencrypted form. If leaked, they would allow an attacker to do things like:

  • send emails impersonating Healthchecks.io (bad)
  • send SMS messages from Healthchecks.io number (bad, can get expensive)
  • access Healthchecks.io customers’ billing addresses (very bad)

I have found no evidence that my private Bitbucket account was accessed or any secrets leaked. At the same time, I cannot conclusively prove it was not possible for the secrets to leak.

I debated with myself whether I should write this post. It boils down to a moral dilemma: do I write a full disclosure about what is likely a non-event, and risk a reputation hit? Or do I handle it quietly and just let it pass?

Codeship and Bitbucket

I found Codeship’s security notification on October 2 in my spam folder. According to Codeship, their database was exposed from June 2019 to June 2020. There is evidence that the attacker was actively using data from their database.

I am not using Codeship for anything Healthchecks-related, but I had granted Codeship access to my Bitbucket account for an unrelated project. The grant was required to set up automatic Codeship builds on each Bitbucket commit. Unfortunately, it looks like Codeship asks for way too many permissions:

I wish I had read this carefully and thought about the implications when I was setting this up back in January 2016.

After granting access, Bitbucket gives Codeship an OAuth access token, which ends up in Codeship’s database. Using the OAuth token, Codeship (or its attacker) can access and manipulate the user’s repositories.

After becoming aware of the incident, I revoked Codeship’s access from Bitbucket, and checked all my repositories for any rogue access keys or unexpected changes. I asked Codeship support if the Bitbucket access token could have been exposed. They said:

The Bitbucket token was potentially exposed. I would recommend revoking this token and looking through your repo history for any suspicious activity, although so far we have not heard of any Bitbucket-focused activity.

I asked Bitbucket support for API call logs. They did prepare and send a report, but unfortunately they only have data from the previous 3-4 weeks.

Healthchecks.io Secrets in Bitbucket

In Bitbucket, I have a private repository with deployment scripts. These are Fabric scripts for bootstrapping new machines, deploying and updating software, and various maintenance tasks. The repository also contains a local_settings.py file with all the production API keys and tokens the Healthchecks app needs to run.

Storing unencrypted secrets in version control is, of course, a mistake on my part. I have now fixed this and use Mozilla’s sops to encrypt the secrets. The GPG encryption key is on a Yubikey.

Rotating Secrets

Rotating all secrets was the most time-consuming and stress-inducing part. Most of the secrets I had never rotated before, so I had to figure out a safe update procedure for each.

Database. One of the secrets was the PostgreSQL database password. Access to the database is also restricted at the firewall and the pg_hba level, so the password alone is not enough to access the database. Of course, I still wanted to change the password. The procedure was:

  • Create a secondary database user
  • Switch all services to use the secondary user
  • Update the password of the primary database user
  • Switch all services back to use the primary user

Sounds simple enough, but it involved lots of planning, checking, and double-checking. White knuckles and lots of coffees.

Services that support seamless key rotation: AWS SMTP credentials, Braintree, Matrix, Pushbullet, Sentry, Twilio. The process:

  • Create a secondary API key while the primary key still works
  • Update all services to use the secondary key
  • Promote the secondary key to the primary (or delete the primary, leaving just one)

I updated the keys one-by-one, checking everything after each step. In some cases, I kept the old key active for a couple days to make sure no production system was still using it.

Services that only support resetting the keys in place: Discord, LINE Notify, OpsDash, Pushover, Slack, Telegram. For these, the process is “quick hands.” Regenerate the secret, then update the production machines with the new secret as quickly as possible. Again, going through them one by one and testing everything after each change.

SSL certificates. I had the SSL certificates for healthchecks.io, hc-ping.com, and hchk.io purchased from Namecheap. They do support reissuing certificates and revoking the old ones. One complication here was that, for each hostname, I was using both RSA and ECDSA certificates (provisioned as instructed by Namecheap: issue RSA first, then reissue as ECDSA). To minimize the chance of making a mistake, I decided to purchase and deploy new certificates from a different provider and only afterwards revoke the existing certificates. I discovered SSLMate, and got the certificates from them. I liked the process of ordering certificates from the command line. It’s a big improvement over the usual process of copy-pasting CSRs around. I installed the new certificates, tested the setup thoroughly, and revoked the old certificates a few days later.

There is still one task on my list: Django’s SECRET_KEY setting. I’m investigating the impact of changing it. One issue here is that Healthchecks.io uses SECRET_KEY as one component in a hash function when generating the badge URLs. Naively changing SECRET_KEY would invalidate all existing badge URLs. Coupling badge URL generation and SECRET_KEY was a seemingly small design decision, and now it comes biting me in the rear years later! I will deal with this but wanted to get this post out with no further delay.
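To make the coupling concrete, here is a minimal sketch of how a secret can end up baked into a URL. It is not the actual Healthchecks.io code: the HMAC construction, the URL layout, and the names badge_signature and badge_url are assumptions made for illustration. The only point is that any signature derived from SECRET_KEY changes when the key changes.

```python
import hashlib
import hmac


def badge_signature(secret: str, username: str, tag: str) -> str:
    """Return a short hex signature that depends on the secret (hypothetical scheme)."""
    message = f"{username}:{tag}".encode()
    digest = hmac.new(secret.encode(), message, hashlib.sha256).hexdigest()
    return digest[:8]


def badge_url(secret: str, username: str, tag: str) -> str:
    """Build a badge path that embeds the signature (hypothetical URL layout)."""
    return f"/badge/{username}/{badge_signature(secret, username, tag)}/{tag}.svg"


# Hypothetical key values, for illustration only.
url_before = badge_url("old-secret-key", "alice", "backups")
url_after = badge_url("new-secret-key", "alice", "backups")

# Same user and tag, different secret: the previously published URL no longer matches.
print(url_before)
print(url_after)
```

Every badge URL already embedded in a README or a dashboard corresponds to url_before, so rotating the key naively would break all of them at once.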

Closing Thoughts

As I wrote in the beginning, I have no evidence that any Healthchecks.io secrets have been leaked or used. Still, they should not have been accessible to third parties (Bitbucket and Codeship) in the first place – I’ve now fixed this mistake.

As always, if you have any questions, please write to contact@healthchecks.io.

– Pēteris, Healthchecks.io

About Tracking Cookies on status.healthchecks.io

status.healthchecks.io used to set an “ajs_anonymous_id” tracking cookie. I’m happy to report that it does not do that anymore since September 22, 2020. In this post, I’ll share the process I went through to get the tracking cookie removed.

For powering status.healthchecks.io, I am using a third-party hosted status page provider, Statuspage.io, by Atlassian. I initially set it up in May 2020 and wrote about it on this blog. After the setup, while poking around, I discovered my fancy new status page sets a tracking cookie. It does not ask for the user’s consent, and it does not obey the “DNT” header – when you visit the page, you get a tracking cookie. 

I believe this cookie was only used for innocuous purposes (tracking the number of unique page visitors), but it still invades site visitors’ privacy and violates GDPR requirements. On May 7, I submitted a support ticket asking to remove the tracking cookie and got a reply with a bottom line: “We can’t avoid setting these cookies.” After asking again, I got back a noncommittal “I will forward this to our product team and development team,” and that was that.

I had already invested a significant amount of time setting up automation and custom metrics for the status page. And, aside from the cookie issue, I was generally happy with the product. Before switching providers over this one issue, I wanted to take a crack at fixing it. It was unlikely Atlassian would spend any engineering resources just because a single $29/mo customer had complaints. So I needed to bump up the priority of the issue. I searched around for other Statuspage.io customers and started contacting them. My email template went through several iterations until I got to a version that felt transparent and not manipulative:

Subject: Cookies on status.somedomain.com
Hello,

when I visit status.somedomain.com I see it stores the following cookies in my browser:

* ajs_anonymous_id
* ajs_group_id

These are Atlassian’s tracking cookies. They are not essential, and so under GDPR they require the user’s explicit opt-in before they can be sent to the browser.

I am an Atlassian Statuspage customer myself, and my service’s status page has the exact same problem. I’ve contacted Atlassian about this but this appears to be low priority for them.

I am contacting you because I think more affected customers being aware of the issue and asking Atlassian to fix it = higher chance that they will actually do something.

Thanks,
Pēteris Caune

I started by manually sending ten or so emails out every week. I mostly got sympathetic and cooperative responses. There were some funny ones too. For example, one guy insisted that there is no problem because he could not reproduce the issue using “internal methods.” Me showing him the results of several different cookie scanning services (cookie-script.com, cookiebot.com) did not sway him.

I kept contacting other companies, and they sometimes forwarded me the responses they were getting from Atlassian. From these responses, it didn’t look like we were making much progress. In July, two months in, I decided to amp things up. I grabbed the Majestic Million dataset with the top million websites. I wrote a script that goes through the list, and, for each website, checks if it has an Atlassian-operated “status” subdomain. The script produced an HTML page with filtered results and “mailto:” links, to help me send out the emails. Side note: did you know the “mailto:” links can specify the message body?
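For the curious, here is a small sketch of building a “mailto:” link with a pre-filled subject and body. It is not the script I used; the function name mailto_link is made up for this example. The subject and body go into the query string and must be percent-encoded, with line breaks in the body sent as CRLF.

```python
from urllib.parse import quote


def mailto_link(to: str, subject: str, body: str) -> str:
    """Build a mailto: URL with a pre-filled subject and message body."""
    # Line breaks in the body are encoded as CRLF (%0D%0A).
    crlf_body = "\r\n".join(body.splitlines())
    return "mailto:{}?subject={}&body={}".format(
        quote(to, safe="@"),
        quote(subject, safe=""),
        quote(crlf_body, safe=""),
    )


link = mailto_link(
    "privacy@somedomain.com",
    "Cookies on status.somedomain.com",
    "Hello,\n\nwhen I visit status.somedomain.com I see it stores the following cookies in my browser:",
)
print(link)
```

Clicking such a link opens the mail client with the recipient, subject, and text already filled in, which leaves only a final review and the “Send” button.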

To find email addresses, I found the best way was to look at each website’s privacy policy and search for the “@” symbol. I found typical contact addresses were privacy@somedomain.com and dpo@somedomain.com (where “dpo” stands for Data Protection Officer). On July 26-27, one by one, I sent out emails to around 200 companies.

The wave of new support tickets from various companies worked. Atlassian started communicating back a plan to implement a cookie consent banner in Q1 2021. Later in August, they started saying “late September 2020”. I held off from sending more emails and waited to see what would happen in September.

On September 22, I received an update from Atlassian. Instead of implementing a cookie consent banner, they decided to drop the Page Analytics feature, which was responsible for the tracking cookie. From my point of view, this is the best possible outcome – no tracking cookie and no consent banner. Statuspage.io still has an option of adding a Google Analytics tag. So, there still is a way to track the unique visits for those who need it. 

Thank you, Atlassian / Statuspage.io, for implementing this change. I appreciate it! To my contact at Atlassian support, thank you for your patience. 

To everyone who also contacted Atlassian about the tracking cookies, thank you! It took a team effort, but it worked out in the end!

– Pēteris

Database Fail-over on May 20

On May 20, the primary database server of Healthchecks.io experienced packet loss and latency issues. To ensure normal operation of the service, the database was failed-over to a hot standby. The fail-over went fine, aside from a minor issue with one fat-fingered firewall rule. Below are additional details about the network issue and the fail-over process.

The Network Issue

The symptoms: network latency and packet loss on the database host shoot up for 1-2 minutes, then everything goes back to normal. This had occurred a few times in the past weeks already, causing ping processing delays and subsequent “X is down” alerts each time.

This issue has been hard to troubleshoot because it seemed to happen at random times, and lasted only for a couple minutes each time. I was running a few monitoring tools continuously: Netdata agent including the fping plugin, OpsDash agent, and mtr in a loop logging to a text file. I could also inspect application logs for clues.

To illustrate the elevated latency, here’s how pinging 8.8.8.8 from the host looks normally:

$ ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=57 time=4.93 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=57 time=4.96 ms
64 bytes from 8.8.8.8: icmp_seq=3 ttl=57 time=4.94 ms
64 bytes from 8.8.8.8: icmp_seq=4 ttl=57 time=4.95 ms
(…)

And here’s the same command during the problematic 2-minute window:

$ ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=57 time=104 ms
64 bytes from 8.8.8.8: icmp_seq=3 ttl=57 time=104 ms
64 bytes from 8.8.8.8: icmp_seq=4 ttl=57 time=102 ms
64 bytes from 8.8.8.8: icmp_seq=5 ttl=57 time=104 ms
64 bytes from 8.8.8.8: icmp_seq=6 ttl=57 time=104 ms
64 bytes from 8.8.8.8: icmp_seq=7 ttl=57 time=104 ms
64 bytes from 8.8.8.8: icmp_seq=8 ttl=57 time=104 ms
64 bytes from 8.8.8.8: icmp_seq=9 ttl=57 time=100 ms
64 bytes from 8.8.8.8: icmp_seq=11 ttl=57 time=104 ms
64 bytes from 8.8.8.8: icmp_seq=12 ttl=57 time=104 ms
64 bytes from 8.8.8.8: icmp_seq=13 ttl=57 time=104 ms
64 bytes from 8.8.8.8: icmp_seq=14 ttl=57 time=104 ms
64 bytes from 8.8.8.8: icmp_seq=16 ttl=57 time=104 ms
^C
--- 8.8.8.8 ping statistics ---
16 packets transmitted, 13 received, 18% packet loss, time 14996ms
rtt min/avg/max/mdev = 100.713/104.111/104.729/1.207 ms

Impact

When the latency between app servers and the database suddenly goes from 0.5ms to 100ms, there are going to be issues. The most pressing issue was the processing of incoming pings. On app servers, each received ping is put in a queue. A single worker process (a goroutine, to be exact) takes items from the queue and inserts them in the database. This is a sequential process and the latency to the database puts a limit on how many pings can be processed per second. To illustrate, under normal operation the worker can process all incoming pings without a backlog building up:

May 22 13:06:00 www1 hchk[13236]: 47 pings/s
May 22 13:06:01 www1 hchk[13236]: 105 pings/s
May 22 13:06:02 www1 hchk[13236]: 328 pings/s
May 22 13:06:03 www1 hchk[13236]: 265 pings/s
May 22 13:06:04 www1 hchk[13236]: 108 pings/s
May 22 13:06:05 www1 hchk[13236]: 72 pings/s

During the high-latency period, throughput drops significantly and backlog starts to build up (timestamps in CEST):

May 20 17:39:49 www1 hchk[21015]: 5 pings/s, queued 47, dwell time 2373ms
May 20 17:39:50 www1 hchk[21015]: 5 pings/s, queued 68, dwell time 3358ms
May 20 17:39:51 www1 hchk[21015]: 6 pings/s, queued 104, dwell time 3696ms
May 20 17:39:52 www1 hchk[21015]: 3 pings/s, queued 180, dwell time 4624ms
May 20 17:39:53 www1 hchk[21015]: 5 pings/s, queued 241, dwell time 5268ms
May 20 17:39:54 www1 hchk[21015]: 3 pings/s, queued 268, dwell time 6116ms
May 20 17:39:55 www1 hchk[21015]: 4 pings/s, queued 292, dwell time 6975ms
May 20 17:39:56 www1 hchk[21015]: 5 pings/s, queued 325, dwell time 7819ms
May 20 17:39:57 www1 hchk[21015]: 3 pings/s, queued 340, dwell time 8445ms
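The drop is roughly what a back-of-the-envelope estimate predicts. The sketch below assumes a strictly sequential worker that waits a whole number of database round trips per ping; the number of round trips per ping is an assumption, as is ignoring retransmissions caused by packet loss.

```python
def max_sequential_throughput(round_trip_ms: float, round_trips_per_ping: int = 1) -> float:
    """Upper bound on pings per second for a worker that waits for each round trip."""
    return 1000.0 / (round_trip_ms * round_trips_per_ping)


normal = max_sequential_throughput(0.5)            # 0.5 ms latency
degraded = max_sequential_throughput(100)          # 100 ms latency, one round trip per ping
degraded_two = max_sequential_throughput(100, 2)   # 100 ms latency, two round trips (assumed)

print(normal, degraded, degraded_two)  # 2000.0 10.0 5.0
```

At 100ms the ceiling is around 10 pings per second at best, and fewer if a ping needs more than one round trip or if packets get lost, which is in the same ballpark as the 3-6 pings per second in the log above.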

When the dwell time (the age of the oldest item in the queue) goes above 15 seconds, the app server declares itself unhealthy. When all app servers are unhealthy, clients start getting “HTTP 503 Service Unavailable” errors.
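Here is a minimal sketch of a dwell-time based health check. The real ping handler is written in Go; this Python version, including the class name PingQueue and its methods, is made up for illustration. Only the 15-second threshold comes from the description above.

```python
import time
from collections import deque

MAX_DWELL_SECONDS = 15.0  # threshold from the text above


class PingQueue:
    """FIFO queue that remembers when each item was enqueued (hypothetical helper)."""

    def __init__(self):
        self._items = deque()  # entries are (enqueued_at, payload) tuples

    def put(self, payload, now=None):
        enqueued_at = time.monotonic() if now is None else now
        self._items.append((enqueued_at, payload))

    def get(self):
        return self._items.popleft()[1]

    def dwell_time(self, now=None):
        """Age of the oldest queued item in seconds, or 0.0 for an empty queue."""
        if not self._items:
            return 0.0
        now = time.monotonic() if now is None else now
        return now - self._items[0][0]

    def is_healthy(self, now=None):
        return self.dwell_time(now) <= MAX_DWELL_SECONDS


queue = PingQueue()
queue.put("ping-1", now=100.0)
print(queue.is_healthy(now=110.0))  # oldest item is 10 s old
print(queue.is_healthy(now=116.0))  # oldest item is 16 s old
```

The optional now argument exists only to make the example deterministic; a load balancer health endpoint would call is_healthy() without it.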

Even if the network hiccup is brief, and no client request gets rejected, there is still a negative consequence: there is a delay from the client making an HTTP request, to the ping getting registered in the database. For a check with tight timing, this additional delay can push it over the limit and cause a false alert.

Fail-over

On May 20, after another latency spike, I decided to fail-over the database to a hot standby. I didn’t know what was ultimately causing the network issues, but the hope was that moving to a different physical host would make things better.

The fail-over procedure is simple:

  • make sure there is no replication lag
  • stop the primary
  • promote the hot standby

The application servers didn’t need to be updated or restarted – they knew IP addresses of both database hosts, and could automatically cycle between them until they found one that was alive and was accepting writes. More details on this in PostgreSQL documentation: Specifying Multiple Hosts and target_session_attrs.
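As an illustration, here is a sketch of a libpq-style connection string that lists two hosts and asks for a writable server. The host addresses, database name, and user are placeholders, and the helper multi_host_dsn is made up for this example; the host list and target_session_attrs=read-write are the libpq features referenced above.

```python
def multi_host_dsn(hosts, dbname, user, port=5432):
    """Build a libpq keyword/value connection string with several candidate hosts."""
    return " ".join(
        [
            "host=" + ",".join(hosts),
            "port=" + ",".join(str(port) for _ in hosts),
            f"dbname={dbname}",
            f"user={user}",
            # Only accept a server that allows writes, i.e. skip hot standbys.
            "target_session_attrs=read-write",
        ]
    )


# Placeholder addresses and names, not the production values.
dsn = multi_host_dsn(["10.0.0.1", "10.0.0.2"], dbname="hc", user="hc")
print(dsn)
```

A libpq-based client given such a string tries the hosts in order and keeps the first connection that accepts writes, so promoting the standby requires no configuration change on the application side.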

The only issue I ran into during the fail-over was with a firewall rule denying one of the app servers access: I had mistyped port “5432” as “4322”, which was an easy fix.

Preparing a New Hot Standby

The service was back to normal operation, but the database host was now lacking a hot standby. Luckily, I had a fresh system (sporting Ryzen 3700X) already ordered and waiting on the sidelines. I used provisioning scripts to set up a hot standby on it, but checking and double-checking everything still took close to an hour.

Contacting Hetzner Support

I tried contacting Hetzner Support about the packet loss and latency jump. I provided mtr reports (which they always ask for), and ping logs showing latency. After some back-and-forth, ultimately the response I got was “There is and was no general network issue known. Also we do not see any packet loss from here.”

To be fair, this could have been an issue with the hardware or software on the physical host – I can’t say for sure.

Workarounds: Parallel Processing, Batching, Dedicated Queue?

As mentioned earlier, each app server processes pings one-by-one, over a single database connection. When a single operation blocks, the whole queue gets stuck. What if we used a connection pool? Or processed pings in batches – say, 100 at a time? Or added a dedicated queue component to the mix?

These are all interesting ideas worthy of consideration. I’ve already done some work on the batch processing idea and have promising results. That being said, packet loss would be problematic even with these workarounds. It would need to be solved anyway.
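To show why batching helps with latency, here is a minimal sketch of draining a queue in chunks. It is an illustration only, not the batch-processing work mentioned above: the function drain_in_batches and the insert_batch callback are hypothetical, and the sketch assumes a whole batch can be written in a single database round trip.

```python
from collections import deque


def drain_in_batches(queue: deque, insert_batch, batch_size: int = 100) -> int:
    """Empty the queue, calling insert_batch once per chunk; return the number of calls."""
    calls = 0
    while queue:
        size = min(batch_size, len(queue))
        batch = [queue.popleft() for _ in range(size)]
        insert_batch(batch)  # assumed to cost one round trip per batch
        calls += 1
    return calls


backlog = deque(range(250))   # 250 queued pings
written = []                  # stand-in for the database
round_trips = drain_in_batches(backlog, written.append)
print(round_trips, [len(batch) for batch in written])
```

With 100ms of latency, three round trips cost about 0.3 seconds, whereas inserting the same 250 pings one by one would take about 25 seconds.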

As always, I will continue to look for opportunities to make the service more robust.

Pēteris,
Healthchecks.io

Healthchecks.io Status Page Facelift

The Healthchecks.io system status page at status.healthchecks.io recently received a revamp. Here are my notes on the new version.

First up, the components section shows the current and historic status of components:

Dashboard shows the status of the main website, healthchecks.io. “Operational” state here means the website responds to HTTP requests, and has a working connection to the PostgreSQL database. Checkly, an external uptime monitoring service, monitors the website and automatically updates the component’s status via Statuspage.io API. Checkly has powerful and flexible webhook notifications which makes this possible.

Ping API shows the status of the ping endpoint, hc-ping.com. “Operational” state means hc-ping.com is responding to HTTP requests and is inserting pings in the database with no excessive delay. Although the ping endpoint and the main website run on the same physical servers, they use different software. So it makes sense to monitor them separately. The status of this component is updated automatically by Checkly, same as the dashboard.

Notification Sender shows the status of the background process sending out notifications. Status updates of this component are not automated yet.


Below the components is the “System Metrics” section with four metrics.

  • Processed pings is the number of valid ping requests (valid UUID, not rate limited) processed per second.
  • Queued incoming pings is the number of pings that have been received but not yet inserted in the database. A spike suggests either a database problem, or a connectivity issue inside the data center.
  • Notifications sent is the number of notifications sent per minute.
  • Queued outgoing notifications is the number of scheduled notifications waiting to be sent out. A growing number means either the notification sender is not working, or it cannot keep up.

I used the following criteria for picking the metrics to show:

  • The metric should tell something useful about the system.
  • The metric should be simple to explain. For example, I internally track a few different “queue dwell time” metrics. They are useful, but it would be hard to explain what they mean precisely, and how to interpret them.
  • The metric should be computationally inexpensive to measure. It should not require a heavy database query.

I considered several ways of measuring, aggregating and submitting the metrics. I ultimately went with:

  • Each web server exposes a metrics endpoint that an external system can scrape. Here’s a git commit where I added one of the endpoints.
  • On an external host, a script runs once per minute (via cron of course). It scrapes metrics data from each web server, then processes and submits it to Statuspage.io using their Metrics API. The script is less than 100 lines long.
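A rough sketch of the scrape-and-aggregate part of such a collector follows. It is not the actual script: the JSON response format, the metric names, and summing the values across servers are all assumptions, and the submission to Statuspage.io is left out.

```python
import json
from urllib.request import urlopen


def fetch_json(url: str, timeout: float = 5.0) -> dict:
    """Fetch one server's metrics, assuming the endpoint returns a flat JSON object."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def collect(urls, fetch=fetch_json) -> dict:
    """Sum each metric across all servers, skipping servers that cannot be reached."""
    totals = {}
    for url in urls:
        try:
            metrics = fetch(url)
        except OSError:  # URLError and socket timeouts are OSError subclasses
            continue
        for name, value in metrics.items():
            totals[name] = totals.get(name, 0) + value
    return totals


# Stand-in for real HTTP calls, with hypothetical hosts and metric names.
def fake_fetch(url):
    if "www3" in url:
        raise OSError("connection refused")
    return {"queued_pings": 2, "processed_pings": 120}


print(collect(["http://www1/metrics", "http://www2/metrics", "http://www3/metrics"], fake_fetch))
```

Passing the fetch function as a parameter keeps the aggregation logic testable without network access.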

If you notice gaps in metrics graphs, it could be because the external metrics collector has failed. There are ways to make the metrics collection more robust, but the current simple setup seems to work fine for now.


The final feature in the status page is Incidents. Currently I have not automated incident creation in any way. The plan is to manually open an incident when I become aware of it, and backdate it as makes sense. To test out the Incidents feature, I backfilled a couple past incidents. For example, Delayed notifications on February 7.

And that is all for now. I hope the new status page does not need to be used often! I will also keep posting outage notifications to @healthchecks_io on Twitter.

Happy monitoring and meta-monitoring,
Pēteris,
Healthchecks.io

Incident Report – 7 February 2020

On February 7, Healthchecks.io experienced an issue with sending notifications. An invalid cron expression slipped into the system, which caused the notification sending jobs to crash and restart in a loop. Timeline (all times are in UTC, and from February 7):

  • 0:37: a check with a bad cron schedule gets created via API
  • 0:41: the check receives its first ping
  • 0:42: one minute later, the notification senders go into a crash-restart loop
  • 1:02: external monitoring alerts go out
  • 7:00: I wake up and find out about the outage
  • 7:28: The invalid cron expression is located and fixed, notification sending resumes
  • 7:52: I post a tweet about the outage
  • 10:00: Deployed mitigations for the “sendalerts” process repeatedly crashing, and stricter cron expression validity checks

This outage started in the middle of the night (2:42 AM local time), so it took several hours until I found out about it and could jump on fixing it. During this time, Healthchecks.io was not sending out any notifications (all types: emails, webhooks, Slack alerts, …). On the positive side, the web dashboard, the ping endpoints, the API and the badges were working normally.

After fixing the bad cron schedule, the notification senders resumed work and quickly went through the backlog of unsent notifications:

When notification sending resumed, Healthchecks.io sent out notifications for all checks that had flipped their state once (from “up” to “down”, or from “down” to “up”) during the outage. Unfortunately, it would have missed cases where a check flips twice (for example, “up” → “down” → “up”) during the outage window. If a check went down but came right back up during the outage window, Healthchecks.io missed it and didn’t send a notification.

The Root Cause

The “sendalerts” crash loop was tripping on the following cron schedule: “0 0 31 2 *”. Or, in human words, “at midnight of every February 31st”. The notification sender was crashing while calculating the next expected ping time for this schedule.

The Fix

  1. To get around the immediate crashing problem, I manually edited the problematic cron schedule
  2. In the “sendalerts” management command I added a mitigation for repeatedly crashing on the exact same check. With the mitigation, “sendalerts” postpones the problematic check for 1 hour, so it can process other checks in the meantime.
  3. I added an extra validation step for cron expressions. Healthchecks now makes sure it can calculate a valid “next ping time” for a cron expression before allowing it into the system.
  4. When the outage started, I received monitoring alerts from three different services. All three alerts went to email, and I didn’t notice them until the morning. I’ve now updated notification settings to also receive Pushover notifications with the “Emergency” priority. These notifications override phone’s Do Not Disturb settings and repeat until acknowledged.
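To illustrate the kind of expression that caused the trouble, here is a simplified sketch that detects schedules whose date can never occur. It is not the validation Healthchecks performs (which tries to calculate the next ping time). This sketch only looks at the day-of-month and month fields, supports only “*” or a single number in each, and ignores the day-of-week field.

```python
import calendar


def has_valid_date(day_field: str, month_field: str) -> bool:
    """Return True if some calendar date matches the day-of-month and month fields."""
    days = range(1, 32) if day_field == "*" else [int(day_field)]
    months = range(1, 13) if month_field == "*" else [int(month_field)]
    for month in months:
        # 2024 is a leap year, so February 29 counts as a valid date.
        last_day = calendar.monthrange(2024, month)[1]
        if any(1 <= day <= last_day for day in days):
            return True
    return False


def can_ever_run(expression: str) -> bool:
    """Check a five-field cron expression for an impossible day/month combination."""
    _minute, _hour, day, month, _weekday = expression.split()
    return has_valid_date(day, month)


print(can_ever_run("0 0 31 2 *"))  # February 31st
print(can_ever_run("0 0 29 2 *"))  # February 29th
```

A check like this catches “0 0 31 2 *” up front, before the expression reaches the code that calculates deadlines.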

I apologize to all Healthchecks.io users for any inconvenience caused.

– Pēteris Caune,
Healthchecks.io

Comparison of Cron Monitoring Services (January 2020)

If you are looking for a hosted cron job monitoring service, good news: there are many options to choose from! In this post I’m comparing a selection of the more popular ones: Cronitor, Healthchecks.io, Cronhub, Site24x7, CronAlarm, PushMon and Dead Man’s Snitch.

How I picked the services for comparison: I searched for “cron monitoring” on Google and picked the top results in their order of appearance. I was looking specifically for hosted, SaaS-style services.

Disclaimer: I run one of the services being compared, so I’m a biased source. In particular, choosing the axes of comparison is subjective, and of course I’m inclined to choose metrics that would make my service look good. When in doubt, do your own research!

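                                   Cronitor  Healthchecks.io  Cronhub  Site24x7  CronAlarm  PushMon  Dead Man’s Snitch
Timeout-based schedules            yes       yes              yes      no        yes        yes      yes
Cron expression schedules          yes       yes              yes      yes       no         partial  no
/fail endpoint                     yes       yes              yes      no        yes        yes      yes
/start endpoint                    yes       yes              yes      yes       yes        no       partial
Pinging via email                  no        yes              no       no        no         no       yes

Team Features
Projects                           no        yes              no       yes       no         yes      yes
Teams                              yes       yes              yes      yes       no         no       yes

Notification Methods
Email                              yes       yes              yes      yes       yes        yes      yes
Webhooks                           yes       yes              yes      yes       yes        yes      yes
Slack                              yes       yes              yes      yes       yes        yes      yes
SMS

Price per Month
For 1 cron job                     free      free             free     $10       free       free     free
For 10 cron jobs                   $24       free             $49      $10       $5         $15      $19
For 20 cron jobs                   $24       free             $49      $10       $20        $25      $19
For 40 cron jobs                   $24       $20              $49      $10       $20        $100     $19
For 80 cron jobs                   $79       $20              $99      $10       $20        $100     $19

Authentication Methods
Username and password              yes       yes              yes      yes       yes        yes      yes
Google or Github
SSO (SAML2)

Company Metrics
Years in business                  5         4                2        13        5          8        8
Head count                         2         1                1        ?         ?          ?        ?
Popularity in Slack App Directory  4         4                8                                      5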

Timeout-based Schedules

Also called “simple” monitors, where the user specifies a period (for example, one hour). The client system must “check in” at least every period, otherwise the monitoring system raises an alert.

Timeout-based schedules are supported by every reviewed service except Site24x7.
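As a sketch of the logic behind a timeout-based schedule (the function name is_late is made up, and the grace time parameter is an assumption; not every service necessarily has one):

```python
from datetime import datetime, timedelta


def is_late(last_check_in: datetime, period: timedelta, grace: timedelta, now: datetime) -> bool:
    """Return True once the client has missed its check-in deadline."""
    deadline = last_check_in + period + grace
    return now > deadline


last = datetime(2020, 1, 15, 12, 0)
period = timedelta(hours=1)
grace = timedelta(minutes=5)

print(is_late(last, period, grace, now=datetime(2020, 1, 15, 13, 4)))  # within the grace time
print(is_late(last, period, grace, now=datetime(2020, 1, 15, 13, 6)))  # deadline missed
```

The monitoring service runs a check like this periodically for every monitored job and raises an alert the first time it returns True.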

Cron Expression Schedules

The user specifies a cron expression (for example, “0/5 * * * *”) and a timezone. The monitoring system calculates expected “check in” deadlines based on the cron expression.

Supported by: Cronitor, Healthchecks.io, Cronhub, Site24x7.
Partially supported by: PushMon (uses non-standard syntax).
Not supported by: CronAlarm, Dead Man’s Snitch.

Fail Endpoint

The ability to explicitly signal a failure. This allows quicker failure notifications, without waiting for the configured timeout and grace time to run out.

Supported by: Cronitor, Healthchecks.io, Cronhub, CronAlarm, PushMon, Dead Man’s Snitch.
Not supported by: Site24x7.

Start Endpoint

The ability to signal when a cron job execution starts. This enables the measurement and monitoring of the job’s run time.

Supported by: Cronitor, Healthchecks.io, Cronhub, Site24x7, CronAlarm.
Partially supported by: Dead Man’s Snitch (job’s runtime is reported on completion by a wrapper script).
Not supported by: PushMon.

Pinging Via Email

With this feature, clients can report their status by sending an email. This comes in handy when integrating with services that support email status reports and nothing else.

Supported by: Healthchecks.io, Dead Man’s Snitch
Not supported by: Cronitor, Cronhub, Site24x7, CronAlarm, PushMon.

Projects

The ability to organize monitored cron jobs and their associated information by project. Projects don’t matter much when you have only a few cron jobs to monitor, but become increasingly important as the account’s size grows.

Supported by: Healthchecks.io, Site24x7 (“monitor groups”), PushMon (tagging), Dead Man’s Snitch (“cases”).
Not supported by: Cronitor, Cronhub, CronAlarm.

Teams

The ability to give your team members limited access to your account. The alternative would be to use a shared single-user account, which is of course not ideal.

Supported by: Cronitor, Healthchecks.io, Cronhub, Site24x7, Dead Man’s Snitch.
Not supported by: CronAlarm, PushMon.

Notification Methods

Notifications to email, Slack and custom webhooks are supported by all reviewed services. The support for other notification methods varies from service to service:

Email
Discord
Matrix
MS Teams
OpsGenie
PagerDuty
PagerTeam
PagerTree
Pushbullet
Pushover
Slack
SMS
Telegram
Trello
Twitter DM
VictorOps
Webhooks
WhatsApp

Authentication Methods

All reviewed services support classic authentication using username and password. Some of the services offer additional options:

  • Cronitor supports single-sign-on (SSO) using the SAML2 standard.
  • Healthchecks.io supports signing in via one-time sign-in links sent by email.
  • Cronhub supports authentication using Github.
  • Site24x7, as a part of Zoho, supports a variety of single-sign-on options.

Years in Business

Site24x7 is the oldest company in the group with 13 years in business. Dead Man’s Snitch and PushMon are the second oldest, dating from 2011-2012. Cronitor, Healthchecks.io and CronAlarm were founded in 2014-2015. Cronhub is the youngest with 2 full years in business.

Head Count

Company size is a double-edged sword. On one hand, larger companies seem like the safer option: they are less likely to shut down, and are more likely to have 24/7 staff monitoring their operations.

On the other hand, the smaller companies may have only a few people manning the systems, but they are passionate and committed. From my personal experience, when reporting problems to smaller companies, I’ve often had the issues fixed and a personal reply from a co-founder in hours.

The exact company size is usually not public information and I have only a few data points here:

  • Cronitor was started by two cofounders, August and Shane. I don’t know for sure but assume they are still a team of two.
  • Healthchecks.io is a one-man-band (the one man being me, Pēteris)
  • Cronhub is also a one-man-band, Tigran.

Popularity in Slack App Directory

Slack App Directory appears to be showing the apps by their popularity, and so can be used as an indirect way to compare real-world usage of the different services. I skimmed through the Developer Tools category and noted the positions:

  • Healthchecks.io and Cronitor are close by in search results on page 4.
  • Dead Man’s Snitch is on page 5.
  • Cronhub is on page 8.

In Closing

If you notice any factual errors, please let me know, and I’ll get them fixed ASAP!

There are many more things to compare (Do they have an API? Which country are they based in? What has their historic uptime been like? Which one has the prettiest landing page? …), but I decided to stop here. If you are shopping for a cron monitoring service, you will have to decide what is important for you, and likely do some additional research.

Happy monitoring,
– Pēteris

Fighting Packet Loss with Curl

One class of support requests I get at Healthchecks.io is about occasional failed HTTP requests to ping endpoints (hc-ping.com and hchk.io). Following an investigation, the conclusion often is that the failed requests are caused by packet loss somewhere along the path from the client to the server. The problem starts and ends seemingly at random, presumably as network operators fix failing equipment or change the routing rules. This is mostly opaque to the end users on both ends: you send packets into a “black hole” and they come out at the other end – and sometimes they don’t.

One way to measure packet loss is using the mtr utility:

$ mtr -w -c 1000 -s 1000 -r 2a01:4f8:231:1b68::2
Start: 2019-10-07T06:25:42+0000
HOST: vams                              Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- ???                               100.0  1000    0.0   0.0   0.0   0.0   0.0
  2.|-- ???                               100.0  1000    0.0   0.0   0.0   0.0   0.0
  3.|-- 2001:19f0:5000::a48:129           46.3%  1000    1.7   2.2   1.0  28.9   2.5
  4.|-- ae4-0.ams10.core-backbone.com      4.0%  1000    1.1   2.5   1.0  43.7   4.6
  5.|-- ae16-2074.fra10.core-backbone.com  4.4%  1000    6.8   8.3   6.4  57.7   5.6
  6.|-- 2a01:4a0:0:2021::4                 4.7%  1000    6.8   6.8   6.4  26.4   2.2
  7.|-- 2a01:4a0:1338:3::2                 4.5%  1000    6.7  12.0   6.5 147.7  16.7
  8.|-- core22.fsn1.hetzner.com            4.4%  1000   11.6  16.4  11.4  84.9  14.2
  9.|-- ex9k1.dc14.fsn1.hetzner.com        5.2%  1000   16.7  12.4  11.5  47.4   3.5
 10.|-- 2a01:4f8:231:1b68::2               5.2%  1000   12.1  11.7  11.5  31.3   1.7

The command line parameters used:

  • -w: Puts mtr into wide report mode. When in this mode, mtr will not cut hostnames in the report.
  • -c 1000: The number of pings sent to determine both the machines on the network and the reliability of those machines. Each cycle lasts one second.
  • -s 1000: The packet size used for probing. It is in bytes, inclusive IP and ICMP headers.
  • -r: Report mode. mtr will run for the number of cycles specified by the -c option, and then print statistics and exit.

The last parameter is the IP address to probe. You can also put a hostname (e.g. hc-ping.com) there. The above run shows a 5.2% packet loss from the host to one of the IPv6 addresses used by Healthchecks.io ping endpoints. That’s above what I would consider “normal”, and will sometimes cause latency spikes when making HTTP requests, but the requests will still usually succeed.

Packet loss cannot be completely eliminated: there are always going to be equipment failures and human errors. Some packet loss is also allowed by the IP protocol’s design: when a router or network segment is congested, it is expected to drop packets.


I’ve been experimenting with curl parameters to make it more resilient to packet loss. I learned that with enough brute force, curl can get a request through fairly reliably even at 80% packet loss levels. The extra parameters I’m testing below should not be needed, and in an ideal world the HTTP requests would just work. But sometimes they don’t.

For my testing I used iptables to simulate packet loss. For example, this incantation sets up 50% packet loss:

iptables -A INPUT -m statistic --mode random --probability 0.5 -j DROP    

Be careful when adding rules like this one over SSH: you may lose access to the remote machine. If you do add the rule, you will probably want to remove it later:

iptables -D INPUT -m statistic --mode random --probability 0.5 -j DROP

I made a quick bash script to run curl in a loop and count failures:

errors=0
start=`date +%s`

for i in {1..20}
do
    echo -e "\nAttempt $i\n"
    # This is the command we are testing:
    curl --retry 3 --max-time 30 https://hc-ping.com/6e1fbf8f-c17e-4749-af44-0c81461bdd19
    if [ $? -ne 0 ]; then
        errors=$((errors+1))
    fi
done

end=`date +%s`
echo -e "\nDone! Attempts: $i, errors: $errors, ok: $(($i - $errors))"
echo -e "Total Time: $((end-start))" 

For the baseline, I used the “--retry 3” and “--max-time 30” parameters: curl will retry transient errors up to 3 times, and each attempt is capped at 30 seconds. Without the 30 second limit, curl could sit for hours waiting for missing packets.

Baseline results with no packet loss:

👍 Successful requests: 20
💩 Failed requests: 0
⏱️ Total time: 4 seconds

Baseline results with 50% packet loss:

👍 Successful requests: 20
💩 Failed requests: 0
⏱️ Total time: 2 min 4 s

Baseline results with 80% packet loss:

👍 Successful requests: 13
💩 Failed requests: 7
⏱️ Total time: 17 min 43 s

Next, I increased the number of retries to 20, and reduced the time-per-request to 5 seconds. The idea is to fail quickly and try again, rinse and repeat:

curl --retry 20 -m 5 https://hc-ping.com/6e1fbf8f-c17e-4749-af44-0c81461bdd19

When using the --retry parameter, curl delays the retries using an exponential backoff algorithm: 1 second, 2 seconds, 4 seconds, 8 seconds, … This test was going to take hours, so I added an explicit fixed delay:

curl --retry 20 -m 5 --retry-delay 1 https://hc-ping.com/6e1fbf8f-c17e-4749-af44-0c81461bdd19
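
As a back-of-the-envelope check on why the default backoff gets so slow: according to the curl man page, the delay starts at one second and doubles on every retry until it reaches 10 minutes, where it stays. The Python sketch below (the function name is made up for illustration) adds up the waiting time for one request in the worst case, where every attempt fails:

```python
def total_backoff_seconds(retries, first_delay=1, cap=600):
    """Sum the pauses between attempts if every attempt fails.

    Assumes the behaviour described in the curl man page: wait 1 second
    before the first retry, double the wait each time, and stop growing
    once the wait reaches 10 minutes (600 seconds).
    """
    total = 0
    delay = first_delay
    for _ in range(retries):
        total += delay
        delay = min(delay * 2, cap)
    return total


print(total_backoff_seconds(3))   # 1 + 2 + 4 = 7
print(total_backoff_seconds(20))  # 7023
```

With 20 retries the pauses alone can add up to 7023 seconds, almost two hours for a single request, before counting the time spent in the attempts themselves. With a fixed 1 second delay the same worst case is 20 seconds of pauses.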

Fast retries with 1 second retry delay and 80% packet loss:

👍 Successful requests: 15
💩 Failed requests: 5
⏱️ Total time: 18 min 18 s

Of the 5 errors, in 3 cases curl simply ran out of retries, and in 2 cases it aborted with the “Error in the HTTP2 framing layer” error. So I tried HTTP/1.0 instead. To make the results more statistically significant, I also increased the number of runs to 100:

curl -0 --retry 20 -m 5 --retry-delay 1 https://hc-ping.com/6e1fbf8f-c17e-4749-af44-0c81461bdd19

Fast retries over HTTP/1.0 with 80% packet loss:

👍 Successful requests: 98
💩 Failed requests: 2
⏱️ Total time: 51 min 3 s

For good measure, I ran the baseline version again, now with 100 iterations. Baseline results:

👍 Successful requests: 75
💩 Failed requests: 25
⏱️ Total time: 60 min 22 s

Summary: in a simulated 80% packet loss environment, the “retry early, retry often” strategy clearly beats the default strategy. It would likely reach 100% success rate if I increased the number of retries some more.

Forcing HTTP/1.0 prevents curl from aborting prematurely when it hits the “Error in the HTTP2 framing layer” error.

Going from HTTPS to plain HTTP would likely also help a lot because of the reduced number of required round-trips per request. But trading privacy for potentially more reliability is a questionable trade-off.

From my experience, IPv6 communications over today’s internet are more prone to intermittent packet loss than IPv4. If you have the option to use either, you can pass the “-4” flag to curl and it will use IPv4. This might be a pragmatic choice in the short term, but we should also keep pestering ISPs to improve their IPv6 reliability.


If you experience failed HTTP requests to Healthchecks.io, and fixing the root cause is outside of your control, adding the above retry parameters to your curl calls can help as a mitigation. Also, curl is awesome.

Happy curl’ing,
–Pēteris, Healthchecks.io

Preventing Office 365 ATP From Clicking Unsubscribe Links

Office 365 has a fancy optional feature called Advanced Threat Protection (ATP). It scans incoming emails for malware. It also opens any links in the emails and scans the contents of the links as well. Unfortunately, if an email message has an “Unsubscribe” link, ATP will “click” that link too and potentially unsubscribe the user.

The standard fix is to make sure a simple HTTP GET request does not unsubscribe the user. HTTP GET requests should be “safe” and side-effect free. To achieve one-click unsubscribe, my approach used to be:

  • An HTTP GET request to the unsubscribe URL serves an HTML form, and a tiny bit of JS to submit the form on page load
  • An HTTP POST request actually unsubscribes the user

This has been working fine with most email security scanners and link preview generator bots. But Office 365 ATP is more sophisticated than others: it scans URLs found in email messages by loading them in a full-blown browser, and the browser executes JS. This defeats my simple bot protection, and insta-unsubscribes Office 365 users with ATP enabled from Healthchecks.io notifications the moment they receive the first notification.

A simple solution would be to remove the auto-submit JS code, and always require a manual click from the user to confirm the unsubscribe. But I really didn’t want to give up the single-click unsubscribe functionality, and was looking for alternate solutions. I asked on StackOverflow and got an answer with several good ideas (thanks Adam!). I implemented the “timer” idea:

  • If the unsubscribe link is opened less than 5 minutes after sending the email, treat it as potential bot activity, and require manual confirmation (user/bot needs to do one extra click).
  • Otherwise, assume it’s human and auto-submit the form on page load.
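
The timer rule can be sketched in a few lines of Python. This is an illustration, not the actual Healthchecks.io code: the function name, the `sent_at` argument, and the way the send time reaches the view (for example, embedded in a signed unsubscribe link) are all assumptions.

```python
from datetime import datetime, timedelta, timezone

# Links opened sooner than this after sending are treated as possible bot traffic.
BOT_WINDOW = timedelta(minutes=5)


def should_autosubmit(sent_at, now=None):
    """Return True if the unsubscribe form may auto-submit on page load.

    sent_at: timezone-aware datetime of when the email was sent (hypothetical input).
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now - sent_at >= BOT_WINDOW


sent = datetime(2019, 11, 1, 12, 0, tzinfo=timezone.utc)
print(should_autosubmit(sent, now=sent + timedelta(seconds=10)))  # False: ask for a manual click
print(should_autosubmit(sent, now=sent + timedelta(hours=2)))     # True: auto-submit the form
```

The view serving the GET request would then include the auto-submit JS only when the function returns True.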

From what I’ve seen in practice, ATP scans the links as soon as it receives the email. So it doesn’t receive the auto-submit JS code, and cannot execute it. I’ve verified with an affected user that this mitigation indeed seems to be working: since deploying the fix they have received several Healthchecks.io notifications, and ATP has not auto-unsubscribed them yet. So we’re good, at least until Office 365 changes tactics.

PS. There is also RFC 8058: “Signaling One-Click Functionality for List Email Headers”. It specifies the “List-Unsubscribe-Post” email header:

List-Unsubscribe: <https://example.com/unsubscribe/opaquepart>
List-Unsubscribe-Post: List-Unsubscribe=One-Click

This tells the email client: to unsubscribe without an additional confirmation step, send an HTTP POST to the URL in the “List-Unsubscribe” header. The email client can implement its own “Unsubscribe” function in its UI. For example, in Gmail you may see an “Unsubscribe” link next to the sender’s address.

I’ve implemented this in Healthchecks.io, but I need an “unsubscribe” link in the email’s footer too, because people still look for it there.
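
For illustration, here is roughly how the two headers could be attached to an outgoing message using Python’s standard `email` library. The addresses, subject, and function name are placeholders, and this is not necessarily how Healthchecks.io builds its emails:

```python
from email.message import EmailMessage


def build_message(unsub_url):
    """Build a notification email carrying the RFC 8058 one-click headers."""
    msg = EmailMessage()
    msg["From"] = "alerts@example.com"   # placeholder sender
    msg["To"] = "user@example.org"       # placeholder recipient
    msg["Subject"] = "Example notification"
    # The URL in List-Unsubscribe is wrapped in angle brackets (RFC 2369).
    msg["List-Unsubscribe"] = "<%s>" % unsub_url
    msg["List-Unsubscribe-Post"] = "List-Unsubscribe=One-Click"
    msg.set_content("Body text, with a regular unsubscribe link in the footer.")
    return msg


msg = build_message("https://example.com/unsubscribe/opaquepart")
print(msg["List-Unsubscribe"])
print(msg["List-Unsubscribe-Post"])
```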

– Pēteris

Incident Report – 27 October 2019

On 27 October around 18:30 UTC one of the load balancers serving healthchecks.io, hc-ping.com and hchk.io started experiencing network issues. A monitoring script correctly detected this and removed the affected load balancer from DNS rotation, but some requests still got lost while the DNS changes propagated.

After receiving a monitoring alert I jumped in to investigate. The load balancers and backend servers are set up to communicate over a virtual private network (Hetzner’s vSwitch feature). The problematic load balancer was intermittently unable to reach the backend servers over their private IP addresses. Haproxy was reporting backends sporadically flipping between healthy / unhealthy states.

I opened a support request with Hetzner at 19:03 UTC, and in the meantime tried various things to fix the issue myself. Hetzner support confirmed they were dealing with an issue, and at 22:21 UTC they reported the issue was resolved. The load balancer could reach all its peers over their vSwitch IP addresses again. I monitored connectivity for one more hour and then added the load balancer back to DNS rotation.

At the time of writing this, I don’t yet know what the exact root cause was. The vSwitch configuration had worked with absolutely no problems for many months, and this incident came seemingly out of the blue. I requested more details from Hetzner support and so far they’ve said they are still analyzing this incident, but the vSwitch feature itself is considered “very stable”. I have not made up my mind whether to replace the vSwitch networking with something else. I don’t want to make hasty decisions, and will first wait for more information from Hetzner.

In summary, the good:

  • A monitoring script correctly detected a malfunctioning load balancer and removed it from DNS rotation.
  • Luckily I was able to investigate immediately (the incident happened on Sunday at 8:30PM local time).
  • Quick communication from Hetzner support: 7 updates from them in 3 hours.
  • And, what really matters in the end: they got the issue fixed.
  • The other load balancer remained fully functional the whole time. Some of the uptime monitoring services didn’t even notice an outage.

And the bad:

  • Some pings from client systems did get lost. This likely resulted in a number of false “X is down” alerts after the outage.

I apologize to all Healthchecks.io users for any inconvenience caused. For a monitoring service, any downtime is unacceptable, and I will continue to look for any opportunities to make the service more robust.

– Pēteris Caune,
Healthchecks.io

A Look at Healthchecks.io Hosting Setup, Summer 2019

For a monitoring service, uptime and reliability are of course critical features: customers are placing trust in the service to detect problems and deliver timely and accurate alerts. While I cannot guarantee that Healthchecks.io will absolutely never let you down, I can offer transparency on how it is currently being hosted and operated.

I will use bullet point lists liberally in this blog post (otherwise, it could turn into “The Opinionated and Frugal SRE Book”). Let’s go:

  • Main values: Simple is good. Efficient is good. Less is more.
  • The core infrastructure runs on Hetzner bare metal machines. Hetzner offers amazing value for money and is a big part of the reason why Healthchecks.io can offer its current pricing.
  • No containers, no auto-scaling, no “serverless”. Plain old servers, each dedicated to a single role: “load balancer”, “application server” and “database server”.
  • The machines are closer to “pets” than “cattle”: I have provisioning scripts to set up new ones relatively quickly, but in practice the working set of machines changes rarely. For example, the primary database server currently has an uptime of 375 days.

Load balancers

  • Hardware: Two Intel 9900K machines, costing €69/month each.
  • Software: HAProxy.
  • Fault tolerance: A watchdog script on an external VM detects unhealthy load balancers and removes their DNS records.
  • Supports HTTP/2. Supports IPv6. Uses ECDSA certificates with modern clients and falls back to RSA certificates with old clients.
  • Used Loader.io to test performance-related configuration changes.
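
To make the watchdog idea concrete, here is a minimal Python sketch of a single watchdog pass. It is only an outline of the approach described above, not the actual script: `is_healthy` and `remove_record` are hypothetical callables standing in for a real health probe and a real DNS provider API.

```python
def watchdog_pass(load_balancers, is_healthy, remove_record):
    """Check each load balancer once and pull unhealthy ones from DNS.

    is_healthy(lb) and remove_record(lb) are hypothetical hooks for a
    health probe and a DNS API call. Returns the load balancers removed.
    """
    removed = []
    for lb in load_balancers:
        if not is_healthy(lb):
            remove_record(lb)
            removed.append(lb)
    return removed


# Demo with canned health results instead of real probes and DNS calls.
health = {"lb1": True, "lb2": False}
dns_removals = []
print(watchdog_pass(["lb1", "lb2"], health.get, dns_removals.append))
```

A real script would run this in a loop, and would also need a way to put a recovered load balancer back into rotation.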

Before going HAProxy, I used Cloudflare Load Balancer. Compared to Cloudflare:

  • The monthly running cost ends up being slightly lower but similar.
  • It took more work to research, set up, and maintain the dedicated load balancer servers.
  • HTTP request latencies seen from clients are much more consistent. Cloudflare was great on some days, and not so great on others.
  • More control over the load balancer configuration, and more options to diagnose problems.

Application Servers

  • Hardware: Three Intel 6700/7700 machines, costing €39/month each.
  • Software: Nginx -> uWSGI -> Django.
  • Fault tolerance: load balancers detect unhealthy or “in maintenance mode” machines and remove them from rotation.
  • For daemon processes that need to be constantly running (“sendalerts”, “sendreports”) I use systemd services.
  • Communication with load balancers is over private network (Hetzner’s vSwitch).

Side note: when I started out, I took the private network idea to the extreme: I asked Hetzner support to install two NICs in each machine, and to connect the secondary NICs using a dedicated switch. A custom order like that takes more time (days instead of minutes) and costs more, but is possible. I later moved away from that setup, for a couple of reasons:

  • All servers had to be in the same rack. Cannot add more machines if the rack is full.
  • Also, with this setup, one critical switch failure could take out all of the servers.

Database

  • Hardware: Two Intel Xeon E3-1275 machines, ECC RAM, NVMe storage, costing €59/month each.
  • Software: PostgreSQL 10.
  • Primary/hot standby setup.
  • Fault tolerance: I am not brave enough to completely automate database failover – I have a tested procedure to perform the failover, but the decision has to be made manually.
  • Application servers specify multiple hosts in the database connection strings. When primary changes, the applications can fail over to the new primary without any additional configuration changes.
  • Backups: Full daily backups encrypted with GPG and uploaded to S3. I keep 2 months’ worth of backups and delete older ones.
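
As an illustration of the multi-host connection strings mentioned above: since PostgreSQL 10, libpq accepts a comma-separated list of hosts and tries them in order. The Python sketch below builds such a string. The host names, database name, and user are placeholders, and the `target_session_attrs=read-write` setting (which makes the client skip servers that do not accept writes, such as a hot standby) is an assumption about a sensible configuration, not a confirmed detail of the Healthchecks.io setup.

```python
def build_dsn(hosts, dbname, user, port=5432):
    """Build a libpq keyword/value connection string listing several hosts."""
    return " ".join([
        "host=" + ",".join(hosts),
        "port=%d" % port,  # a single port applies to every listed host
        "dbname=" + dbname,
        "user=" + user,
        # Assumption: only settle on a server that accepts writes (the primary).
        "target_session_attrs=read-write",
    ])


print(build_dsn(["db1.example.org", "db2.example.org"], "hc", "hc"))
```

With a string like this, the application does not need to know which machine is currently the primary: after a failover, new connections land on whichever listed host accepts writes.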

Monitoring

  • OpsDash agent and hosted dashboard for a “big picture” view of the servers, and alerting.
  • Netdata agent on each machine for investigating specific issues.
  • Four VMs in different locations sending regular pings, and logging everything, including TCP packet captures.
  • Logs from the VMs aggregated to Papertrail. Papertrail sends alerts on specific log events.
  • An always-on laptop dedicated to showing live Papertrail logs.

[Photo: my workplace, with Papertrail logs on the top right screen]

Ops

  • Yubikey for signing commits, logging into servers and decrypting database backups. For initial setup, I used DrDuh’s guide.
  • A small laptop with development and deployment environments, and another Yubikey. Travels with me when I leave home.
  • Fabric scripts for server provisioning, updates and maintenance.
  • Operated by one person. I’m not in front of my PC 100% of the time, so incidents can take time to fix.

Custom Bits

  • I maintain a private branch of the Healthchecks project with non-public customizations: branding, customized footer template, the “About”, “FAQ”, “Privacy Policy” pages and similar.
  • For serving the ping endpoints (hc-ping.com, hchk.io) efficiently, I wrote a small Go application, which I have not open sourced.

Last but not least, I am eating my own dog food and am monitoring many of the periodic and maintenance tasks using Healthchecks.io itself. For example, if the load balancer watchdog silently fails, I might not notice until much later when it fails to do its duty. With Healthchecks.io monitoring, if the watchdog runs into issues (the script crashes, VM loses network connectivity, anything) I receive an email and an SMS alert within minutes.
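
A minimal Python sketch of this pattern: run the periodic task, and ping the check’s URL only if the task finished. If the script crashes or the machine loses connectivity, no ping arrives and Healthchecks.io raises the alert. The URL is a placeholder and the function names are made up for illustration.

```python
import urllib.request

# Placeholder: use the ping URL of your own check.
PING_URL = "https://hc-ping.com/your-uuid-here"


def ping(url, timeout=5, retries=3, opener=urllib.request.urlopen):
    """Request the ping URL, trying up to `retries` times. True on success."""
    for _ in range(retries):
        try:
            with opener(url, timeout=timeout):
                return True
        except OSError:
            # Covers timeouts, connection errors and HTTP error responses.
            continue
    return False


def run_and_ping(job, ping_url, send=ping):
    """Run the job, then report success. If the job raises, no ping is sent."""
    job()
    return send(ping_url)


# Demo with a recording stand-in for ping(), so nothing touches the network.
sent = []
run_and_ping(lambda: None, PING_URL, send=sent.append)
print(sent)
```

In real use, `run_and_ping(do_the_work, PING_URL)` would be the last step of the cron job, with `do_the_work` being whatever the task actually does.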

Thanks for reading,
– Pēteris, Healthchecks.io