Healthchecks.io

Two-factor Authentication

By Pēteris Caune / November 26, 2020 November 26, 2020

Healthchecks.io now supports two-factor authentication using the WebAuthn standard. Here is how it works: in the Account Settings page, users can see their registered FIDO2 security keys and register new ones:

When logging in, if the account has any registered keys, Healthchecks requires the user to authenticate with one of their keys:

Users can register multiple keys, users can give their keys nicknames, and users can remove registered keys. Removing the last key deactivates two-factor authentication. And, from the user’s perspective, for now at least, that’s mostly it!

There are some nuances on the UI side, and there are quite a few subtle things on the technical side to deal with. Here are a few examples.

If the user has just one registered security key, losing the key means losing access to their account. It is good to have a second, backup key and store it separately. I added a note in the UI about that:

When the user removes their last security key, they are effectively also disabling the two-factor authentication for their account, and should be aware of it:

In a high-risk situations (add security key, remove security key, change email address, change password, close account) the service should require the user to re-authenticate. My solution here is to send a six-digit confirmation code to the user’s email and require the user to enter it back.

When the user enters the correct code, they can continue to the sensitive action, and will not be asked to enter another code for the next 30 minutes. The code entry form uses rate limiting to prevent brute-force attacks.

For implementation, I used the fido2 Python library by Yubico. They provide a sample Relying Party implementation, which I used as a reference. Yubico also provides a WebAuthn Developer Guide. It is a good resource with the right level of detail, and I ended up reading and re-reading it multiple times.

To clear up specific questions (“What are the requirements for user handle?“, “What the relying party identifier should look like?“, …) I had to look at the W3C WebAuthn specification a few times as well.

This is a preliminary implementation. I’ve personally tested it with several types of security keys on Firefox and Chrome. If you experience any issues with registering or authenticating with your security key(s), please report it!

Will there be support for other 2FA methods: SMS, TOTP?

SMS – no. TOTP – potentially, if there is significant demand for it.

That’s all for now, thanks for reading!
Pēteris,
Healthchecks.io

Using Github Actions to Run Django Tests

By Pēteris Caune / November 23, 2020 November 23, 2020

I recently found out Travis CI is ending its free-for-opensource offering, and looked at the alternatives. I recently got badly burned by giving an external CI service access to my repositories, so I am now wary of giving any service any access to important accounts. Github Actions, being a part of Github, therefore looked attractive to me.

I had no experience with Github Actions going in. I have now spent maybe 4 hours total tinkering with it. So take this as “first impressions,” not “this is how you should do it.” I’m a complete newbie to Github Actions, and it is just fun to write about things you have just discovered and are starting to learn.

My objective is to run the Django test suite on every commit. Ideally, run it multiple times with different combinations of Python versions (3.6, 3.7, 3.8) and database backends (SQLite, PostgreSQL, MySQL). I found a starter template, added it in .github/workflows/django.yml, pushed the changes, and it almost worked!

The initial workflow definition:

name: Django CI

on:
  push:
    branches: [ master ]
  pull_request:
    branches: [ master ]

jobs:
  build:

    runs-on: ubuntu-latest
    strategy:
      max-parallel: 4
      matrix:
        python-version: [3.6, 3.7, 3.8]

    steps:
    - uses: actions/checkout@v2
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v2
      with:
        python-version: ${{ matrix.python-version }}
    - name: Install Dependencies
      run: |
        python -m pip install --upgrade pip
        pip install -r requirements.txt
    - name: Run Tests
      run: |
        python manage.py test

And the results:

Everything looked almost good except for a couple missing dependencies. A couple of dependencies for optional features (apprise, braintree, mysqlclient) are not listed in requirements.txt, but are needed for running the full test suite. After adding an extra “pip install” line in the workflow, the tests ran with no issues.

Adding Databases

If you run the Healthchecks test suite with its default configuration, it uses SQLite as the database backend which usually Just Works. You can tell Healthchecks to use PostgreSQL or MySQL backends instead by setting environment variables.

Looking at Github Actions documentation suggested I should use service containers. By using special syntax, you tell Github Actions to start a database in a Docker container before the the rest of the workflow execution starts. You then pass environment variables with the database credentials (host, port, username, password) to Healthchecks. It took me a few failed attempts to get running, but I got it figured out relatively quickly:


name: Django CI

on:
  push:
    branches: [ master ]
  pull_request:
    branches: [ master ]

jobs:
  build:

    runs-on: ubuntu-20.04
    strategy:
      max-parallel: 4
      matrix:
        db: [sqlite, postgres, mysql]
        python-version: [3.6, 3.7, 3.8]
        include:
          - db: postgres
            db_port: 5432
          - db: mysql
            db_port: 3306

    services:
      postgres:
        image: postgres:10
        env:
          POSTGRES_USER: postgres
          POSTGRES_PASSWORD: hunter2
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
        ports:
          - 5432:5432
      mysql:
        image: mysql:5.7
        env:
          MYSQL_ROOT_PASSWORD: hunter2
        ports:
          - 3306:3306
        options: >-
          --health-cmd="mysqladmin ping"
          --health-interval=10s
          --health-timeout=5s
          --health-retries=3
    steps:
    - uses: actions/checkout@v2
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v2
      with:
        python-version: ${{ matrix.python-version }}
    - name: Install Dependencies
      run: |
        python -m pip install --upgrade pip
        pip install -r requirements.txt
        pip install braintree mysqlclient apprise
    - name: Run Tests
      env:
        DB: ${{ matrix.db }}
        DB_HOST: 127.0.0.1
        DB_PORT: ${{ matrix.db_port }}
        DB_PASSWORD: hunter2
      run: |
        python manage.py test

And here is the timing of a sample run:

Making It Quick

One thing that bugged me was the database containers took around one minute to initialize. Additionally, both the PostgreSQL and MySQL would initialize on all jobs, even the jobs only needing SQLite. This is not a huge issue, but, my inner hacker still wanted to see if the workflow can be made more efficient. With a little research, I found the Github Actions runner images come with various preinstalled software. For example, the “ubuntu-20.04” image I was using has both MySQL 8.0.22 and PostgreSQL 13.1 preinstalled. If you are not picky about database versions, these could be good enough.

I also soon found the install scripts Github uses to install and configure the extra software. For example, this is the script used for installing postgres. One useful piece of information I got from looking at the script is: it does not set up any default passwords and does not make any changes to pg_hba.conf. Therefore I would need to take care of setting up authentication myself.

I dropped the services section and added new steps for starting the preinstalled databases. I used the if conditionals to only start the databases when needed:

name: Django CI

on:
  push:
    branches: [ master ]
  pull_request:
    branches: [ master ]

jobs:
  test:

    runs-on: ubuntu-20.04
    strategy:
      matrix:
        db: [sqlite, postgres, mysql]
        python-version: [3.6, 3.7, 3.8]
        include:
          - db: postgres
            db_user: runner
            db_password: ''
          - db: mysql
            db_user: root
            db_password: root

    steps:
    - uses: actions/checkout@v2
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v2
      with:
        python-version: ${{ matrix.python-version }}
    - name: Start MySQL
      if: matrix.db == 'mysql'
      run: sudo systemctl start mysql.service
    - name: Start PostgreSQL
      if: matrix.db == 'postgres'
      run: |
        sudo systemctl start postgresql.service
        sudo -u postgres createuser -s runner
    - name: Install Dependencies
      run: |
        python -m pip install --upgrade pip
        pip install -r requirements.txt
        pip install apprise braintree coverage coveralls mysqlclient
    - name: Run Tests
      env:
        DB: ${{ matrix.db }}
        DB_USER: ${{ matrix.db_user }}
        DB_PASSWORD: ${{ matrix.db_password }}
      run: |
        coverage run --omit=*/tests/* --source=hc manage.py test
    - name: Coveralls
      if: matrix.db == 'postgres' && matrix.python-version == '3.8'
      run: coveralls
      env:
        GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

And the resulting timing:

There is more time to be gained by optimizing the “Install Dependencies” step. Github Actions has a cache action which caches specific filesystem paths between job runs. One could figure out the precise location where pip installs packages and cache it. But this is where I decided it was “good enough” and merged the workflow configuration into the main Healthchecks repository.

In summary, first impressions: messing with Github Actions is good fun. The workflow syntax documentation is good to get a quick idea of what is possible. The workflow definition ends up being longer than the old Travis configuration, but I think the extra flexibility is worth it.

Codeship Security Incident’s Impact on Healthchecks.io

By Pēteris Caune / October 22, 2020 October 22, 2020

Recently, Codeship reported a security incident: a database containing their production data had been exposed for over a year.

I have a Codeship account. I also have a Bitbucket account, and my Codeship account was authorized to access it. In the Bitbucket account, I have a private repository with various secrets (API keys, access tokens) that Healthchecks.io production environment uses to talk with the database and external services.

I was storing the secrets in the Bitbucket in an unencrypted form. If leaked, they would allow an attacker to do things like:

send emails impersonating Healthchecks.io (bad)
send SMS messages from Healthchecks.io number (bad, can get expensive)
access Healthchecks.io customers’ billing addresses (very bad)

I have found no evidence that my private Bitbucket account was accessed or any secrets leaked. At the same time, I cannot conclusively prove it was not possible for the secrets to leak.

I debated with myself whether I should write this post. It boils down to a moral dilemma: do I write a full disclosure about what is likely a non-event, and risk a reputation hit? Or do I handle it quietly and just let it pass?

Codeship and Bitbucket

I found Codeship’s security notification on October 2 in my spam folder. According to Codeship, their database was exposed from June 2019 to June 2020. There is evidence that the attacker was actively using data from their database.

I am not using Codeship for anything Healthchecks-related, but I had granted Codeship access to my Bitbucket account for an unrelated project. The grant was required to set up automatic Codeship builds on each Bitbucket commit. Unfortunately, it looks like Codeship asks for way too many permissions:

I wish I had read this carefully and thought about the implications when I was setting this up back in January 2016.

After granting access, Bitbucket gives Codeship an OAuth access token, which ends up in Codeship’s database. Using the OAuth token, Codeship (or its attacker) can access and manipulate the user’s repositories.

After becoming aware of the incident, I revoked Codeship’s access from Bitbucket, and checked all my repositories for any rogue access keys or unexpected changes. I asked Codeship support if the Bitbucket access token could have been exposed. They said:

The Bitbucket token was potentially exposed. I would recommend revoking this token and looking through your repo history for any suspicious activity, although so far we have not heard of any Bitbucket-focused activity.

I asked Bitbucket support for API call logs. They did prepare and send a report, but unfortunately they only have data from the previous 3-4 weeks.

Healthchecks.io Secrets in Bitbucket

In Bitbucket, I have a private repository with deployment scripts. These are Fabric scripts for bootstrapping new machines, deploying and updating software, and various maintenance tasks. The repository also contains a local_settings.py file with all the production API keys and tokens the Healthchecks app needs to run.

Storing unencrypted secrets in version control is, of course, a mistake on my part. I have now fixed this and am now using Mozilla’s sops to encrypt the secrets. The GPG encryption key is on a Yubikey.

Rotating Secrets

Rotating all secrets was the most time consuming and stress-inducing part. Most of the secrets I had never rotated before, so I had to figure out a safe update procedure for each.

Database. One of the secrets was the PostgreSQL database password. Access to the database is also restricted at the firewall and the pg_hba level, so the password alone is not enough to access the database. Of course, I still wanted to change the password. The procedure was:

Create a secondary database user
Switch all services to use the secondary user
Update the password of the primary database user
Switch all services back to use the primary user

Sounds simple enough, but it involved lots of planning, checking, and double-checking. White knuckles and lots of coffees.

Services that support seamless key rotation: AWS SMTP credentials, Braintree, Matrix, Pushbullet, Sentry, Twilio. The process:

Create a secondary API key while the primary key still works
Update all services to use the secondary key
Promote the secondary token to the primary (or delete the primary, leaving just one)

I updated the keys one-by-one, checking everything after each step. In some cases, I kept the old key active for a couple days to make sure no production system was still using it.

Services that only support resetting the keys in place: Discord, LINE Notify, OpsDash, Pushover, Slack, Telegram. For these, the process is “quick hands.” Regenerate the secret, then update the production machines with the new secret as quickly as possible. Again, going through them one by one and testing everything after each change.

SSL certificates. I had the SSL certificates for healthchecks.io, hc-ping.com, and hchk.io purchased from Namecheap. They do support reissuing certificates and revoking the old ones. One complication here was that, for each hostname, I was using both RSA and ECDSA certificates (provisioned as instructed by Namecheap: issue RSA first, then reissue as ECDSA). To minimize the chance of making a mistake, I decided to purchase and deploy new certificates from a different provider and only afterwards revoke the existing certificates. I discovered SSLMate, and got the certificates from them. I liked the process of ordering certificates from the command line. It’s a big improvement over the usual process of copy-pasting CSRs around. I installed the new certificates, tested the setup thoroughly, and revoked the old certificates a few days later.

There is still one task on my list: Django’s SECRET_KEY setting. I’m investigating the impact of changing it. One issue here is that Healthchecks.io uses SECRET_KEY as one component in a hash function when generating the badge URLs. Naively changing SECRET_KEY would invalidate all existing badge URLs. Coupling badge URL generation and SECRET_KEY was a seemingly small design decision, and now it comes biting me in the rear years later! I will deal with this but wanted to get this post out with no further delay.

Closing Thoughts

As I wrote in the beginning, I have no evidence that any Healthchecks.io secrets have been leaked or used. Still, they should have not been accessible to third parties (Bitbucket and Codeship) in the first place – I’ve now fixed this mistake.

As always, if you have any questions, please write to contact@healthchecks.io.

– Pēteris, Healthchecks.io

About Tracking Cookies on status.healthchecks.io

By Pēteris Caune / September 24, 2020 September 24, 2020

status.healthchecks.io used to set an “ajs_anonymous_id” tracking cookie. I’m happy to report that it does not do that anymore since September 22, 2020. In this post, I’ll share the process I went through to get the tracking cookie removed.

For powering status.healthchecks.io, I am using a third-party hosted status page provider, Statuspage.io, by Atlassian. I initially set it up in May 2020 and wrote about it on this blog. After the setup, while poking around, I discovered my fancy new status page sets a tracking cookie. It does not ask for the user’s consent, and it does not obey the “DNT” header – when you visit the page, you get a tracking cookie.

I believe this cookie was only used for innocuous purposes (tracking the number of unique page visitors), but it still invades site visitors’ privacy and violates GDPR requirements. On May 7, I submitted a support ticket asking to remove the tracking cookie and got a reply with a bottom line: “We can’t avoid setting these cookies.” After asking again, I got back a non-commital “I will forward this to our product team and development team,” and that was that.

I had already invested a significant amount of time setting up automation and custom metrics for the status page. And, aside from the cookie issue, I was generally happy with the product. Before switching providers over this one issue, I wanted to take a crack at fixing it. It was unlikely Atlassian would spend any engineering resources just because a single $29/mo customer had complaints. So I needed to bump up the priority of the issue. I searched around for other Statuspage.io customers and started contacting them. My email template went through several iterations until I got to a version that felt transparent and not manipulative:

Subject: Cookies on status.somedomain.com
Hello,
when I visit status.somedomain.com I see it stores the following cookies in my browser:
* ajs_anonymous_id
* ajs_group_id
These are Atlassian’s tracking cookies. They are not essential, and so under GDPR they require the user’s explicit opt-in before they can be sent to the browser.
I am an Atlassian Statuspage customer myself, and my service’s status page has the exact same problem. I’ve contacted Atlassian about this but this appears to be low priority for them.
I am contacting you because I think more affected customers being aware of the issue and asking Atlassian to fix it = higher chance that they will actually do something.
Thanks,
Pēteris Caune

I started by manually sending ten or so emails out every week. I mostly got sympathetic and cooperative responses. There were some funny ones too. For example, one guy insisted that there is no problem because he could not reproduce the issue using “internal methods.” Me showing him the results of several different cookie scanning services (cookie-script.com, cookiebot.com) did not sway him.

I kept contacting other companies, and they sometimes forwarded me the responses they were getting from Atlassian. From these responses, it didn’t look like we were making much progress. In July, two months in, I decided to amp things up. I grabbed the Majestic Million dataset with the top million websites. I wrote a script that goes through the list, and, for each website, checks if it has an Atlassian-operated “status” subdomain. The script produced an HTML page with filtered results and “mailto:” links, to help me send out the emails. Side note: did you know the “mailto:” links can specify the message body?

To find email addresses, I found the best way was to look at each website’s privacy policy and search for the “@” symbol. I found typical contact addresses were privacy@somedomain.com and dpo@somedomain.com (where “dpo” stands for Data Protection Officer). On July 26-27, one by one, I sent out emails to around 200 companies.

The wave of new support tickets from various companies worked. Atlassian started communicating back a plan to implement a cookie consent banner in Q1 2021. Later in August, they started saying “late September 2020”. I held off from sending more emails and waited to see what would happen in September.

On September 22, I received an update from Atlassian. Instead of implementing a cookie consent banner, they decided to drop the Page Analytics feature, which was responsible for the tracking cookie. From my point of view, this is the best possible outcome – no tracking cookie and no consent banner. Statuspage.io still has an option of adding a Google Analytics tag. So, there still is a way to track the unique visits for those who need it.

Thank you, Atlassian / Statuspage.io, for implementing this change. I appreciate it! To my contact at Atlassian support, thank you for your patience.

To everyone who also contacted Atlassian about the tracking cookies, thank you! It took a team effort, but it worked out in the end!

– Pēteris

Database Fail-over on May 20

By Pēteris Caune / May 22, 2020 May 22, 2020

On May 20, the primary database server of Healthchecks.io experienced packet loss and latency issues. To ensure normal operation of the service, the database was failed-over to a hot standby. The fail-over went fine, aside from a minor issue with one fat-fingered firewall rule. Below are additional details about the network issue and the fail-over process.

The Network Issue

The symptoms: network latency and packet loss on the database host shoots up for 1-2 minutes, then everything goes back to normal. This had occurred a few times in the past weeks already, causing ping processing delays and subsequent “X is down” alerts each time.

This issue has been hard to troubleshoot because it seemed to happen at random times, and lasted only for a couple minutes each time. I was running a few monitoring tools continuously: Netdata agent including the fping plugin, OpsDash agent, and mtr in a loop logging to a text file. I could also inspect application logs for clues.

To illustrate the elevated latency, here’s how pinging 8.8.8.8 from the host looks normally:

$ ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=57 time=4.93 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=57 time=4.96 ms
64 bytes from 8.8.8.8: icmp_seq=3 ttl=57 time=4.94 ms
64 bytes from 8.8.8.8: icmp_seq=4 ttl=57 time=4.95 ms
(…)

And here’s the same command during the problematic 2-minute window:

$ ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=57 time=104 ms
64 bytes from 8.8.8.8: icmp_seq=3 ttl=57 time=104 ms
64 bytes from 8.8.8.8: icmp_seq=4 ttl=57 time=102 ms
64 bytes from 8.8.8.8: icmp_seq=5 ttl=57 time=104 ms
64 bytes from 8.8.8.8: icmp_seq=6 ttl=57 time=104 ms
64 bytes from 8.8.8.8: icmp_seq=7 ttl=57 time=104 ms
64 bytes from 8.8.8.8: icmp_seq=8 ttl=57 time=104 ms
64 bytes from 8.8.8.8: icmp_seq=9 ttl=57 time=100 ms
64 bytes from 8.8.8.8: icmp_seq=11 ttl=57 time=104 ms
64 bytes from 8.8.8.8: icmp_seq=12 ttl=57 time=104 ms
64 bytes from 8.8.8.8: icmp_seq=13 ttl=57 time=104 ms
64 bytes from 8.8.8.8: icmp_seq=14 ttl=57 time=104 ms
64 bytes from 8.8.8.8: icmp_seq=16 ttl=57 time=104 ms
^C
--- 8.8.8.8 ping statistics ---
16 packets transmitted, 13 received, 18% packet loss, time 14996ms
rtt min/avg/max/mdev = 100.713/104.111/104.729/1.207 ms

Impact

When the latency from app servers and the database suddenly goes from 0.5ms to 100ms, there are going to be issues. The most pressing issue was the processing of incoming pings. On app servers, each received ping is put in a queue. A single worker process (a goroutine, to be exact) takes items from the queue and inserts them in the database. This is a sequential process and the latency to the database puts a limit on how many pings can be processed per second. To illustrate, under normal operation the worker can process all incoming pings without a backlog building up:

May 22 13:06:00 www1 hchk[13236]: 47 pings/s
May 22 13:06:01 www1 hchk[13236]: 105 pings/s
May 22 13:06:02 www1 hchk[13236]: 328 pings/s
May 22 13:06:03 www1 hchk[13236]: 265 pings/s
May 22 13:06:04 www1 hchk[13236]: 108 pings/s
May 22 13:06:05 www1 hchk[13236]: 72 pings/s

During the high-latency period, throughput drops significantly and backlog starts to build up (timestamps in CEST):

May 20 17:39:49 www1 hchk[21015]: 5 pings/s, queued 47, dwell time 2373ms
May 20 17:39:50 www1 hchk[21015]: 5 pings/s, queued 68, dwell time 3358ms
May 20 17:39:51 www1 hchk[21015]: 6 pings/s, queued 104, dwell time 3696ms
May 20 17:39:52 www1 hchk[21015]: 3 pings/s, queued 180, dwell time 4624ms
May 20 17:39:53 www1 hchk[21015]: 5 pings/s, queued 241, dwell time 5268ms
May 20 17:39:54 www1 hchk[21015]: 3 pings/s, queued 268, dwell time 6116ms
May 20 17:39:55 www1 hchk[21015]: 4 pings/s, queued 292, dwell time 6975ms
May 20 17:39:56 www1 hchk[21015]: 5 pings/s, queued 325, dwell time 7819ms
May 20 17:39:57 www1 hchk[21015]: 3 pings/s, queued 340, dwell time 8445ms

When the dwell time (the age of the oldest item in the queue) goes above 15 seconds, the app server declares itself unhealthy. When all app servers are unhealthy, clients start getting “HTTP 503 Service Unavailable” errors.

Even if the network hiccup is brief, and no client request gets rejected, there is still a negative consequence: there is a delay from the client making a HTTP request, to the ping getting registered in the database. For a check with a tight timing, this additional delay can push it over the limit and cause a false alert.

Fail-over

On May 20, after another latency spike, I decided to fail-over the database to a hot standby. I didn’t know what was ultimately causing the network issues, but the hope was that moving to a different physical host will make things better.

The fail-over procedure is simple:

make sure there is no replication lag
stop the primary
promote the hot standby

The application servers didn’t need to be updated or restarted – they knew IP addresses of both database hosts, and could automatically cycle between them until they found one that was alive and was accepting writes. More details on this in PostgreSQL documentation: Specifying Multiple Hosts and target_session_attrs.

The only issue I ran into during the fail-over was with a firewall rule denying one of the app servers access: I had mistyped port “5432” as “4322”, which was an easy fix.

Preparing a New Hot Standby

The service was back to normal operation, but the database host was now lacking a hot standby. Luckily, I had a fresh system (sporting Ryzen 3700X) already ordered and waiting on the sidelines. I used provisioning scripts to set up a hot standby on it, but checking and double-checking everything still took close to an hour.

Contacting Hetzner Support

I tried contacting Hetzner Support about the packet the loss and latency jump. I provided mtr reports (which they always ask for), and ping logs showing latency. After some forth-and-back ultimately the response I got was “There is and was no general network issue known. Also we do not see any packet loss from here.”

To be fair, this could have been an issue with the hardware or software on the physical host – I can’t say for sure.

Workarounds: Parallel Processing, Batching, Dedicated Queue?

As mentioned earlier, each app server processes pings one-by-one, over a single database connection. When a single operation blocks, the whole queue gets stuck. What if we used a connection pool? Or processed pings in batches – say, 100 at a time? Or added a dedicated queue component to the mix?

These are all interesting ideas worth of consideration. I’ve already done some work on the batch processing idea and have promising results. That being said, packet loss would be problematic even with these workarounds. It would need to be solved anyway.

As always, I will continue to look for opportunities to make the service more robust.

Pēteris,
Healthchecks.io

Healthchecks.io Status Page Facelift

By Pēteris Caune / May 1, 2020 May 1, 2020

The Healthchecks.io system status page at status.healthchecks.io recently received a revamp. Here are my notes on the new version.

First up, the components section shows the current and historic status of components:

Dashboard shows the status of the main website, healthchecks.io. “Operational” state here means the website responds to HTTP requests, and has a working connection to the PostgreSQL database. Checkly, an external uptime monitoring service, monitors the website and automatically updates the component’s status via Statuspage.io API. Checkly has powerful and flexible webhook notifications which makes this possible.

Ping API shows the status of the ping endpoint, hc-ping.com. “Operational” state means hc-ping.com is responding to HTTP requests and is inserting pings in the database with no excessive delay. Although the ping endpoint and the main website runs on the same physical servers, they use different software. So it makes sense to monitor them separately. The status of this component is updated automatically by Checkly, same as the dashboard.

Notification Sender shows the status of the background process sending out notifications. Status updates of this component are not automated yet.

Below the components is the “System Metrics” section with four metrics.

Processed pings is the number of valid ping requests (valid UUID, not rate limited) processed per second.
Queued incoming pings is the number of pings that have been received but not yet inserted in the database. A spike suggests either a database problem, or a connectivity issue inside the data center.
Notifications sent is the number of notifications sent per minute.
Queued outgoing notifications is the number of scheduled notifications waiting to be sent out. A growing number means either the notification sender is not working, or it cannot keep up.

I used the following criteria for picking the metrics to show:

The metric should tell something useful about the system.
The metric should be simple to explain. For example, I internally track a few different “queue dwell time” metrics. They are useful, but it would be hard to explain what they mean precisely, and how to interpret them.
The metric should be computationally inexpensive to measure. It should not require a heavy database query.

I considered several ways of measuring, aggregating and submitting the metrics. I ultimately went with:

Each web server exposes a metrics endpoint that an external system can scrape. Here’s a git commit where I added one of the endpoints.
On an external host, a script runs once per minute (via cron of course). It scrapes metrics data from each web server, then processes and submits it to Statuspage.io using their Metrics API. The script is less than 100 lines long.

If you notice gaps in metrics graphs, it could be because the external metrics collector has failed. There are ways to make the metrics collection more robust, but the current simple setup seems to work fine for now.

The final feature in the status page is Incidents. Currently I have not automated incident creation in any way. The plan is to manually open an incident when I become aware of it, and backdate it as makes sense. To test out the Incidents feature, I backfilled a couple past incidents. For example, Delayed notifications on February 7.

And that is all for now. I hope the new status page does not need to be used often! I will also keep posting outage notifications to @healthchecks_io on Twitter as well.

Happy monitoring and meta-monitoring,
Pēteris,
Healthchecks.io

Incident Report – 7 February 2020

By Pēteris Caune / February 10, 2020 February 10, 2020

On February 7, Healthchecks.io experienced an issue with sending notifications. An invalid cron expression slipped into the system, which caused the notification sending jobs to crash and restart in a loop. Timeline (all times are in UTC, and from February 7):

0:37: a check with a bad cron schedule gets created via API
0:41: the check receives its first ping
0:42: one minute later, the notification senders go into a crash-restart loop
1:02: external monitoring alerts go out
7:00: I’ve woken up and found out about the outage
7:28: The invalid cron expression is located and fixed, notification sending resumes
7:52: I post a tweet about the outage
10:00: Deployed mitigations for the “sendalerts” process repeatedly crashing, and stricter cron expression validity checks

This outage started at in the middle of night (2:42 AM local time) and so it took several hours until I found out about it and could jump on to fixing it. During this time, Healthchecks.io was not sending out any notifications (all types: emails, webhooks, Slack alerts, …). On the positive side, the web dashboard, the ping endpoints, the API and the badges were working normally.

After fixing the bad cron schedule, the notification senders resumed work and quickly went through the backlog of unsent notifications:

When notification sending resumed, Healthchecks.io sent out notifications for all checks that had flipped their state once (from “up” to “down”, or from “down” to “up”) during the outage. Unfortunately, it would have missed cases where a check flips twice (for example, “up” → “down” → “up”) during the outage window. If a check went down but came right back up during the outage window, Healthchecks.io missed it and didn’t send a notification.

The Root Cause

The “sendalerts” crash loop was tripping on the following cron schedule: “0 0 31 2 *”. Or, in human words, “at midnight of every February 31st“. The notification sender was crashing while calculating the next expected ping time for this schedule.

The Fix

To get around the immediate crashing problem, I manually edited the problematic cron schedule
In the “sendalerts” management command I added a mitigation for repeatedly crashing on the exact same check. With the mitigation, “sendalerts” postpones the problematic check for 1 hour, so it can process other checks in the meantime.
I added extra validation step for cron expressions. Healthchecks now makes sure it can calculate a valid “next ping time” for a cron expression before allowing it into the system.
When the outage started, I received monitoring alerts from three different services. All three alerts went to email, and I didn’t notice them until the morning. I’ve now updated notification settings to also receive Pushover notifications with the “Emergency” priority. These notifications override phone’s Do Not Disturb settings and repeat until acknowledged.

I apologize to all Healthchecks.io users for any inconvenience caused.

– Pēteris Caune,
Healthchecks.io

Comparison of Cron Monitoring Services (January 2020)

By Pēteris Caune / January 15, 2020 January 17, 2020

If you are looking for a hosted cron job monitoring service, good news: there many options to choose from! In this post I’m comparing a selection of the more popular ones: Cronitor, Healthchecks.io, Cronhub, Site24x7, CronAlarm, PushMon and Dead Man’s Snitch.

How I picked the services for comparison: I searched for “cron monitoring” on Google and picked the top results in their order of appearance. I was looking specifically for hosted, SaaS-style services.

Disclaimer: I run one of the services being compared, so I’m a biased source. In particular, choosing the axis of comparison is subjective, and of course I’m inclined to choose metrics that would make my service look good. When in doubt, do your own research!


Timeout-based schedules
Cron expression schedules
/fail endpoint
/start endpoint
Pinging via email
Team Features
Projects
Teams
Notification Methods
Email
Webhooks
Slack
SMS
Price per Month
For 1 cron job	free	free	free	$10	free	free	free
For 10 cron jobs	$24	free	$49	$10	$5	$15	$19
For 20 cron jobs	$24	free	$49	$10	$20	$25	$19
For 40 cron jobs	$24	$20	$49	$10	$20	$100	$19
For 80 cron jobs	$79	$20	$99	$10	$20	$100	$19
Authentication Methods
Username and password
Google or Github
SSO (SAML2)
Company Metrics
Years in business	5	4	2	13	5	8	8
Head count	2	1	1	?	?	?	?
Popularity in Slack App Directory	4	4	8	–	–	–	5

Timeout-based Schedules

Also called “simple” monitors, where the user specifies a period (for example, one hour). The client system must “check in” at least every period, otherwise the monitoring system raises an alert.

Timeout-based schedules are supported by every reviewed service except Site24x7.

Cron Expression Schedules

The user specifies a cron expression (for example, “0/5 * * * *”) and a timezone. The monitoring system calculates expected “check in” deadlines based on the cron expression.

Supported by: Cronitor, Healthchecks.io, CronHub, Site24x7.
Partially supported by: PushMon (uses non-standard syntax).
Not supported by: CronAlarm, Dead Man’s Snitch.

Fail Endpoint

The ability to explicitly signal a failure. This allows quicker failure notifications, without waiting before the configured timeout and grace time runs.

Supported by: Cronitor, Healthchecks.io, Cronhub, CronAlarm, PushMon, Dead Man’s Snitch.
Not supported by: Site24x7.

Start Endpoint

The ability to signal when a cron job execution starts. This enables the measurement and monitoring of the job’s run time.

Supported by: Cronitor, Healthchecks.io, Cronhub, Site24x7, CronAlarm.
Partially supported by: Dead Man’s Snitch (job’s runtime is reported on completion by a wrapper script).
Not supported by: PushMon.

Pinging Via Email

With this feature, clients can report their status by sending an email. This comes handy when integrating with a services that support email status reports and nothing else.

Supported by: Healthchecks.io, Dead Man’s Snitch
Not supported by: Cronitor, Cronhub, Site24x7, CronAlarm, PushMon.

Projects

The ability organize monitored cron jobs and their associated information by project. Project’s don’t matter much when you have only a few cron jobs to monitor, but become increasingly important as account’s size grows.

Supported by: Healthchecks.io, Site24x7 (“monitor groups”), PushMon (tagging), Dead Man’s Snitch (“cases”).
Not supported by: Cronitor, Cronhub, CronAlarm.

Teams

The ability to give your team members limited access to your account. The alternative would be to use a shared single-user account, which is of course not ideal.

Supported by: Cronitor, Healthchecks.io, Cronhub, Site24x7, Dead Man’s Snitch.
Not supported by: CronAlarm, PushMon.

Notification Methods

Notifications to email, Slack and custom webhooks is supported by all reviewed services. The support for other notification methods varies from service to service:


Email
Discord
Matrix
MS Teams
OpsGenie
PagerDuty
PagerTeam
PagerTree
Pushbullet
Pushover
Slack
SMS
Telegram
Trello
Twitter DM
VictorOps
Webhooks
WhatsApp

Authentication Methods

All reviewed services support classic authentication using username and password. Some of the services offer additional options:

Cronitor supports single-sign-on (SSO) using the SAML2 standard.
Healthchecks.io supports signing in via one-time sign in links to email.
Cronhub supports authentication using Github.
Site24x7, as a part of Zoho, supports a variety of single-sign-on options.

Years in Business

Site24x7 is the oldest company in the group with 13 years in business. Dead Man’s Snitch and PushMon are the second oldest, dating from 2011-2012. Cronitor, Healthchecks.io and CronAlarm were founded in 2014-2015. Cronhub is the youngest with 2 full years in business.

Head Count

Company size is a double edged sword. On one hand, larger companies seem like the safer option: they are less likely to shut down, and are more likely to have 24/7 staff monitoring their operations.

On the other hand, the smaller companies may have only a few people manning the systems, but they are passionate and committed. From my personal experience, when reporting problems to smaller companies, I’ve often had the issues fixed and a personal reply from a co-founder in hours.

The exact company size is usually not public information and I have only a few data points here:

Cronitor was started by two cofounders, August and Shane. I don’t know for sure but assume they are still a team of two.
Healthchecks.io is a one-man-band (the one man being me, Pēteris)
Cronhub is also a one-man-band, Tigran.

Popularity in Slack App Directory

Slack App Directory appears to be showing the apps by their popularity, and so can be used as an indirect way to compare real-world usage of the different services. I skimmed through the Developer Tools category and noted the positions:

Healthchecks.io and Cronitor are close by in search results on page 4.
Dead Man’s Snitch is on page 5.
Cronhub is on page 8.

In Closing

If you notice any factual errors, please let me know, and I’ll get them fixed ASAP!

There are many more things to compare (Do they have an API? Which country are they based in? What has their historic uptime been like? Which one has the prettiest landing page? …), but I decided to stop here. If you are shopping for a cron monitoring service, you will have to decide what is important for you, and likely do some additional research.

Happy monitoring,
– Pēteris

Fighting Packet Loss with Curl

By Pēteris Caune / January 3, 2020 May 1, 2024

One class of support requests I get at Healthchecks.io is about occasional failed HTTP requests to ping endpoints (hc-ping.com and hchk.io). Following an investigation, the conclusion often is that the failed requests are caused by a packet loss somewhere along the path from the client to the server. The problem starts and ends seemingly at random, presumably as network operators fix failing equipment or change the routing rules. This is mostly opaque to the end users on both ends: you send packets into a “black hole” and they come out at the other end – and sometimes they don’t.

One way to measure packet loss is using the mtr utility:

$ mtr -w -c 1000 -s 1000 -r 2a01:4f8:231:1b68::2
Start: 2019-10-07T06:25:42+0000
HOST: vams                              Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- ???                               100.0  1000    0.0   0.0   0.0   0.0   0.0
  2.|-- ???                               100.0  1000    0.0   0.0   0.0   0.0   0.0
  3.|-- 2001:19f0:5000::a48:129           46.3%  1000    1.7   2.2   1.0  28.9   2.5
  4.|-- ae4-0.ams10.core-backbone.com      4.0%  1000    1.1   2.5   1.0  43.7   4.6
  5.|-- ae16-2074.fra10.core-backbone.com  4.4%  1000    6.8   8.3   6.4  57.7   5.6
  6.|-- 2a01:4a0:0:2021::4                 4.7%  1000    6.8   6.8   6.4  26.4   2.2
  7.|-- 2a01:4a0:1338:3::2                 4.5%  1000    6.7  12.0   6.5 147.7  16.7
  8.|-- core22.fsn1.hetzner.com            4.4%  1000   11.6  16.4  11.4  84.9  14.2
  9.|-- ex9k1.dc14.fsn1.hetzner.com        5.2%  1000   16.7  12.4  11.5  47.4   3.5
 10.|-- 2a01:4f8:231:1b68::2               5.2%  1000   12.1  11.7  11.5  31.3   1.7

The command line parameters used:

`-w`	Puts mtr into wide report mode. When in this mode, mtr will not cut hostnames in the report.
`-c 1000`	The number of pings sent to determine both the machines on the network and the reliability of those machines. Each cycle lasts one second.
`-s 1000`	The packet size used for probing. It is in bytes, inclusive IP and ICMP headers.
`-r`	Report mode. mtr will run for the number of cycles specified by the -c option, and then print statistics and exit.

The last parameter is the IP address to probe. You can also put a hostname (e.g. hc-ping.com) there. The above run shows a 5.2% packet loss from the host to one of the IPv6 addresses used by Healthchecks.io ping endpoints. That’s above what I would consider “normal”, and will sometimes cause latency spikes when making HTTP requests, but the requests will still usually succeed.

Packet loss cannot be completely eliminated: there are always going to be equipment failures and human errors. Some packet loss is also allowed by IP protocol’s design: when a router or network segment is congested, it is expected to drop packets.

I’ve been experimenting with curl parameters to make it more resilient to packet loss. I learned that with enough brute force, curl can get a request through fairly reliably even at 80% packet loss levels. The extra parameters I’m testing below should not be needed, and in an ideal world the HTTP requests would just work. But sometimes they don’t.

For my testing I used iptables to simulate packet loss. For example, this incantation sets up 50% packet loss:

iptables -A INPUT -m statistic --mode random --probability 0.5 -j DROP

Be careful when adding rules like this one over SSH: you may lose access to the remote machine. If you do add the rule, you will probably want to remove it later:

iptables -D INPUT -m statistic --mode random --probability 0.5 -j DROP

I made a quick bash script to run curl in a loop and count failures:

errors=0
start=`date +%s`

for i in {1..20}
do
    echo -e "\nAttempt $i\n"
    # This is the command we are testing:
    curl --retry 3 --max-time 30 https://hc-ping.com/6e1fbf8f-c17e-4749-af44-0c81461bdd19
    if [ $? -ne 0 ]; then
        errors=$((errors+1))
    fi
done

end=`date +%s`
echo -e "\nDone! Attempts: $i, errors: $errors, ok: $(($i - $errors))"
echo -e "Total Time: $((end-start))"

For the baseline, I used the “–retry 3” and “–max-time 30” parameters: curl will retry transient errors up to 3 times, and each attempt is capped to 30 seconds. Without the 30 second limit, curl could sit for hours waiting for missing packets.

Baseline results with no packet loss:

👍 Successful requests	20
💩 Failed requests	0
⏱️ Total time	4 seconds

Baseline results with 50% packet loss:

👍 Successful requests	20
💩 Failed requests	0
⏱️ Total time	2 min 4 s

Baseline results with 80% packet loss:

👍 Successful requests	13
💩 Failed requests	7
⏱️ Total time	17 min 43 s

Next, I increased the number of retries to 20, and reduced the time-per-request to 5 seconds. The idea is to fail quickly and try again, rinse and repeat:

curl --retry 20 -m 5 https://hc-ping.com/6e1fbf8f-c17e-4749-af44-0c81461bdd19

When using the –retry parameter, curl delays the retries using an exponential backoff algorithm: 1 second, 2 seconds, 4 seconds, 8 seconds, … This test was going to take hours so I added an explicit fixed delay:

curl --retry 20 -m 5 --retry-delay 1 https://hc-ping.com/6e1fbf8f-c17e-4749-af44-0c81461bdd19

Fast retries with 1 second retry delay and 80% packet loss:

👍 Successful requests	15
💩 Failed requests	5
⏱️ Total time	18 min 18 s

Of the 5 errors, in 3 cases curl simply ran out of retries, and in 2 cases it aborted with the “Error in the HTTP2 framing layer” error. So I tried HTTP/1.0 instead. To make the results more statistically significant, I also increased the number of runs to 100:

curl -0 --retry 20 -m 5 --retry-delay 1 https://hc-ping.com/6e1fbf8f-c17e-4749-af44-0c81461bdd19

Fast retries over HTTP/1.0 with 80% packet loss:

👍 Successful requests	98
💩 Failed requests	2
⏱️ Total time	51 min 3 s

For a good measure, I ran the baseline version again, now with 100 iterations. Baseline results:

👍 Successful requests	75
💩 Failed requests	25
⏱️ Total time	60 min 22 s

Summary: in a simulated 80% packet loss environment, the “retry early, retry often” strategy clearly beats the default strategy. It would likely reach 100% success rate if I increased the number of retries some more.

Forcing HTTP/1.0 prevents curl from aborting prematurely when it hits the “Error in the HTTP2 framing layer” error.

Going from HTTPS to plain HTTP would likely also help a lot because of the reduced number of required round-trips per request. But trading privacy for potentially more reliability is a questionable trade-off.

From my experience, IPv6 communications over today’s internet are more prone to intermittent packet loss than IPv4. If you have the option to use either, you can pass the “-4” flag to curl and it will use IPv4. This might be a pragmatic choice in short term, but we should also keep pestering ISPs to improve their IPv6 reliability.

If you experience failed HTTP requests to Healthchecks.io, and fixing the root cause is outside of your control, adding the above retry parameters to your curl calls can help as a mitigation. Also, curl is awesome.

Happy curl’ing,
–Pēteris, Healthchecks.io

Preventing Office 365 ATP From Clicking Unsubscribe Links

By Pēteris Caune / December 20, 2019 December 20, 2019

Office 365 has a fancy optional feature called Advanced Threat Protection (ATP). It scans incoming emails for malware. It also opens any links in the emails and scans the contents of the links as well. Unfortunately, if an email message has an “Unsubscribe” link, ATP will “click” that link too and potentially unsubscribe the user.

The standard fix is to make sure a simple HTTP GET request does not unsubscribe the user. HTTP GET requests should be “safe” and side-effect free. To achieve one-click unsubscribe, my approach used to be:

A HTTP GET request to the unsubscribe URL serves a HTML form, and a tiny bit of JS to submit the form on page load
A HTTP POST request actually unsubscribes the user

This has been working fine with most email security scanners and link preview generator bots. But Office 365 ATP is more sophisticated than others: it scans URLs found in email messages by loading them in a full-blown browser, and the browser executes JS. This defeats my simple bot protection, and insta-unsubscribes Office 365 users with ATP enabled from Healthchecks.io notifications the moment they receive the first notification.

A simple solution would be to remove the auto-submit JS code, and always require a manual click from the user to confirm the unsubscribe. But I really didn’t want to give up the single-click unsubscribe functionality, and was looking for alternate solutions. I asked on StackOverflow and got an answer with several good ideas (thanks Adam!). I implemented the “timer” idea:

If the unsubscribe link is opened less than 5 minutes after sending the email, treat it as potential bot activity, and require manual confirmation (user/bot needs to do one extra click).
Otherwise, assume it’s human and auto-submit the form on page load.

From what I’ve seen in practice, ATP scans the links as soon as it receives the email. So it doesn’t receive the auto-submit JS code, and cannot execute it. I’ve verified with an affected user that this mitigation indeed seems to be working: since deploying the fix they have received several Healthchecks.io notifications, and ATP has not auto-unsubscribed them yet. So we’re good, at least until Office 365 changes tactics.

PS. There is also RFC 8058: “Signaling One-Click Functionality for List Email Headers”. It specifies the “List-Unsubscribe-Post” email header:

List-Unsubscribe: https://example.com/unsubscribe/opaquepart
List-Unsubscribe-Post: List-Unsubscribe=One-Click

This tells the email client: to unsubscribe without additional confirmation step, send HTTP POST to the URL in the “List-Unsubscribe” header. The email client can implement its own “Unsubscribe” function in its UI. For example, in Gmail you may see an “Unsubscribe” link next to sender’s address:

I’ve implemented this in Healthchecks.io, but I need an “unsubscribe” link in email’s footer too because people still look for it there.

Related discussion:

– Pēteris