Due to the nature of cron jobs, our ping endpoints (https://hc-ping.com/...
URLs) receive spiky traffic:
- On average we write ~500 ping requests/second to the database.
- At the start of every minute, we write around 4000 pings/second.
- At the start of every hour, we write over 10’000 pings/second.
The Healthchecks open-source project includes a fully functional, tested and type-annotated ping handler written in Python. On self-hosted Healthchecks instances, when you send an HTTP request to a ping URL, a Django view collects and validates information from the request, then uses Django ORM to update a Check object in the database and insert a Ping object in the database. This approach is good for tens to low hundreds of requests per second, depending on hardware. For example, on my dev machine (Intel i7-9700K CPU, NVMe SSD, PostgreSQL runs locally on the same machine), using uWSGI as the web server, and benchmarking with wrk, Healthchecks can serve about 250 ping requests per second.
In the name of efficiency, since nearly the beginning, the hc-ping.com endpoints have run an alternative, closed-source ping handler written in Go. The Go app used to work like this:
- HTTP handler collects and validates information from the request, and puts a Job object on a queue (a buffered Go channel).
- A worker goroutine runs a loop which reads a Job object from the channel, and calls a PostgreSQL stored procedure which executes the required SELECTs, UPDATEs, INSERTs, and DELETEs.
In the production environment, the database runs on a separate server with around 1ms network latency. Pings are processed sequentially over one database connection. However, because the stored procedure encapsulates multiple SQL operations, in simple cases there is only one round-trip to the database per ping request.
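For illustration, here is a rough sketch of that original design in Go. The type names, the queue size, and the process_ping procedure name are my stand-ins, not the actual closed-source code:

```go
package main

import (
	"context"
	"log"

	"github.com/jackc/pgx/v5"
)

// Job carries the validated fields from one ping request.
// The real Job type has more fields; these are illustrative.
type Job struct {
	CheckID string // check UUID from the ping URL
	Kind    string // e.g. "", "start", "fail"
	Body    []byte // request body, truncated to a size limit
}

// Buffered queue between the HTTP handlers and the worker(s).
var jobs = make(chan Job, 1000)

// worker processes pings sequentially over a single connection:
// one stored-procedure call, and so one round-trip, per ping.
func worker(ctx context.Context, conn *pgx.Conn) {
	for job := range jobs {
		_, err := conn.Exec(ctx, "CALL process_ping($1, $2, $3)",
			job.CheckID, job.Kind, job.Body)
		if err != nil {
			log.Printf("process_ping failed: %v", err)
		}
	}
}
```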
If we imagined the worker goroutine as a coal mine, the database as a power station, and a database transaction as a train, we could illustrate this in OpenTTD like so:

This setup was working quite well! As the volume of pings gradually grew, in 2021, I updated the app to use two worker goroutines reading from the same Jobs channel:

In 2024, I increased the worker count to three, and in 2025 to four. With three physical servers running the Go app, there were a total of 3*4=12 sequential processes writing ping requests to the database. With the current hardware and network, they could cumulatively achieve around 5000 requests per second.
Batching, First Attempt
To increase ping processing throughput, in 2024, I investigated a batching idea. The idea was to collect multiple Job objects in a batch, pass them as an array to another stored procedure, which would iterate over them and call my existing stored procedure for each. I prototyped the idea and even deployed it to production for a short period, but ultimately switched back to the previous approach. The reasons for switching back were:
- Things were getting quite complex. I had to use an array of a custom composite type to pass data to PostgreSQL. Passing data back from the stored procedure was also getting tricky. Debugging issues in PL/pgSQL was getting tedious with stored procedures calling other procedures several levels deep.
- The performance of the new version was not that much better. I no longer have specific measurements, but there was only a mild improvement in throughput.
- I found a serious bug in the code, and had to roll back to the previous version in a hurry. Rather than fixing the bug, I later decided to scrap the idea.
Batching, Take Two
In the back of my mind, I kept mulling over the batching ideas. I had read in the pgx documentation that COPY can be faster than INSERT with as few as five rows. As a performance optimization junkie, I was very tempted to find a way to use it. I also wanted to move processing and conditional logic from the stored procedure to the Go code. Web servers scale horizontally, but the database is a bottleneck, at least for writes. Anything that can be done outside the database should ideally be done outside the database.
Over the course of a month, I put together a new version:
- HTTP handler collects and validates information from the request, and puts a Job object on a queue (a buffered Go channel).
- A worker goroutine reads Jobs from the channel and assembles a Batch. It reads jobs until it either reaches a batch size limit (currently 100 items) or a timeout (currently 5 ms); see the sketch after this list.
- The worker then starts a database transaction and passes the assembled batch to a separate routine, which:
  - SELECTs the needed data from the database (using a single SELECT for all items in the batch)
  - UPDATEs the checks in the database (a separate UPDATE per batch item, but the queries are pipelined)
  - Inserts pings using the COPY protocol (a single COPY operation for all items in the batch)
- The worker then commits the transaction.
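Here is roughly what the batch-assembly step could look like, reusing the hypothetical Job type and jobs channel from the earlier sketch. The size and timeout limits mirror the ones above; the structure is illustrative, not the production code:

```go
// assembleBatch blocks until at least one job is available, then keeps
// collecting jobs until the batch is full or the timeout fires.
// Requires "time" in addition to the earlier imports.
func assembleBatch(jobs <-chan Job) []Job {
	const maxBatch = 100                 // batch size limit
	const maxWait = 5 * time.Millisecond // batch assembly timeout

	batch := make([]Job, 0, maxBatch)
	batch = append(batch, <-jobs) // wait for the first job

	timeout := time.After(maxWait)
	for len(batch) < maxBatch {
		select {
		case job := <-jobs:
			batch = append(batch, job)
		case <-timeout:
			return batch
		}
	}
	return batch
}
```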
In terms of OpenTTD:

Of course, there were additional details and special cases to take care of, including:
- When pinging by slug, check IDs need to be looked up by slug and ping key.
- When using auto provisioning, checks may need to be created on the fly.
- Jobs need to be sorted in a consistent order to avoid deadlocks between concurrent transactions (see the sketch after this list).
- Old pings need to be regularly cleaned up. However, this does not need to hold up HTTP request processing and can be done in a separate goroutine over a separate database connection.
- If a database connection goes bad mid-transaction, the app must be able to reconnect and re-run the transaction without losing or duplicating any pings.
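On the deadlock point above: sorting each batch by check ID before issuing the UPDATEs gives every transaction the same lock-acquisition order. A tiny sketch, using the hypothetical CheckID field from earlier and the standard sort package:

```go
// Two concurrent batches that touch the same checks will now lock
// them in the same order, so they cannot deadlock each other.
sort.Slice(batch, func(i, j int) bool {
	return batch[i].CheckID < batch[j].CheckID
})
```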
The new version of the app no longer uses stored procedures and instead runs regular SQL queries. I could still, in some cases, combine several queries into one by using common table expressions (WITH clauses) and/or subqueries.
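To make the SELECT / pipelined UPDATE / COPY sequence concrete, here is a much-simplified sketch of a per-batch routine using pgx. The table and column names are stand-ins loosely modeled on the schema, the per-check bookkeeping is reduced to a single counter, and error handling is trimmed; the real routine is considerably more involved:

```go
func processBatch(ctx context.Context, tx pgx.Tx, batch []Job) error {
	// 1. One SELECT for the whole batch: load the checks being pinged.
	ids := make([]string, len(batch))
	for i, job := range batch {
		ids[i] = job.CheckID
	}
	rows, err := tx.Query(ctx,
		`SELECT id, n_pings FROM api_check WHERE id = ANY($1::uuid[])`, ids)
	if err != nil {
		return err
	}
	nPings := make(map[string]int, len(batch))
	for rows.Next() {
		var id string
		var n int
		if err := rows.Scan(&id, &n); err != nil {
			return err
		}
		nPings[id] = n
	}
	if err := rows.Err(); err != nil {
		return err
	}

	// 2. One UPDATE per batch item, queued into a pgx.Batch so the
	// whole set is pipelined to the server instead of sent one by one.
	var b pgx.Batch
	copyRows := make([][]any, 0, len(batch))
	for _, job := range batch {
		n := nPings[job.CheckID] + 1
		b.Queue(
			`UPDATE api_check SET last_ping = now(), n_pings = $2 WHERE id = $1`,
			job.CheckID, n)
		copyRows = append(copyRows, []any{job.CheckID, n, job.Kind, job.Body})
	}
	br := tx.SendBatch(ctx, &b)
	for range batch {
		if _, err := br.Exec(); err != nil {
			br.Close()
			return err
		}
	}
	if err := br.Close(); err != nil {
		return err
	}

	// 3. One COPY for all the new ping rows.
	_, err = tx.CopyFrom(ctx,
		pgx.Identifier{"api_ping"},
		[]string{"owner_id", "n", "kind", "body"},
		pgx.CopyFromRows(copyRows))
	return err
}
```

The point of this shape is that the number of round-trips stays roughly constant (one SELECT, one pipelined set of UPDATEs, one COPY) no matter how many pings are in the batch.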
For correctness testing, I reused the same Django tests that the Python version of the ping handler uses. I adapted them to make real HTTP requests to the Go app instead of using Django’s test client.
Batching, but with Two Workers
The batching strategy trades some latency for better throughput. To improve the median latency, I reintroduced the multiple workers idea: while one worker is submitting a batch to the database, another worker can already be assembling the next batch. To prevent worker goroutines from racing each other, I used a mutex to allow only one worker to assemble a batch at any given time. OTTD time:

This is the version currently serving requests on hc-ping.com. From the logs, I see it handles traffic spikes of over 11’000 requests per second, with no significant backlog forming. I have not benchmarked the production environment, but when benchmarking on my dev system, it achieved over 20’000 requests per second.
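The two-worker arrangement could look roughly like this in code, continuing the earlier sketches (it uses sync and github.com/jackc/pgx/v5/pgxpool on top of the earlier imports; assembleBatch, processBatch, and the jobs channel are the hypothetical pieces from above, and the real app re-runs a failed transaction rather than dropping the batch):

```go
// Only one worker may assemble a batch at a time. While the other
// worker is inside its database transaction, this one can already be
// pulling the next batch off the channel.
var assembleMu sync.Mutex

func batchWorker(ctx context.Context, pool *pgxpool.Pool) {
	for {
		assembleMu.Lock()
		batch := assembleBatch(jobs)
		assembleMu.Unlock()

		tx, err := pool.Begin(ctx)
		if err != nil {
			log.Printf("begin: %v", err) // retried instead of dropped in the real app
			continue
		}
		if err := processBatch(ctx, tx, batch); err != nil {
			tx.Rollback(ctx)
			log.Printf("batch: %v", err) // likewise retried in the real app
			continue
		}
		if err := tx.Commit(ctx); err != nil {
			log.Printf("commit: %v", err)
		}
	}
}

// Started at application startup:
//   go batchWorker(ctx, pool)
//   go batchWorker(ctx, pool)
```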
Lifecycle Of a Ping Request
Here’s an overview of the processing steps each ping goes through in the current setup:
- Client makes an HTTP request to an https://hc-ping.com/... URL or sends an email to an ...@hc-ping.com email address.
- The request arrives at one of our HAProxy load balancers. The load balancer applies the initial, lax rate limiting and tarpits particularly spammy clients. It proxies the good requests to NGINX running on application servers.
- NGINX applies stricter rate limiting and some geo-blocking. It proxies the good requests to the Go app.
- The Go app’s HTTP handler assembles a Job object and checks that it is not in the “404 cache” (a sketch of this cache follows after this list). Sidenote about the 404 cache: when a client pings a check that does not exist, the Go app returns a 404 response and also caches this fact. This way, the app can respond to subsequent requests for the same URL without hitting the database. We can do this optimization only for UUID URLs and not slug URLs, because UUIDs are assigned randomly, and clients cannot pick them. If a check with a given UUID does not exist now, we can relatively safely assume it will also not exist a minute, an hour, or a year later. The HTTP handler puts the good Job objects in a queue.
- One of the worker goroutines picks up the job from the queue and adds it to a batch. When a batch is assembled, the worker starts a database transaction and executes a series of SELECT, UPDATE, and COPY SQL commands to process the entire batch as a single unit. After it commits the transaction, the ping is in the database.
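The 404 cache mentioned above can be as simple as a mutex-guarded set of UUIDs that are known not to exist; a minimal sketch (the real implementation may bound the cache size or differ in other details):

```go
// notFound remembers check UUIDs that produced a 404, so repeat pings
// to the same nonexistent UUID can be rejected without a database hit.
var (
	notFoundMu sync.Mutex
	notFound   = make(map[string]struct{})
)

func isKnown404(checkID string) bool {
	notFoundMu.Lock()
	defer notFoundMu.Unlock()
	_, ok := notFound[checkID]
	return ok
}

func remember404(checkID string) {
	notFoundMu.Lock()
	defer notFoundMu.Unlock()
	notFound[checkID] = struct{}{}
}
```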
Next Steps For More Throughput
What happens when we need more throughput? A couple of thoughts:
- Year by year, CPUs and NVMes are still getting faster. The servers Healthchecks.io is running on now are much faster than the servers we started with 10 years ago. We are not yet on the fastest reasonably available hardware; there is still room for vertical scaling.
- Likewise, every major version of PostgreSQL is adding new optimizations and is getting incrementally faster.
- I can likely eke out some additional throughput by tuning the maximum batch size and the worker count. And by tuning PostgreSQL configuration. And by using slightly more aggressive rate limiter settings.
When the above is not enough and we still need more throughput, the request backlogs will take longer and longer to clear after each traffic spike. When the average request rate creeps above what the system can handle, things will start to fall apart. I will need to have a scaling solution ready well in advance of that time.
Thanks for reading,
–Pēteris