Notes on Self-hosted Transactional Email

Since a little more than two months ago, Healthchecks.io has been sending transactional email (~300’000 emails per month) through its own SMTP server. Here are my notes on setting it up.

The Before

Before going self-hosted, Healthchecks sent email using 3rd-party SMTP relays: AWS SES and later Elastic Email.

The reason for switching from AWS to Elastic Email was GDPR compliance: at the time United States did not have an EU adequacy decision, but Canada (the registration country of Elastic Email Inc.) did.

The primary reason I kept looking for alternatives to Elastic Email was also GDPR compliance: a country with EU adequacy decision is good, but being based in the EU is even better. Another reason was their poor communication during service outages: some outages were not acknowledged on their status page, there were no timely updates via support chat or otherwise, and there were no post-mortems published after outages. To their credit, Elastic Email did fix the outages reasonably quickly, and I was overall happy with the service in terms of functionality and pricing.

The EU-based SMTP Relay Options

There are few EU-based SMTP relay services. None of the big names (AWS SES, Sendgrid, Mailgun, Mailchimp, Postmark) are EU-based. I tested a few options:

  • EmailLabs: OK in terms of functionality and pricing. Judging by the mix of Polish and English in the user interface and documentation seemed geared primarily to the Polish market.
  • SMTPeter: OK in terms of functionality and pricing. It was probably just bad timing, but had a major outage while I was testing it. Small shop.
  • Brevo (formerly Sendinblue): the most prominent EU SMTP relay service. Has open and click tracking enabled by default, and refused to turn it off before seeing live production traffic, so a non-starter for me.

None of the options seemed like an upgrade over what I already had, and I kept circling back to the idea of self-hosting. The common wisdom is that self-hosting email means endless deliverability problems, but maybe-maybe?

The Self-Hosting Options

In May 2023, I spent several weeks researching and trialing self-hosted SMTP servers: mox, Postal, Haraka, Zone MTA, OpenSMTPD, and maddy. My brain was getting fried from jumping between documentation sites, trying to make sense of the feature sets, and the pros and cons of each project. One thing that helped immensely was reading Email explained from first principles – it filled many gaps in my knowledge of email delivery.

Maddy

After experimenting with and strongly considering OpenSMTPD, I ultimately picked maddy. I iterated on a test configuration until I got it to do the required things:

  • Accept email on port 465 from authenticated users.
  • Rewrite its envelope sender from “@healthchecks.io” to “@mail.healthchecks.io” (required for routing bounce messages back to the maddy server).
  • Sign it using DKIM protocol.
  • Put outgoing messages in a queue, attempt to deliver them, and retry with exponential backoff.
  • Deliver messages to remote MTAs from a single, specific IP.
  • When delivery fails, send a webhook notification to a designated webhook handler. For permanent failures, the handler can take appropriate action – unsubscribe a specific user from email reports, or mark a specific email integration as disabled.
  • Listen for incoming email on port 25.
  • When a remote MTA sends a DSN (delivery status notification, “bounce message”), deliver it to the same webhook.
  • Use an automatically provisioned LetsEncrypt certificate for TLS encryption on port 465 and port 25.

I wrote the provisioning scripts for deploying Maddy and its configuration to a server. I added and updated the required DNS entries for SPF and DKIM. I implemented, tested, and deployed the webhook handler that would receive bounce notifications from maddy.

IP Warm-Up

I spent several weeks gradually switching outgoing email traffic from Elastic Email to the self-hosted maddy server. IP warm-up serves two purposes:

  • Slowly builds up the reputation of the sending IP address. Switching the entire sending volume to a new IP address all at once risks getting blocked by the receiving servers.
  • It lets me test email delivery in the production environment and fix any potential problems with fewer negative consequences.

The Failover IP Oopsie

One issue I discovered during the IP warm-up phase was that the brand-new mail server (a Hetzner AX41 dedicated server) experienced minute-long network hiccups a couple of times per day. The cause could be a faulty NIC, a faulty switch, or a noisy neighbor, and the easiest fix is ordering another server and hoping for better luck. In anticipation of such a scenario, I had ordered a failover IP so I could keep using the already warmed-up IP with the new server.

I set up a new server, switched the failover IP to it, and after a few days of testing, no more network hiccups! So I went ahead and canceled the original machine. Then, a few days later, around 2 AM local time, my monitoring notifications went off: email delivery was broken. I had assumed that “failover IP” is more or less what other providers call “floating IP.” Dazed and confused in front of a blue screen in the middle of the night, I realized my misunderstanding and mistake: the failover IP is owned by a specific server. Canceling the server also cancels the failover IP with all its sender reputation.

To fix the immediate problem, I temporarily switched the web servers back to using Elastic Email as the SMTP relay. I asked Hetzner support if there was any way I could get the released IP back. Minutes later, I got a reply stating in perfect German calmness that my request would need to be handled by a different department during business hours. The next morning, Hetzner added the lost failover IP back to my account. Phew!

Local Relay-Only MTAs for Reliability

The internet-facing SMTP server (mail.healthchecks.io) runs on a single machine. Each app server also runs a local maddy instance which accepts outgoing email messages from local clients only, and hands them off to mail.healthchecks.io. If mail.healthchecks.io is unavailable (for example, during server restart), the local maddy instances queue the messages and retry them later.

Summary, Pros, and Cons

The self-hosted maddy server has been handling all Healthchecks.io transactional email for over two months. I am keeping an eye on bounce notifications, the outbound email queue size, and blocklists. So far, there have been no significant deliverability issues–fingers crossed!

Cons of going self-hosted:

  • Self-hosted SMTP server is another service to maintain. It uses up the limited time and mental bandwidth I have.
  • The inevitable deliverability problems will be my problems.
  • In the case of maddy, no pre-built graphical management and monitoring dashboard.

And pros:

  • Complete control of subprocessors with access to customer data (just Hetzner in my case).
  • Complete control over server configuration.
  • Fixed direct costs (as long as a single server can keep up with the sending volume).
  • I learned a bunch of new stuff!

Thanks for reading,
Pēteris