Incident Report – 27 October 2019

On 27 October around 18:30 UTC, one of the load balancers started experiencing network issues. A monitoring script correctly detected this and removed the affected load balancer from DNS rotation, but some requests were still lost while the DNS changes propagated.
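The report doesn't show the monitoring script itself. As a minimal sketch of the "detect and remove from rotation" logic (the function names, the probe, and the rotation representation here are all hypothetical stand-ins, not the actual script):

```python
# Hypothetical sketch: a health probe plus removal from a DNS rotation.
# The real script and DNS provider API are not described in the report.

def check_health(probe) -> bool:
    """Run a probe callable; treat any exception or non-True result as unhealthy."""
    try:
        return probe() is True
    except Exception:
        return False

def update_rotation(rotation, lb_name, healthy):
    """Return the DNS rotation, with lb_name dropped when it is unhealthy."""
    if healthy:
        return rotation
    return [name for name in rotation if name != lb_name]
```

With two load balancers in rotation, a failing probe for one of them would leave only the healthy peer answering DNS, which matches the behavior described above.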

After receiving a monitoring alert, I jumped in to investigate. The load balancers and backend servers are set up to communicate over a virtual private network (Hetzner’s vSwitch feature). The problematic load balancer was intermittently unable to reach the backend servers over their private IP addresses, and HAProxy was reporting backends sporadically flipping between healthy and unhealthy states.
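Intermittent loss like this shows up as a connect success rate well below 100% rather than a hard failure. A small probe along these lines (the host/port values are placeholders, not the actual vSwitch addresses) can make the flapping visible:

```python
import socket

def probe_backend(host: str, port: int, timeout: float = 2.0) -> bool:
    """Attempt one TCP connection; True on success, False on timeout or refusal."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def success_rate(host: str, port: int, attempts: int = 10) -> float:
    """Fraction of successful connects over several attempts. Intermittent
    private-network loss appears as a rate somewhere between 0.0 and 1.0."""
    ok = sum(probe_backend(host, port) for _ in range(attempts))
    return ok / attempts
```

Repeating such a probe against each backend's private IP is roughly what HAProxy's own health checks were doing when they reported the backends flapping.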

I opened a support request with Hetzner at 19:03 UTC and in the meantime tried various things to fix the issue myself. Hetzner support confirmed they were dealing with an issue, and at 22:21 UTC they reported it was resolved. The load balancer could reach all its peers over their vSwitch IP addresses again. I monitored connectivity for one more hour and then added the load balancer back to DNS rotation.

At the time of writing, I don’t yet know what the exact root cause was. The vSwitch configuration had worked with absolutely no problems for many months, and this incident came seemingly out of the blue. I requested more details from Hetzner support; so far they have said they are still analyzing the incident, but that the vSwitch feature itself is considered “very stable”. I have not made up my mind whether to replace the vSwitch networking with something else. I don’t want to make hasty decisions, and will first wait for more information from Hetzner.

In summary, the good:

  • A monitoring script correctly detected a malfunctioning load balancer and removed it from DNS rotation.
  • Luckily, I was able to investigate immediately (the incident happened on a Sunday at 8:30 PM local time).
  • Quick communication from Hetzner support: 7 updates from them in 3 hours.
  • And, what really matters in the end: they got the issue fixed.
  • The other load balancer remained fully functional the whole time. Some of the uptime monitoring services didn’t even notice an outage.

And the bad:

  • Some pings from client systems did get lost. This likely resulted in a number of false “X is down” alerts after the outage.

I apologize to all users for any inconvenience caused. For a monitoring service, any downtime is unacceptable, and I will continue to look for any opportunities to make the service more robust.

– Pēteris Caune