Fighting Packet Loss with Curl

One class of support requests I get at Healthchecks.io is about occasional failed HTTP requests to the ping endpoints (hc-ping.com and hchk.io). Following an investigation, the conclusion often is that the failed requests are caused by packet loss somewhere along the path from the client to the server. The problem starts and ends seemingly at random, presumably as network operators fix failing equipment or change routing rules. This is mostly opaque to the users on both ends: you send packets into a “black hole” and they come out at the other end – and sometimes they don’t.

One way to measure packet loss is to use the mtr utility:

$ mtr -w -c 1000 -s 1000 -r 2a01:4f8:231:1b68::2
Start: 2019-10-07T06:25:42+0000
HOST: vams                              Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- ???                               100.0  1000    0.0   0.0   0.0   0.0   0.0
  2.|-- ???                               100.0  1000    0.0   0.0   0.0   0.0   0.0
  3.|-- 2001:19f0:5000::a48:129           46.3%  1000    1.7   2.2   1.0  28.9   2.5
  4.|-- ae4-0.ams10.core-backbone.com      4.0%  1000    1.1   2.5   1.0  43.7   4.6
  5.|-- ae16-2074.fra10.core-backbone.com  4.4%  1000    6.8   8.3   6.4  57.7   5.6
  6.|-- 2a01:4a0:0:2021::4                 4.7%  1000    6.8   6.8   6.4  26.4   2.2
  7.|-- 2a01:4a0:1338:3::2                 4.5%  1000    6.7  12.0   6.5 147.7  16.7
  8.|-- core22.fsn1.hetzner.com            4.4%  1000   11.6  16.4  11.4  84.9  14.2
  9.|-- ex9k1.dc14.fsn1.hetzner.com        5.2%  1000   16.7  12.4  11.5  47.4   3.5
 10.|-- 2a01:4f8:231:1b68::2               5.2%  1000   12.1  11.7  11.5  31.3   1.7

The command line parameters used:

-w: Puts mtr into wide report mode. In this mode, mtr will not truncate hostnames in the report.
-c 1000: The number of pings sent to determine both the machines on the network and the reliability of those machines. Each cycle lasts one second.
-s 1000: The packet size used for probing, in bytes, including the IP and ICMP headers.
-r: Report mode. mtr will run for the number of cycles specified by the -c option, and then print statistics and exit.

The last parameter is the IP address to probe. You can also put a hostname (e.g. hc-ping.com) there. The above run shows a 5.2% packet loss from the host to one of the IPv6 addresses used by Healthchecks.io ping endpoints. That’s above what I would consider “normal”, and will sometimes cause latency spikes when making HTTP requests, but the requests will still usually succeed.
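For a quick spot check, you can also run mtr with fewer cycles against the hostname directly. With a smaller sample the loss percentages are less statistically reliable, but the run finishes in about two minutes:

$ mtr -w -c 100 -s 1000 -r hc-ping.com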

Packet loss cannot be completely eliminated: there are always going to be equipment failures and human errors. Some packet loss is also allowed by the IP protocol’s design: when a router or network segment is congested, it is expected to drop packets.


I’ve been experimenting with curl parameters to make it more resilient to packet loss. I learned that with enough brute force, curl can get a request through fairly reliably even at 80% packet loss levels. The extra parameters I’m testing below should not be needed, and in an ideal world the HTTP requests would just work. But sometimes they don’t.

For my testing I used iptables to simulate packet loss. For example, this incantation sets up 50% packet loss:

iptables -A INPUT -m statistic --mode random --probability 0.5 -j DROP    

Be careful when adding rules like this one over SSH: you may lose access to the remote machine. If you do add the rule, you will probably want to remove it later:

iptables -D INPUT -m statistic --mode random --probability 0.5 -j DROP
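
Be aware that the rules above affect all inbound traffic, including your SSH session. A safer variant is to scope the rule to just the traffic you are testing. For example, this rule (a sketch, not something I benchmarked) only drops inbound packets coming from remote HTTPS servers, and leaves SSH alone:

iptables -A INPUT -p tcp --sport 443 -m statistic --mode random --probability 0.5 -j DROP

And if you are testing against an IPv6 endpoint, note that iptables only affects IPv4 – the equivalent rules go into ip6tables instead.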

I made a quick bash script to run curl in a loop and count failures:

#!/bin/bash

errors=0
start=$(date +%s)

for i in {1..20}
do
    echo -e "\nAttempt $i\n"
    # This is the command we are testing:
    curl --retry 3 --max-time 30 https://hc-ping.com/6e1fbf8f-c17e-4749-af44-0c81461bdd19
    # Any non-zero exit status counts as a failure:
    if [ $? -ne 0 ]; then
        errors=$((errors+1))
    fi
done

end=$(date +%s)
echo -e "\nDone! Attempts: $i, errors: $errors, ok: $(($i - $errors))"
echo -e "Total time: $((end - start)) seconds"

For the baseline, I used the “--retry 3” and “--max-time 30” parameters: curl will retry transient errors up to 3 times, and each attempt is capped to 30 seconds. Without the 30 second limit, curl could sit for hours waiting for missing packets.
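
One thing to keep in mind: --retry only retries errors that curl considers transient, such as timeouts and some retriable 5xx HTTP responses; it does not retry everything. A quick way to check the outcome of a single attempt is to look at curl’s exit status (this uses the same test URL as the script above):

curl --retry 3 --max-time 30 https://hc-ping.com/6e1fbf8f-c17e-4749-af44-0c81461bdd19
echo "Exit status: $?"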

Baseline results with no packet loss:

👍 Successful requests: 20
💩 Failed requests: 0
⏱️ Total time: 4 seconds

Baseline results with 50% packet loss:

👍 Successful requests: 20
💩 Failed requests: 0
⏱️ Total time: 2 min 4 s

Baseline results with 80% packet loss:

👍 Successful requests: 13
💩 Failed requests: 7
⏱️ Total time: 17 min 43 s

Next, I increased the number of retries to 20, and reduced the time limit per attempt to 5 seconds (-m is the short form of --max-time). The idea is to fail quickly and try again, rinse and repeat:

curl --retry 20 -m 5 https://hc-ping.com/6e1fbf8f-c17e-4749-af44-0c81461bdd19

When using the --retry parameter, curl delays the retries using an exponential backoff algorithm: 1 second, 2 seconds, 4 seconds, 8 seconds, and so on. With up to 20 retries, the backoff delays alone add up to more than an hour per fully failed request, so this test was going to take hours. To speed it up, I added an explicit fixed delay:

curl --retry 20 -m 5 --retry-delay 1 https://hc-ping.com/6e1fbf8f-c17e-4749-af44-0c81461bdd19

Fast retries with 1 second retry delay and 80% packet loss:

👍 Successful requests: 15
💩 Failed requests: 5
⏱️ Total time: 18 min 18 s

Of the 5 errors, in 3 cases curl simply ran out of retries, and in 2 cases it aborted with an “Error in the HTTP2 framing layer” error. So I tried HTTP/1.0 instead. To make the results more statistically significant, I also increased the number of runs to 100:

curl -0 --retry 20 -m 5 --retry-delay 1 https://hc-ping.com/6e1fbf8f-c17e-4749-af44-0c81461bdd19

Fast retries over HTTP/1.0 with 80% packet loss:

👍 Successful requests: 98
💩 Failed requests: 2
⏱️ Total time: 51 min 3 s

For good measure, I ran the baseline version again, now with 100 iterations. Baseline results with 80% packet loss:

👍 Successful requests: 75
💩 Failed requests: 25
⏱️ Total time: 60 min 22 s

Summary: in a simulated 80% packet loss environment, the “retry early, retry often” strategy clearly beats the default strategy. It would likely reach a 100% success rate if I increased the number of retries some more.

Forcing HTTP/1.0 prevents curl from aborting prematurely when it hits the “Error in the HTTP2 framing layer” error.
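
Dropping all the way down to HTTP/1.0 is a blunt instrument, though. If your curl version supports it, the --http1.1 flag should also sidestep the HTTP/2 framing errors while keeping HTTP/1.1 features such as keep-alive. I have not benchmarked this variant:

curl --http1.1 --retry 20 -m 5 --retry-delay 1 https://hc-ping.com/6e1fbf8f-c17e-4749-af44-0c81461bdd19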

Going from HTTPS to plain HTTP would likely also help a lot, because of the reduced number of round-trips required per request. But trading privacy for potentially better reliability is a questionable trade-off.

In my experience, IPv6 communications over today’s internet are more prone to intermittent packet loss than IPv4. If you have the option to use either, you can pass the “-4” flag to curl and it will use IPv4. This might be a pragmatic choice in the short term, but we should also keep pestering ISPs to improve their IPv6 reliability.
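
Putting it all together, here is the full combination of parameters discussed above (drop “-4” if you want to stay on IPv6):

curl -0 -4 --retry 20 -m 5 --retry-delay 1 https://hc-ping.com/6e1fbf8f-c17e-4749-af44-0c81461bdd19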


If you experience failed HTTP requests to Healthchecks.io, and fixing the root cause is outside of your control, adding the above retry parameters to your curl calls can help as a mitigation. Also, curl is awesome.

Happy curl’ing,
–Pēteris, Healthchecks.io