How we used git bisect to debug HAProxy
Recently, we at CV-Library were preparing an upgrade to the server which runs many of our background processes. It felt pretty risky, because although none of our code would be changing, all the versions of CPAN modules and supporting software would be updated. So we brought up a test server, and started looking for problems.
When testing a script that updates our Solr indexes, we saw these strange errors coming back from Tomcat:
Tomcat has a history of returning 505 (HTTP Version Not Supported) status codes due to its over-strict parsing of HTTP. But we hadn’t changed our code! Our suspicions immediately turned to HAProxy, which we use to load balance across the Solr cluster: the script did not return this error if we pointed it directly at Tomcat.
The first step was to boil down our complex script into a reproducible test case - one which did not hit our databases, making it much faster to track down the cause.
It turned out that the keep_alive option was essential to reproducing the issue. We know from past experience that our script’s performance suffers if we disable this option.
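As a rough illustration of what such a boiled-down test case can look like (this is not the original script, just a minimal Python sketch), the idea is to send several requests over one keep-alive connection and record the status code of each. The path `/solr/admin/ping` and the names `statuses_over_one_connection`, `is_good` and `main` are assumptions made for the example.

```python
import http.client


def statuses_over_one_connection(conn, paths):
    """Send one GET per path over the same connection and return the status codes."""
    statuses = []
    for path in paths:
        conn.request("GET", path, headers={"Connection": "keep-alive"})
        response = conn.getresponse()
        response.read()  # drain the body so the connection can be reused
        statuses.append(response.status)
    return statuses


def is_good(statuses):
    """The run counts as good only if every request came back with 200."""
    return all(status == 200 for status in statuses)


def main(host, port):
    """Return 0 if all requests succeed, 1 otherwise (handy as an exit code)."""
    conn = http.client.HTTPConnection(host, port, timeout=10)
    try:
        # Hypothetical path: substitute whatever URL triggers the error.
        statuses = statuses_over_one_connection(conn, ["/solr/admin/ping"] * 3)
    finally:
        conn.close()
    print(statuses)
    return 0 if is_good(statuses) else 1
```

Pointing `main` first at the proxy and then directly at the backend gives a quick way to compare the two, without touching any database.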
We were then able to confirm that the problem did not show under HAProxy 1.4.25, which we were using previously. Studying the traffic using Wireshark did not show anything obvious, and we confirmed that the issue persisted on HAProxy’s master branch.
Since we were really stuck by this point, we resorted to a git bisect. This took a few goes to get right, but we ended up with a script like this:
Run like this:
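The original script and command are not reproduced here. As a hedged sketch of the general shape of a helper for `git bisect run`, the Python below builds the current revision, starts the proxy, runs a test command, and maps the outcome onto the exit codes that `git bisect run` understands: 0 for good, 1 for bad, and 125 for "cannot test this revision, skip it". The function name `bisect_step` and its parameters are assumptions for the example, as are any build and start commands you pass in.

```python
import subprocess
import time

GOOD, BAD, SKIP = 0, 1, 125  # exit codes interpreted by `git bisect run`


def bisect_step(build_cmd, proxy_cmd, test_cmd, startup_wait=1.0):
    """Build, start the proxy, run the test, and return a bisect exit code."""
    # A revision that does not build cannot be judged, so skip it.
    if subprocess.run(build_cmd).returncode != 0:
        return SKIP

    proxy = subprocess.Popen(proxy_cmd)
    try:
        time.sleep(startup_wait)  # crude wait for the proxy to start listening
        if proxy.poll() is not None:
            return SKIP  # the proxy exited straight away, so nothing to test
        test = subprocess.run(test_cmd)
        return GOOD if test.returncode == 0 else BAD
    finally:
        # Always stop the proxy so the next bisect step starts clean.
        if proxy.poll() is None:
            proxy.terminate()
        proxy.wait()
```

A wrapper script would end with something like `sys.exit(bisect_step(["make"], ["./haproxy", "-db", "-f", "test.cfg"], ["python3", "repro.py"]))`, where the make invocation, `test.cfg` and `repro.py` are placeholders. It would then be driven with `git bisect start <bad-rev> <good-rev>` followed by `git bisect run python3 bisect_step.py`.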
That took us to this commit, first released in HAProxy 1.5-dev22:
Aha! Reading up on ‘http-tunnel’, we found that the behaviour of HAProxy has changed significantly. With a keep-alive connection, HAProxy 1.4 would only process the first request, and would then leave the TCP connection open between client and server (i.e. no load balancing of later requests). In 1.5, HAProxy by default processes every request on a keep-alive connection, so each one can be load balanced separately.
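To make the difference concrete, here is a toy model in Python. It is not how HAProxy is implemented; it assumes simple round-robin balancing, and the backend names `solr1` and `solr2` are made up. In "tunnel" mode a backend is chosen once per connection, while in "keep-alive" mode one is chosen for every request.

```python
from itertools import cycle


def route(requests_per_connection, backends, mode):
    """Return the backend chosen for each request, in order.

    requests_per_connection: one entry per client connection, giving the
    number of requests sent over that connection.
    """
    picker = cycle(backends)  # round-robin over the backends
    routed = []
    for count in requests_per_connection:
        if mode == "tunnel":
            # Only the first request is balanced; the rest follow it.
            server = next(picker)
            routed.extend([server] * count)
        elif mode == "keep-alive":
            # Every request is balanced on its own.
            routed.extend(next(picker) for _ in range(count))
        else:
            raise ValueError(f"unknown mode: {mode}")
    return routed


print(route([3], ["solr1", "solr2"], "tunnel"))      # ['solr1', 'solr1', 'solr1']
print(route([3], ["solr1", "solr2"], "keep-alive"))  # ['solr1', 'solr2', 'solr1']
```

In the tunnel case, a long-lived client connection pins all of its traffic to a single backend, which is why that mode gives no load balancing beyond the first request.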
To work around this, for the moment we added this line to our haproxy.cfg, to restore the default 1.4 behaviour:
There is still the small matter of investigating why Tomcat returns the error code, and whether we can see performance benefits from turning on the “real” keep-alive mode. But for the moment, we have reduced the risk of this server migration through careful planning and testing.