The load-balanced capture effect (rachelbythebay.com)
85 points by luu on Feb 17, 2015 | 25 comments


Anyone looking for more ideas on statistical anomaly detection should check out this talk:

https://www.usenix.org/conference/lisa14/conference-program/...

Specifically, the section about 20 minutes in, where he talks about K-S windowing.


This is an interesting talk, thanks for sharing!

For those not already familiar with K-S tests, I'll save you a Google query: http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test
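For anyone who wants to see the idea rather than the formula: the two-sample K-S statistic is just the maximum gap between the empirical CDFs of two samples. A minimal pure-Python sketch (all sample values invented; a node that "fails fast" shows up as a latency distribution far from the pool's):

```python
# Illustrative sketch of the two-sample Kolmogorov-Smirnov statistic:
# the maximum distance between the empirical CDFs of two samples.
# Latency numbers below are made up.

def ks_statistic(a, b):
    """Max |ECDF_a(x) - ECDF_b(x)| over all observed points."""
    a, b = sorted(a), sorted(b)

    def ecdf(sample, x):
        return sum(1 for v in sample if v <= x) / len(sample)

    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in a + b)

pool = [0.21, 0.25, 0.22, 0.28, 0.24, 0.26, 0.23, 0.27]     # healthy backends
suspect = [0.02, 0.03, 0.02, 0.04, 0.03, 0.02, 0.03, 0.02]  # failing fast

print(ks_statistic(pool, suspect))  # 1.0: the distributions don't overlap at all
```

A real deployment would use a library implementation (e.g. SciPy's `ks_2samp`) to get p-values, and compute the statistic over sliding windows as the talk describes.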


AWS uses health checks to solve this problem. When one of your load-balanced instances fails enough health checks within a fixed-size window, AWS automatically takes the instance offline. It works pretty well.

http://docs.aws.amazon.com/ElasticLoadBalancing/latest/Devel...


Pretty much all load balancers have health checks - active where they reach out to each server, or passive where they observe the responses of existing requests if they can.

One of the issues is making your active health check more like a doctor's physical than "'tis but a scratch" self-reporting. But also ensuring you're not dealing with a whole bunch of hypochondriacs.

Passive health checks at least have the property that they fail servers when the servers are unable to serve, even if the active health check does not consider some subsystem in its response. But alone they can easily be fooled by really fast non-error responses.

Anyway, saying "such-and-such brand of load balancer solves this problem" only covers the most basic cases. General solutions are at best the first step of the full solution. You need to think about the edges - which I suspect is what Rachel is advocating.


Cloudwatch does what you're referring to as well. It's more of a basic server monitoring system that happens to integrate with the load balancer.

You get a set of basic VM-level metrics, and you can feed it custom metrics from your app or log files, all of which can be configured to alarm. I don't think it's possible to run advanced statistics on the metrics for alarming (e.g., standard deviation over a 30-minute window exceeds N), but it may be. Usually it's just an event count, like more than N 500 errors over X time.

I do agree you need to think deeper than basic health checks though, 'broken server' is always a hard boolean to nail down.


It doesn't solve the problem where one machine 500's 5% of the time while all the others 500 only 0.5% of the time.

Health checks also don't work if your health check ping path is healthy but some other part of your app is broken.


Why doesn't it solve that problem?

If the load balancer is monitoring this, it's quite capable of knowing what the average number of 500's returned is (knowing it automatically, via monitoring over time), and noticing that one machine is returning 500's at 10x the rate every other machine is, and then reducing it in the rotation.

No? It wouldn't surprise me if this led to yet _other_ undesired side effects in certain conditions, though - I'd expect the same of the solution the OP proposes, or of anything automated like this.
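The comparison described above - flag the machine whose error rate is a large multiple of the fleet's typical rate - can be sketched in a few lines. This is a hypothetical illustration (the threshold and floor are invented); it uses the median rather than the mean so that the broken node doesn't inflate its own baseline:

```python
# Hypothetical sketch: flag backends whose 5xx rate is a large multiple
# of the fleet's typical (median) rate. Thresholds are invented.
import statistics

def find_outliers(error_rates, multiple=5.0, floor=0.001):
    """error_rates: {backend: fraction of responses that were 5xx}."""
    # Median, not mean: a single bad node would drag the mean up and
    # hide itself. The floor avoids flapping when the fleet is clean.
    baseline = max(statistics.median(error_rates.values()), floor)
    return [b for b, r in error_rates.items() if r > multiple * baseline]

rates = {"web1": 0.005, "web2": 0.004, "web3": 0.05, "web4": 0.006}
print(find_outliers(rates))  # ['web3']
```

A real load balancer would also have to decide how much to reduce the flagged node's weight rather than simply dropping it, and over what window the rates are computed.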


Yep, you can configure the number of failed requests to make it unhealthy, and number of successful ones to mark it healthy, also the timeout, and the polling time (e.g. every 30 seconds).

But as others also pointed out, you can still get some "captured" servers until the LB realizes the machine is not super fast but simply superbad.

Also, your health check URL might return 200 while the rest of your app returns 404 / 500, which makes me think:

Perhaps the LB should be aware that if it gets back way more 404s and 500s than average for known URLs, the server should be considered bad. I assume advanced LBs support that?
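The threshold behaviour described above - N failed probes to mark a server unhealthy, M successful probes to bring it back - amounts to a tiny state machine. A rough sketch (the thresholds of 3 and 2 are made up for illustration):

```python
# Sketch of threshold-based health marking: fail_after consecutive bad
# probes mark a backend unhealthy; recover_after consecutive good probes
# restore it. Thresholds are illustrative, not from any real LB.

class HealthTracker:
    def __init__(self, fail_after=3, recover_after=2):
        self.fail_after, self.recover_after = fail_after, recover_after
        self.healthy, self.streak = True, 0

    def record(self, probe_ok):
        # A probe that agrees with the current state resets the streak.
        if probe_ok == self.healthy:
            self.streak = 0
            return self.healthy
        self.streak += 1
        threshold = self.recover_after if probe_ok else self.fail_after
        if self.streak >= threshold:
            self.healthy, self.streak = probe_ok, 0
        return self.healthy

t = HealthTracker()
results = [t.record(ok) for ok in (False, False, False, True, True)]
print(results)  # [True, True, False, False, True]
```

Note how the server survives two failed probes but not three, and needs two good probes to return - exactly the "fall"/"rise" style knobs the comment describes.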


It is not always practical to create checks for all possible errors. E.g., a broken node can return 404 for URL-X and 200 for all other used URLs, and URL-X can be "hot" at that time. I think it is not bad to monitor response time for backends and flag an alert if one node has a response time significantly lower than the mean among all nodes (with comparable hardware).
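That response-time check can be sketched with a leave-one-out comparison: judge each node against the mean of its peers, so a single outlier can't distort its own baseline. All numbers and the 0.5 ratio are invented for illustration:

```python
# Hypothetical sketch: flag nodes whose mean latency is far *below* that
# of their peers - a suspiciously fast node is often one failing fast.
import statistics

def suspiciously_fast(node_means, ratio=0.5):
    """Flag nodes whose mean latency is under ratio x the mean of the others."""
    flagged = []
    for node, m in node_means.items():
        others = [v for n, v in node_means.items() if n != node]
        if m < ratio * statistics.mean(others):
            flagged.append(node)
    return flagged

latency_ms = {"a": 240, "b": 255, "c": 248, "d": 30}  # d 404s instantly
print(suspiciously_fast(latency_ms))  # ['d']
```

This is the inverse of the usual slow-node alert, which is what makes the capture effect sneaky: the broken node looks like the best performer.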


Production load balancers don't treat an HTTP response of 5xx as "success" and therefore won't continue to send traffic there. They also have periodic sanity checks of dynamic content which must match certain criteria or the host gets flagged. Monitoring systems also keep track of various periodic low/high/averages, tail access and error logs, alert on unusual criteria, and can trim the hosts in the scoreboard in extreme cases.

You'd typically learn this after probably six months of running a large-scale continuously-deployed dynamic website and it breaks from poorly-tested configuration changes + hardware issues. Sysadmins know this stuff. That's why there is a job title of sysadmin and not "developer who does sysadmin stuff sometimes".


5xx may be the correct response - sometimes the server is asked to do something valid (i.e., not a 4xx, where the client screwed up), but had an error when it tried it.

No load balancer I know of will remove a web server that returns a single 5xx from its healthy pool. It will need to use some heuristic as Rachel points out, some percentage that is based on the statistical norm. Otherwise it'll fail out too many hosts and cause a problem.

I think you're severely underselling developers. I've met people who have only had the Software Engineer title and never installed a Linux distribution who get this stuff at least as well as anyone who calls themself a systems administrator. Sysadmins don't have a monopoly on understanding failure cases and failure handling - false positives, false negatives, outliers and outlier detection, metrics to look at. It's a skill that comes from experience, which can happen (or not) whatever you call yourself or others call you.

I'm lucky enough that I get to focus on this type of problem, and while there's definitely an aptitude portion, it is also a teachable skill - I see my job partly as getting the team I'm working with to be able to do this stuff when I'm not around. That usually means finding two or more people in the team and cultivating their interest in it.


If your sanity check in your load balancer passes only on a 200, it will fail on a 500, disable the host, and keep retrying until it gets a 200 again. It helps for there to be more than one single request to try in your sanity check.

For "random" requests, if you have a 500 response, requests of the same "type" should no longer be sent to that host. This can be changed based on scoreboard settings. Depending on the context, you may choose to serve cached content on 500s. This is one of the reasons multiple layers of cache and application intelligence are so handy.

I'm not underselling anything. Domain-specific knowledge comes with experience. If you ask a mechanical engineer 'What's wrong with my car if it makes the noise "bang-sputz-sputz-screech-screech-screech?"', the engineer will start making you lists of what parts can make each of those noises and begin cross-referencing to see maybe in what conditions a combination of those might happen. The mechanic will immediately tell you that for your 1991 Mercury Sable, the A/F mixture is off, the MAF sensor needs cleaning, the radiator has a crack and the accessory belt needs replacing. Sysadmin is a trade, not a skill.


Okay, if we're only considering active health checks, then I'm not sure any load balancer considers a 5xx a success by default, let alone a "production" load balancer.

For a non-healthcheck 5xx response, it is almost never clear that the host itself is responsible for the 5xx. 5xx is the correct response when there is an error on the server side (i.e., not an error on the client side, and not a success), but it doesn't mean the server is the problem - it just means the server experienced a problem serving the request. That failure may come from one of many RPCs the server made to other services. In that case, all web servers behind the load balancer would exhibit the 5xx response for that request type (at some rate, depending on any connection sharing/reuse between the servers and their upstream service), and all would subsequently be removed - which isn't the correct response at all.

As someone who has had the job title "Systems Administrator" and the job title "Software Engineer", and currently has neither but still does exactly what he's always done - solving problems by understanding systems and, among other things, by writing code - I wouldn't consider load balancing and failure domains/types/handling as the sole or even primary purview of a systems administrator - especially in the case of large installations.


There's different kinds of load balancers, and as such different responses to different criteria. If you don't want to serve 500 error pages to all your users, one of your load balancers (or "proxy layers", for more or less intelligent forms of load balancers) should be doing something when you're getting 500s, like moving traffic around, or serving different content. It's far too common for 500s to be due to a machine-specific or network-specific problem to just assume they'll resolve themselves or are unresolvable.


I agree with your first paragraph, but not so much with your second. This is one reason why people pay good money for production load balancers. This is also why people run independent scripts to retrieve test pages on their production servers, and monitor in general.

This is not a solved problem, but it's also not a novel problem. It's a problem that I've faced before, and it requires decent monitoring to catch.


Corollary: Steve Yegge's comment in his Platforms Rant (http://steverant.pen.io/) that monitoring and QA are the same thing.


This piece is the gift that keeps on giving


One way to alleviate this problem is to treat all failures as having a fixed cost equal to an expensive successful request. E.g. treat all >= 400 HTTP status codes as having taken 500ms. This works well even if there's a stable stream of faulty requests, since it'll affect all backends equally.
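A toy sketch of the fixed-cost idea: when picking the backend with the lowest moving-average latency, charge every error response a flat 0.5 s regardless of how quickly it actually failed. The smoothing factor and all latencies are invented for illustration:

```python
# Sketch: least-average-latency balancing where errors are charged a
# fixed cost, so a fast-failing backend cannot look attractively quick.

ERROR_COST = 0.5   # seconds, the flat penalty suggested above
ALPHA = 0.2        # EWMA smoothing factor (invented)

class Balancer:
    def __init__(self, backends):
        self.avg = {b: 0.0 for b in backends}

    def pick(self):
        # Route to the backend with the lowest smoothed cost.
        return min(self.avg, key=self.avg.get)

    def record(self, backend, latency, ok):
        cost = latency if ok else max(latency, ERROR_COST)
        self.avg[backend] += ALPHA * (cost - self.avg[backend])

b = Balancer(["web1", "web2"])
# web2 fails in 10 ms; without the penalty it would win every pick.
for _ in range(20):
    b.record("web1", 0.25, ok=True)
    b.record("web2", 0.01, ok=False)
print(b.pick())  # web1: web2's cheap failures were charged 0.5 s each
```

Because the penalty is the same for every backend, a fleet-wide stream of faulty requests raises everyone's average equally and doesn't single anyone out, which is the property the comment relies on.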


Reminds me of the differential gear in cars and what happens when one wheel is up in the air with no traction.


> That is, it doesn't attempt to do any work, and instead just throws back a HTTP 5xx error

I'm surprised the OP doesn't suggest having your load balancer pay attention to returned error codes.

If the load balancer knows what the average rate of 500 or non-200 responses is, and one unit is returning non-successful responses at way above that average rate, it would make sense for the load balancer to back off sending to that machine - but still send an occasional request there, so it can notice when/if its error rate returns to normal.

Do any load balancers work like this?
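What's being asked for is close to a circuit breaker's "half-open" state: a degraded backend gets only a trickle of probe traffic until it proves itself again. A toy sketch, with all names and numbers invented:

```python
# Hypothetical sketch of "back off but keep probing": a degraded backend
# receives only a small fraction of traffic until its probes succeed.
import random

class ProbedBackend:
    def __init__(self, probe_fraction=0.01):
        self.degraded = False
        self.probe_fraction = probe_fraction

    def should_send(self):
        if not self.degraded:
            return True
        # While degraded, let through only an occasional probe request.
        return random.random() < self.probe_fraction

    def record(self, ok):
        # Real implementations would use error rates or streaks rather
        # than flipping state on a single sample.
        self.degraded = not ok
```

The probe traffic is what lets the balancer distinguish "still broken" from "recovered" without a human re-enabling the host.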


Not all load balancers support this (I recall some very expensive Alteons that didn't), but worse is when the LB does support it but isn't configured to check. Out of the box, most LBs will do a TCP connect and, if that works, fire away.

The reverse of this issue can also cause problems, where you get a bad node that accepts the request but never returns (or takes 2-3 minutes to respond). As a rule, the timeouts will cause queuing, thread-pool starvation and general breakage all the way back up your request chain to whatever is facing the internet, where your site will either hang or give back a 503 page.


Some applications could have to serve fast requests (maybe ajax calls to a json api) and slow requests (full page rendering) from the same servers. In this case the technique from the post should be tweaked by creating two classes of requests with different averages and variances. A server should be taken out from the pool only if it doesn't belong to any of the classes.
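The per-class idea above can be sketched by keeping separate statistics per request class and judging each response only against the class it belongs to. A hypothetical illustration (class names, latencies, and the 3-sigma threshold are all invented):

```python
# Sketch: per-request-class latency statistics, so a slow page render
# isn't judged against the distribution of fast API calls.
import statistics
from collections import defaultdict

class ClassedStats:
    def __init__(self):
        self.samples = defaultdict(list)  # request class -> latencies

    def record(self, req_class, latency):
        self.samples[req_class].append(latency)

    def is_outlier(self, req_class, latency, sigmas=3.0):
        s = self.samples[req_class]
        if len(s) < 2:
            return False  # not enough data to judge
        return abs(latency - statistics.mean(s)) > sigmas * statistics.pstdev(s)

stats = ClassedStats()
for t in (0.02, 0.03, 0.025, 0.022, 0.028):
    stats.record("api", t)
for t in (0.9, 1.1, 1.0, 0.95, 1.05):
    stats.record("page", t)

# 0.9 s is normal for a page render but wildly off for the API class.
print(stats.is_outlier("page", 0.9), stats.is_outlier("api", 0.9))  # False True
```

Per the comment, a server would then only be drained if it looks anomalous within every class it serves, not just in the pooled distribution.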


I think HAProxy can check that the response code for HTTP backends is actually 200, and will mark the backend down if it gets a 500 or 400.

Just goes to show you that thinking only of the happy path can often lead you astray.


HAProxy absolutely can support an HTTP response code health check, but in my experience out of the box it just makes sure the port (say 80) is open. I learned this once the hard way and will never make that mistake again... ;)
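For reference, the HTTP-level check has to be enabled explicitly; with only `check` on a server line, HAProxy falls back to the bare TCP connect described above. A minimal sketch of the relevant backend stanza (paths, addresses, and threshold values are illustrative):

```
backend web
    option httpchk GET /healthcheck
    http-check expect status 200
    server web1 10.0.0.1:80 check inter 5s fall 3 rise 2
    server web2 10.0.0.2:80 check inter 5s fall 3 rise 2
```

Here `fall 3` / `rise 2` are the consecutive-failure and consecutive-success thresholds discussed elsewhere in this thread, and `http-check expect` is what turns a 5xx health-check response into a down mark.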


The article is an interesting mental exercise in using statistical analysis to identify issues in infrastructure, but the configurations that are assumed to be in use demonstrate an exceptional lack of understanding around modern L4-7 load-balancing solutions, whether hardware or software. 99% of the 'issues' in the article are solved by features that are present in nearly every COTS or open-source solution - they just require knowledgeable people to configure and tune them for the system in question.

More than a few real-world front-end implementations lack the kind of rigorous instrumentation that's necessary to identify problems of the kinds mentioned in the article. While proper configuration and experienced ops folks tuning said infrastructure can solve most of the stated problems, very fine-grained monitoring is sometimes the only thing that allows for effective troubleshooting when things really go L-shaped.



