Automated to Death (2009)

munificent · on July 1, 2016

There's a weird vibe in this article I don't like. It (correctly) notes that as the number of anomalies that the automation passes on to the human operators goes down, the rate that humans successfully handle them also goes down.

But it doesn't seem to do a good job of clarifying that the total number of incorrectly handled anomalies is still decreasing. Let's say your automation goes from handling 90% to 99% of the anomalies and that when it does handle one, it does so correctly. We'll say that the increased rarity of human interaction and the inattention and weakened training that causes makes the human pilots to go from being able to handle 90% of them correctly to only a terrifying 40%.

Let's run the simulation. With the old automation:

    1000 anomalies occur
     100 (10%) make it past the automation
      10 (10%) make it past the human operators

So 10 catastrophes. Now with the new moderately better automation and much worse human performance:

    1000 anomalies occur
      10 (1%) make it past the automation
       6 (60%) make it past the human operators

6 catastrophes. Even though the human performance was much worse, because they are the last stage in the pipeline, it has a lower effect.

Now, I just pulled these numbers out of my ass, but I think it's important to focus on the total number of automation+human failures and not single out one stage or the other. From the passenger's perspective, they don't care who saved their ass, just that it got saved. If we can make one stage more failure proof at the expense of the other, it can still be a net win.

galdosdi · on July 1, 2016

Also, there are things organizations can do to combat the problem of human operators getting rusty. Practice. A lot of organizations just don't do it though because it's too tempting to view the automation's cost savings as "free" and just take them for granted, but it can help a lot.

thaumasiotes · on July 1, 2016

People get rusty for a reason. Your solution suffers from a couple of problems:

- Practice isn't the same thing as actual events. Being good at practice is more likely to diverge from being good at crisis response as actual crises become rare, because the criterion of matching what would happen in a crisis gets much less important. Thus, peacetime militaries often need radical overhauling before they can really get much accomplished when war breaks out. (Also consider - we have a lot of people who are really interested in how medieval combat (whether in a battle line or a duel) worked, what effective use of weapons looked like, and so on. But for all the discussion, we don't know, and we can't know unless we actually stage regular battles-to-the-death with period tooling. One form of combat practice, however, has been preserved as European fencing. How closely does it correspond? Again, we don't know, but consensus is "not well".)

- The automation's cost savings are free -- in fact, in this example, they have a large negative cost, cutting catastrophes by 40%. Keeping everyone in shape to handle crises they're likely to never actually see is, arguably, an enormous waste of money. (In addition to actually being impossible much of the time, as in my first bullet point.)

stcredzero · on July 1, 2016

Practice isn't the same thing as actual events. Being good at practice is more likely to diverge from being good at crisis response as actual crises become rare

Had the crew of Asiana Flight 214 been properly trained in manual landing, lives could have been saved. Practice isn't the same as actual events, but it does produce measurable and valuable results.

You can verify this for yourself with something as mundane as giving a presentation.

vorotato · on July 1, 2016

on empty legs (flights back with no passengers) you could disable automation as protocol.

r00fus · on July 1, 2016

A very insightful idea. You could also measure this by ensuring pilots have a certain number of hours logged without automation (practice by pudding).

Are there analogies in other industries to this approach?

JoshTriplett · on July 2, 2016

> Are there analogies in other industries to this approach?

https://github.com/Netflix/SimianArmy/wiki/Chaos-Monkey

ChoHag · on July 2, 2016

Never pulled the plug on your UPS to make sure you can or ripped out half the raid array without warning?

galdosdi · on July 1, 2016

- Yeah, practice is definitely not the same, but depending on the domain, it's usually at least a lot better than nothing, even if the real thing is a lot better than practice. (To speak to your example, at least in wartime, peacetime militaries don't have to worry about reteaching their soldiers EVERYTHING. Maybe they've never been in real combat, but at least they can consistently shoot at a target. That's better than not being able to do that either.)

- I mean, it depends on the specific case, right? Obviously that's true in this example. You can also come up with opposing examples. I also misspoke a bit -- even if the automation cost savings are free and you're strictly better off with it than without it, adding some practice for human operators may get you even more savings, and it's often overlooked.

My point is just that practice/drilling can be a useful tool in the toolbox. It depends on the situation, but it shouldn't be ignored.

thaumasiotes · on July 1, 2016

> To speak to your example, at least in wartime, peacetime militaries don't have to worry about reteaching their soldiers EVERYTHING. Maybe they've never been in real combat, but at least they can consistently shoot at a target.

My understanding, which is only very loosely informed, is that while soldiers get retrained, peacetime generals, who make the decisions, usually need to be replaced. Crisis response, depending on the crisis, will vary in how much it demands one or the other skillset -- but I think someone whose job it is to oversee safety systems, when in practice (1) the safety systems almost never fail, so that (2) the job consists mainly of convincing politicians that they should feel good about what you're doing, can be reasonably closely analogized to a peacetime general. Making a contingency plan that sounds good to an audience with no experience is a different skill than making a contingency plan that effectively addresses the problem.

Practice definitely can be a useful tool. But just as there are situations where it will help, there are plenty of situations where it won't. When events are all drill and no reality, mission drift in the drill is inevitable. Sometimes a pound of prevention is worth an ounce of cure.

undergrowth54 · on July 1, 2016

First responders practiced for handling scenarios like the Boston bombings for years before 2013. It paid off.

thaumasiotes · on July 1, 2016

Initial responses to ebola in the US tended to be pretty badly bungled because essentially nobody was trained for handling very dangerous, highly contagious disease. Should they have been? Going back how long? Should they be now?

ohnomrbill · on July 1, 2016

"Keeping everyone in shape to handle crises they're likely to never actually see is, arguably, an enormous waste of money."

This point, while true, depends heavily on the consequences of the crisis. If the cost of failure is too high, it's not a waste of money.

yummyfajitas · on July 2, 2016

Even though the human performance was much worse, because they are the last stage in the pipeline, it has a lower effect.

Minor mathematical nit: it's not because of which stage of the pipeline you are in. Multiplication is commutative.

Jtsummers · on July 1, 2016

  However, when the second accelerometer failed, a latent software
  anomaly allowed inputs from the first faulty accelerometer to be
  used, resulting in the erroneous feed of acceleration information
  into the flight control systems. The anomaly, which lay hidden for a
  decade, wasn’t found in testing because the ADIRU’s designers had
  
  never considered that such an event might occur.
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

An important lesson. Assumptions can kill.

xanderstrike · on July 1, 2016

Here's my theory. The engineers assumed it would never occur because they fail rarely and if one failed you'd replace the unit.

Then it got into the hands of the airlines and they said, "You mean it'll run with one failed accelerometer? Then we don't need to replace it when one fails."

Jtsummers · on July 1, 2016

If they were even aware that it'd run with one failed accelerometer.

Aircraft maintenance outside the US military (Air Force in particular) terrifies me. While it's not perfect, the USAF essentially rebuilds most of its fleet every X years. There's a reason they're flying planes, successfully, from 60 years [ago]. [EDIT: Missing a word.]

The airline industry does not do this sort of depot level work. Instead, they discover a crack, the put on a piece of sheet aluminum and "patch" it. Wash, rinse, repeat. 10 years later the aircraft is still flying but at greatly reduced fuel efficiency because, like an overweight, middle-aged man, it's gotten a bit of extra weight on it all the damn time.

This is what they do for structural maintenance. I don't even want to imagine what happens with electronics and other subsystems. They're literally willing to cost themselves millions of dollars a year in extra fuel consumption (across their fleet), rather than spend the money to do real maintenance on it.

mahyarm · on July 1, 2016

So what does the USAF do when they find a crack? If they do rebuilds it sounds like it takes a lot more time to do the same fix.

Jtsummers · on July 1, 2016

They monitor and patch it but eventually the plane gets stripped down to basically the frame and rebuilt.

Their patches are treated as temporary patches. The airline industry has a bad habit of making them effectively permanent.

ArkyBeagle · on July 1, 2016

Somehow, "never considered" seems a different thing than "assumed it would never happen".

Jtsummers · on July 1, 2016

My reading is that the assumption was: There will never be two faulty sensors in flight.

ArkyBeagle · on July 1, 2016

Understood. It's just ambiguous. Being nauseatingly pedantic :), it just seems that the probability of not thinking of it is higher than thinking of it and dismissing it.

I just have a failure-fetish :) What ya gonna do?

Jtsummers · on July 1, 2016

That's fair. And see my comment in the parallel thread. In that situation, I think the original tester just never conceived of the possibility that there would be a failure with reporting of fires when two systems which functioned individually were run together. So the procedures didn't have the situation defined in the test plan. It was in beefing up the test procedures that the error was discovered (and others).

ArkyBeagle · on July 2, 2016

"The original tester" was probably working from a Byzantine morass of requirements.

Jtsummers · on July 2, 2016

Surprisingly not. I got to read all the original documents. They were unimpressive in every way.

I had the pleasure of putting everything into a proper requirements tracking database (a thing that company actually did get right by the time I got there, moved away from word and excel based document systems). It was all pretty straightforward and a very simple system in the scheme of things. Fundamentally the sensors all worked fine. There was a box which collected all the sensor data and that was what was failing.

marcosdumay · on July 1, 2016

That's going in the direction of proving software correctness. What, for software that interacts with the real world, and must take real word context into account stops being viable really fast.

Jtsummers · on July 1, 2016

Yes, we should go in that direction on anything safety critical.

This is a real world situation that I encountered and led to me leaving the company, they did correct it after I left, however:

Fire detection system in multiple areas of a plane. If area A reported a fire before area B, then area B's report would be ignored. Reverse the situation and area A's report is ignored. This system, in particular, discovered fire/overheat early so that corrective action could be taken. By not making the second report it virtually guaranteed a fatal consequence should the (admittedly) very low probability event occur.

Also in this situation, like the aircraft in the article, there were a very small number of areas and sensors. Full, automated testing was entirely possible of literally every combination of sensor fault condition, ordering, and delays (to simulate "faulted" hardware, that is a sensor that reports a fire but isn't backed up by the other 2 sensors in its area). I calculated it at the time, the full series of automated tests could have been executed in a week, entirely reasonable given the lives we were responsible for.

However, we were not given a budget for this (until after I left), and so we were stuck with manual testing. Which drove the costs and time to an extreme amount (also made some testing effectively impossible, "flip this switch, flip this other switch after 3 but before 3.5 seconds"). The entire setup was a fucking joke. And lives depended on it.

On top of this, the software was a complete clusterfuck of shared variables used as temporaries across multiple subroutines, values getting erased (probably causing the problem that I discovered) when stepping through the piss-poor attempt at a state machine.

When hundreds of lives can be lost by the absence of a single, sensible, test, I'd gladly accept an extra couple million up front on a project, rather than deal with the emotional consequence of knowing my responsibility in their deaths, paying out millions to their widows and orphans, and possible civil and criminal consequences for the PEs involved.

marcosdumay · on July 1, 2016

That's still depends entirely on the number of variables you must take into account.

You are ok with an extra million of up-front cost, and that's reasonable. With a little bit of added complexity, that cost reaches an extra billion. Is it still ok? With a little more, and it's now a trillion.

This is a really hard problem.

Jtsummers · on July 1, 2016

And that's where risk analysis comes in. But modeling the system, to a certain reasonable scale (read: feasible with respect to time), and sane programming practices can seriously mitigate the errors these systems have.

I'd be willing to wager that the faulty system discussed in my quote had a set of variables like this at the top:

  int t1;
  int t2;
  int t3;

Which are all used by various functions later on, obliterating the value the others set. And someone forgot that they were supposed to be temporary. I'd even put $100 on it (I'm poor, ok?).

pipio21 · on July 1, 2016

I don't really understand. If one accelerometer fails for years and nothing is reported it is a big failure on the design team, probably with criminal responsibility.

Depending on GPS for main navigation is also very bad idea.

In the near future with the cost of today one fiber optic gyro and accelerometer you will be able to buy ten. Software will improve and redundancy like it has done tremendously in the past making airplanes the safest of transports precisely because it does not depend so much on fallible humans that get tired and need to rest,pee and other biological necessities, have ego(that blinds their judgments) or get in love with the air hostess,get bored(some flying could bore you to tears) have problems of vision or hearing with age, get distracted(and lose situational awareness) or ill or intoxicated by food.

It is easy to forget that death was what we had when humans were in charge. We are talking about thousands of times more dangerous than today. So the title is yellow sensationalistic garbage.

The only reason humans have not been completely replaced is because people naturally trust other people more than machines, landing on side winds automatically requires engineers taking responsibility for it(and nobody had sone so, so far), and someone needs to be in charge in the plane at all time(for example what to do if a person have an stroke).

scotty79 · on July 2, 2016

If you need operator to be ready to take over then allow him to play a game with the vehicle he drives. Give him points for how close his attempts at controlling vehicle are to what the software that actually contols the vehicle does. This way if emergency that can be handled by automation arrives he can train for it without risk but when you need to hand over the control to him he'll be ready and aware and do his best.

userbinator · on July 1, 2016

It is notable that this article was published only months after AF447[1] which crashed also due to pilots' lack of experience in flying without automation.

[1] https://en.wikipedia.org/wiki/Air_France_Flight_447

ChoHag · on July 2, 2016

> “People, after all, are the backup systems, and they aren’t being exercised.”

If it's not tested, you don't have a backup.

exar0815 · on July 1, 2016

Every automated System needs a very well trained and calm operator when it all goes south. Thats the difference between e.g. Chernobyl and Fukushima/Three Mile Island. While very bad accidents, the last fail-save, the humans, didnt fuck it completely and spectacularly in the latter two cases.

throwanem · on July 1, 2016

Chernobyl isn't a very good example of automation failure; responsibility for that disaster lies entirely with human beings from start to finish. Wikipedia's summary is solid, and rather than excerpt it here I'll just point you at https://en.wikipedia.org/wiki/Chernobyl_disaster#Accident .

exar0815 · on July 1, 2016

Yeah, my comment doesnt make that much sense, reading it again. That was just an example how badly trained personnel can make any accident worse, justifying a very well trained operator for crucial automated systems.

dougk16 · on July 1, 2016

"Every automated System needs a very well trained and calm operator when it all goes south."

I think that's sort of the paradox discussed in the article, or at least one that comes to my mind. The more automation you have in aggregate in society the less you'll have trained humans capable of responding when it goes south. Even individuals that keep up with regulated training and certification requirements will get lazy if 20 years pass without an incident. But then the flipside is that hey 20 years passed without incident. It will be interesting to see where the equilibrium is reached, especially considering that that one-in-twenty-year incident by a machine will probably be weighted just as high in public perception as 20 years of human screw ups.

digi_owl · on July 1, 2016

Brings to mind Burke's Connections series. Specifically the first episode, where he presents all the technology needed to power New York City.

Damn it, ever so often i sit in near awe that i can be typing this message and expect it to reach the server etc. The number of wires and circuits that need to work properly for that to happen is staggering.

I_HALF_CATS · on July 1, 2016

Also check out the podcast by "99% Invisible"

http://99percentinvisible.org/episode/children-of-the-magent...

outworlder · on July 2, 2016

> As the plane passed 39 000 feet, the stall and overspeed warning indicators came on simultaneously—something that’s supposed to be impossible, and a situation the crew is not trained to handle.

But it is not impossible at all! That's called the "coffin corner". All flight crews are aware of it.

mschuster91 · on July 1, 2016

For this reason, U-Bahn (subway) drivers in Munich have to randomly drive under signalling (i.e. total manual control), while "normal" operation is that the computer handles everything from acceleration over cruise to stopping at the station.

S-Bahn (in Munich) is fully manual, too, but augmented.

scotty79 · on July 2, 2016

> drive under signalling

How would that help if signalling failed?

edem · on July 1, 2016

This is almost exactly the same as the Law of leaking abstractions don't you think?

lunchTime42 · on July 1, 2016

Could the decay of abilitys be avoided if the supervisors where kept in constant uncertainty wether the system is working?

Thriptic · on July 2, 2016

Possibly but it would probably lead to people disregarding the data they are being provided with, effectively removing a lot of the benefits of the automation.