Hacker News
Mistakes teams new to Chaos Engineering make (dadontherunblog.com)
52 points by tylopoda on June 21, 2018 | hide | past | favorite | 11 comments


For those not previously familiar with the concept: https://principlesofchaos.org/


It's not going to take over the world.

What this is is a form of testing, applying boring old requirements to operations. The whole idea of "chaos" is just a really sad tell about the state of planning at some places.

You should be able to articulate your reliability requirements in certain situations and then verify your stack meets them before a release. And if it's cost prohibitive to do that in a non-prod environment, then plan for testing in production as part of your pre-release planning.

It's silly; it's like classifying a test engineer as an "exploratory tester", which would clearly be a mistake. That is just one kind of activity an engineer performs, not an engineering role. This is just exploratory load and scalability testing, and it falls under the responsibility of a test or devops team.


I think the "chaos" is more akin to the emergent properties of complex systems, where disturbances in one component can propagate in unexpected ways.

"What would happen if..." experimentation doesn't negate the need for requirements. It seeks to actively test whether or not those requirements were adequate in the first place. It also recognizes that no system is static - almost all distributed systems are evolved over long periods of time.
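To make that concrete, here's a minimal sketch (all names and numbers hypothetical, not from any real tool) of what "testing whether the requirement was adequate" looks like: state a steady-state hypothesis up front, inject a fault, and check whether the stated requirement still holds.

```python
import random

# Hypothetical sketch of a steady-state hypothesis check.
# Assumed requirement: 99% of requests succeed, even under a fault.

def call_service(fault_injected: bool) -> bool:
    """Stand-in for a real request; fails more often when a fault is active."""
    failure_rate = 0.05 if fault_injected else 0.001
    return random.random() > failure_rate

def success_rate(n: int, fault_injected: bool) -> float:
    ok = sum(call_service(fault_injected) for _ in range(n))
    return ok / n

baseline = success_rate(10_000, fault_injected=False)
degraded = success_rate(10_000, fault_injected=True)

REQUIREMENT = 0.99  # articulated before the experiment, not after
print(f"baseline={baseline:.3f} degraded={degraded:.3f}")
print("requirement holds under fault:", degraded >= REQUIREMENT)
```

The point isn't the fault injection itself; it's that the experiment is framed as a falsifiable check of a pre-stated requirement, rather than "let's break stuff and see."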

No matter how careful your planning, shit sometimes breaks in the real world in ways you didn't, or could not reasonably have, anticipated.


Totally agree. There’s a lot of literature and hacks on what to break, degrade, etc. Hardly anything on the fact that you have to have good performance and fault-tolerance requirements - which come with workload and architectural constraints. Also hardly anything on good statistical methods for comparing metrics, or on which metrics to compare.
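As a rough illustration of the statistical-comparison gap (synthetic data, hypothetical tolerance - not any particular methodology): even a basic step up from eyeballing dashboards is to compare a tail-latency metric between a control group and an experiment group against a pre-agreed tolerance.

```python
import random
import statistics

# Hypothetical sketch: compare p99 latency between control and experiment
# groups, against a tolerance agreed on before the experiment.

def p99(samples):
    # quantiles(n=100) returns 99 cut points; the last is the 99th percentile
    return statistics.quantiles(samples, n=100)[-1]

random.seed(7)
control    = [random.lognormvariate(3.0, 0.3) for _ in range(5_000)]  # ms
experiment = [random.lognormvariate(3.2, 0.3) for _ in range(5_000)]  # ms

delta = p99(experiment) - p99(control)
print(f"p99 control={p99(control):.1f}ms experiment={p99(experiment):.1f}ms")

TOLERANCE_MS = 5.0  # part of the requirement, not an afterthought
print("within tolerance:", delta <= TOLERANCE_MS)
```

Which percentile, how many samples, and what tolerance counts as "degraded" are exactly the requirement questions the literature tends to skip.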


Chaos experiments are costly to implement relative to unit and integration tests, and so are a poor choice for detecting issues that unit or integration tests could catch. Instead, chaos experiments should be designed to find the truly hidden flaws that only surface in real-world usage, with real user traffic and production environment configuration.

This is in case you didn't quite articulate or verify your real requirements. The more confident you are that your test environment matches your prod environment, the less you need this.


I'm not being unrealistic - functional requirements are hardly ever complete, but they're almost always in better shape than scalability and capacity planning.

Call me old fashioned, but I don't like messing around with production. I think you can reasonably scale down production environments and simulate WAN packet loss/latency. I can't imagine a platform of moderate importance not having some sort of benchmarking and scalability testing as part of its pre-prod pipeline's automated acceptance criteria. It feels like this work would fall under that engineering role (again, a performance test engineer or devops/SRE).

Anyway, I think I chafe at the name more than anything. I know nonlinear dynamics, stress testing, queuing theory, et al., and it just feels overly glib for what should be the most serious sort of activity a company undertakes.


If I may be pedantic for one moment for the sake of clarity:

> The more confident you are that your test environment matches your prod environment, the less you need this

I'd also add "workload" alongside "environment", which is often challenging to accurately simulate.


That's great, until your service scales to the point where it's not possible to simulate prod in test.


Chaos testing can be part of SRE practice; it's mostly about the reliability of systems. I am not sure why someone in the comments is confusing it with functional testing or load testing.


The most interesting chaos experiments are the ones that show you gaps in your observability: when you know you've broken something and find you can't see it. The teams I've been on have learned that different kinds of failure call for different sorts of observability. Also, with complex systems, you need to learn how to operate your observability itself. Designing or testing resilience without being able to measure its effectiveness is a time sink. Make sure you can see failure first.
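A toy sketch of that "see failure first" principle (everything here is hypothetical - a stand-in metrics object, not any real backend): the experiment's first assertion is about visibility, not resilience.

```python
# Hypothetical sketch: before testing resilience, verify the injected
# failure is actually visible in your metrics. If you break something on
# purpose and no signal appears, fix observability first.

class Metrics:
    """Stand-in for a real metrics backend."""
    def __init__(self):
        self.error_count = 0

    def record_error(self):
        self.error_count += 1

def handle_request(metrics: Metrics, fault_injected: bool) -> str:
    if fault_injected:
        metrics.record_error()  # a real service might silently fail to record this
        return "error"
    return "ok"

metrics = Metrics()
for _ in range(100):
    handle_request(metrics, fault_injected=True)

# Visibility check comes before any resilience conclusion.
alert_fired = metrics.error_count > 0
print("injected failures are observable:", alert_fired)
```

In a real system the interesting case is when this check fails: 100 injected errors and a flat dashboard is itself the most valuable experiment result.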


I struggled to parse that title. A more verbose, paraphrased version:

"A list of the mistakes that teams make if they are new to the concept of Chaos Engineering"



