Specifically, it's about testing the chaos that is present in production systems. When we talk about chaos we don’t just mean failures and faults, but also expected (and often unpredictable) events such as sudden spikes in traffic.
We believe it’s essential to invest time in proving out these scenarios; no one wants to be woken up at 3am, under extreme pressure, to diagnose and fix a downed service. Engineers need experience handling failure and recovery scenarios, and that means practising while support is available! It’s unreasonable to deem a system resilient when its recovery paths are both unknown and unpractised. Wouldn’t it be better to cause the fault in a more controlled manner during working hours, when everyone is on hand to help?
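Causing a fault in a controlled manner can be as simple as wrapping a dependency call so that it fails at a configurable rate. Here’s a minimal sketch in Python - the names (`chaos`, `InjectedFault`, `fetch_profile`) are purely illustrative and not from any particular chaos tool:

```python
import random


class InjectedFault(Exception):
    """Raised in place of a real dependency failure."""


def chaos(failure_rate, rng=None):
    """Wrap a callable so it fails at the given rate, simulating a flaky dependency."""
    rng = rng or random.Random()

    def decorator(fn):
        def wrapper(*args, **kwargs):
            if rng.random() < failure_rate:
                raise InjectedFault(f"injected failure in {fn.__name__}")
            return fn(*args, **kwargs)
        return wrapper
    return decorator


# Seeding the RNG makes an experiment repeatable - useful when practising recovery.
@chaos(failure_rate=0.25, rng=random.Random(42))
def fetch_profile(user_id):
    return {"id": user_id}
```

Because the injection point and rate are explicit, the "3am incident" becomes a scheduled exercise you can dial up or down, and switch off entirely.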
Chaos testing is relatively immature when it comes to frameworks, and especially recognised standards. This is despite some of the big players (e.g. Google, Netflix) having performed chaos testing in one form or another for up to 10 years. That said, things are starting to gather momentum as the need for distributed systems increases, with more and more companies recognising the need to invest in reliability up front. It’s somewhat disappointing that so many businesses ignore reliability in favour of features until their systems go down and they realise the true cost - unfortunately, by then it’s too late!
Some advice when trying to gain adoption internally:
"Never let a good crisis go to waste"
― Winston S. Churchill
Netflix is widely recognised as a leader in this field, with several open source tools to help cause chaos. However, causing chaos is just the beginning: how do we distinguish correct behaviour from a newly discovered issue? The service crashing is an obvious failure, but what about slow responses? At what point is slow considered a failure? Monitoring is key here, and we all know how simple monitoring is, right…? It’s also important to note that there are different perspectives on what "correct behaviour" means - to a dev, crashing a node is perfectly logical; to a client, if that crash impacts them, then it might have a business cost!
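One way to make "at what point is slow a failure?" concrete is to agree a latency threshold and an error budget up front, then classify observed response times against them. A minimal sketch, with entirely hypothetical numbers - real values would come from your service’s agreed SLOs:

```python
# Hypothetical thresholds - agree these with clients before running experiments.
LATENCY_SLO_MS = 500   # responses slower than this breach the SLO
ERROR_BUDGET = 0.01    # at most 1% of responses may breach it


def classify(samples_ms, slo_ms=LATENCY_SLO_MS, budget=ERROR_BUDGET):
    """Treat a batch of response times (ms) as a failure if too many breach the SLO."""
    breaches = sum(1 for ms in samples_ms if ms > slo_ms)
    breach_rate = breaches / len(samples_ms)
    return "fail" if breach_rate > budget else "ok"
```

The point isn’t the arithmetic - it’s that "slow" only becomes a pass/fail signal once both sides have agreed where the line is.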
The point I’m trying to make here is that doing this properly requires serious ongoing investment - it’s not just a 3-point story we can cram in at the end of a project.
Chaos Community Day
I recently attended the second Chaos Community Day in Seattle. This small “conference” was organised by Casey Rosenthal (@caseyrosenthal) of Netflix, and hosted by Amazon. I’m not going to write a logbook of the day - I know others will be doing that. However, I was amazed and really impressed, not only by the community feel of the event (which I don’t find all too common at conferences), but also by the number of big players who are really getting involved in chaos testing - and who shared their thoughts and experiences with everyone.
Here at Cake we’ve been investing some serious time and effort of late in various forms of chaos testing, for both our in-development and production applications. The enthusiasm shown by everyone at Chaos Community Day totally reinforces our thinking - not to mention the results (less downtime, improved reliability, ...). This is absolutely an invaluable part of the development of a new distributed system, and not just an afterthought post-incident.
Quote of the day goes to Nora Jones (@nora_js) from Jet.com:
"Introducing Chaos is not the best way to meet your new colleagues, though it is the fastest."
For a “live blog” of the day’s events, check out the following Twitter hashtag: https://twitter.com/hashtag/chaoseng
Watch this space for more about what we’re working on internally at Cake - we’ve got some exciting open source stuff to share with you all very soon.
Last but not least, I’d like to draw attention to a few must-read papers and blogs on the subject - some of which were discussed at length on the day and indicate a really exciting future for this discipline:
Introducing Chaos Engineering by Bruce Wong (Netflix)
Principles of Chaos Engineering
Inside Azure Search: Chaos Engineering by Heather Nakama
FIT: Failure Injection Testing by Naresh Gopalani (Netflix)
Chaos Engineering Upgraded by Casey Rosenthal (Netflix)
Automated Failure Testing by Kolton Andrus & Ben Schmaus (Netflix)
Lineage-driven Fault Injection by Peter Alvaro et al. (UC Berkeley)
Automating Failure Testing Research at Internet Scale by Peter Alvaro et al. (UC Santa Cruz, Gremlin & Netflix)
The Eight Fallacies of Distributed Computing by Peter Deutsch
Fallacies of Distributed Computing Explained by Arnon Rotem-Gal-Oz
The Network is Reliable by Peter Bailis & Kyle Kingsbury