Rebranding: “Chaos Engineering” as “Resilience Testing”
Last week, I had the opportunity to hear Aaron Rinehart brief on “Chaos Engineering”. Aaron is the Chief Enterprise Security Architect at UnitedHealth Group. His job is to break things. And by break things, I don’t mean innovate, as in “move fast and break things”, or break into things, as in penetration testing, but actually break systems. It’s a cool job.
UnitedHealth is a beast of a company. Cobbled together through dozens of acquisitions, it has 270,000 employees and $200 billion in revenue. Also worth noting, when seemingly every other health insurance company got owned in 2015, UnitedHealth dodged the bullet. Maybe they weren’t targeted. Maybe the breach wasn’t discovered. Maybe they got lucky. Or maybe there is something to this approach.
As networks and applications (and their interactions) grow more complex and interdependent, the likelihood of cascading (and therefore devastating) failure also increases. Purposefully causing these failures in a deliberate and controlled fashion is possibly the only rational way to persist in the face of both system error and malicious activity.
As Rinehart explained Chaos Engineering, the idea is to purposefully crash parts of the network or systems and see how the network or system (or, more importantly, the people) respond. Initiating chaos in a controlled fashion and at a time when it can be managed is far better than letting technology take its course and setting off everyone’s pagers at 3 a.m. on Christmas morning.
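To make the idea concrete, here is a minimal sketch of a controlled failure experiment in Python. It is a toy model, not anything Rinehart presented: the `Replica` class, the `call_with_failover` helper, and the `GET /health` request are all hypothetical names invented for illustration. The experiment takes one instance down on purpose, checks whether the service as a whole still answers, and always rolls the fault back.

```python
import random


class Replica:
    """A hypothetical service instance that can be up or down."""

    def __init__(self, name):
        self.name = name
        self.up = True

    def handle(self, request):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


def call_with_failover(replicas, request):
    """Try each replica in turn and return the first successful response."""
    for replica in replicas:
        try:
            return replica.handle(request)
        except ConnectionError:
            continue
    raise RuntimeError("all replicas failed")


def run_experiment(replicas, rng):
    """Crash one replica on purpose, then check that the service still answers."""
    victim = rng.choice(replicas)
    victim.up = False  # the injected failure
    try:
        response = call_with_failover(replicas, "GET /health")
        survived = True
    except RuntimeError:
        response = None
        survived = False
    finally:
        victim.up = True  # always roll the fault back
    return victim.name, survived, response


replicas = [Replica("a"), Replica("b"), Replica("c")]
victim, survived, response = run_experiment(replicas, random.Random(0))
print(f"killed {victim}; service survived: {survived}")
```

With three replicas the service should survive the loss of any one of them. Run the same experiment against a single replica and it fails, which is exactly the kind of weakness you would rather find on a Tuesday afternoon than at 3 a.m.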
In Rinehart’s own words, Security Chaos Engineering is, “…the discipline of instrumentation, identification, and remediation of failure within security controls through proactive experimentation to build confidence in the system’s ability to defend against malicious conditions in production.”
Rinehart lists out seven principles of chaos engineering:
- Security capabilities must be end-to-end instrumented.
- Security must be continuously instrumented to build confidence in the system’s ability to withstand malicious conditions.
- Readiness of a system’s security defenses must be proactively assessed to ensure they are battle-ready and operating as intended.
- The security capability toolchain must be instrumented from end to end to drive new insights into not only the effectiveness of the functionality within the toolchain but also to discover where added value and improvement can be injected.
- Practiced instrumentation seeks to identify, detect, and remediate failures in security controls.
- The focus is on vulnerability and failure identification, not failure management.
- The operational effectiveness of incident management is sharpened.
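One way to picture these principles is a small experiment against a security control rather than a server. The Python sketch below is an assumption-laden toy, not Rinehart's tooling: `FirewallRule`, `detect_open_rules`, and the injected port 2222 rule are hypothetical. It injects a known-bad condition into a copy of the configuration and checks whether the detective control actually notices it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FirewallRule:
    port: int
    source: str  # CIDR block allowed to reach the port


def detect_open_rules(rules):
    """Hypothetical detective control: flag rules open to the whole internet."""
    return [rule for rule in rules if rule.source == "0.0.0.0/0"]


def run_security_experiment(rules, detector):
    """Inject a known-bad rule and report whether the detector catches it."""
    injected = FirewallRule(port=2222, source="0.0.0.0/0")
    trial = rules + [injected]  # a new list, so the baseline is untouched
    alerts = detector(trial)
    return injected in alerts


baseline = [FirewallRule(port=443, source="10.0.0.0/8")]
detected = run_security_experiment(baseline, detect_open_rules)
print(f"injected misconfiguration detected: {detected}")
```

If the detector comes back empty, the experiment has found a failed control before an attacker did, which is the point of focusing on failure identification rather than failure management.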
Chaos Engineering sounds like a scary idea. In actuality, these principles align well with the gospel of resilience that is well developed in other disciplines. Northeastern University President Joseph Aoun defines resilience as 1) the ability to adapt to changing conditions; 2) the ability to bounce back from stress and trauma; and 3) the ability to reinvent ourselves and return stronger than before. Or, as Judith Rodin puts it in The Resilience Dividend, resilience is “the capacity… to pre-emptively prepare for sudden disruptions that were unpredicted, to recover from them and then to take advantage of new opportunities produced by the disruption for further growth and expansion.”
The goal with Chaos Engineering isn’t, of course, to engineer chaos but to engineer out chaos by identifying where and how things can go wrong and taking mitigating actions to reduce harm and speed recovery. That is the essence of resilience as a concept.
Kolton Andrus, CEO of “chaos-as-a-service” startup Gremlin, has pivoted the way he talks about what Gremlin does to “Resilience-as-a-Service.”
It’s safe to predict that Chaos Engineering won’t find its way into the Glossary of the next version of the NIST Profile. Resilience Testing, though (same concept, more boring), should have its own sub-category.