Applying chaos engineering principles to community resilience
by John Plodinec, Community and Regional Resilience Institute
October 9, 2018
Put a Little Chaos in Your Community
As I’ve mentioned before, I’m an inveterate reader (Motto: no cereal box is safe from me). One of the things I like to do is apply ideas I’ve read about from one field to another completely different one. Sometimes, it’s a square peg in a round hole; but sometimes…
Recently I stumbled across an article about chaos engineering. This is a computer science technique aimed at increasing the robustness and resilience of complex distributed software systems. As I dug into it, I began to see its potential relevance to communities and their resilience.
Chaos engineering doesn’t just look at the software; it considers the entire system: software, hardware and people. This may involve multiple programs, run on many different servers, with input from people or from other programs. Periodically, user demand leads to the addition of new features, but with the expectation that the system will remain reliable. And yet, even if each individual program is operating “correctly,” sometimes the system produces unreliable output. In general, the root cause of these problems is most often faulty communication among the different parts of the distributed system.
Communities are also complex distributed systems. Just within a city government there are many “programs” – the policies and procedures of its bureaucracies – that are continuously interacting. When they are in sync, we receive the services we expect, but when they aren’t… And then when you add in all of the other entities in a community – businesses, NGOs, neighborhoods, families, individuals – you have the quintessential complex distributed system!
Chaos engineering is focused on determining the reliability of the system’s output. Off-normal real-world occurrences (some big, some small) are introduced more or less at random into the system’s operations, and the output is monitored. Sometimes this is done on the actual systems in use, sometimes on shadow systems that simulate those in production.
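For readers who want to see what "introducing occurrences at random and monitoring the output" can look like in software, here is a minimal sketch in Python. Everything in it is a made-up illustration rather than any real chaos engineering tool: `fetch_reading` stands in for an upstream service, `report` for a downstream one that is supposed to degrade gracefully, and the fault rate and trial count are arbitrary assumptions.

```python
import random


def fetch_reading(fail=False):
    """Hypothetical upstream service: returns a reading, or raises when a fault is injected."""
    if fail:
        raise ConnectionError("injected fault: upstream unavailable")
    return 42.0


def report(reading_source):
    """Hypothetical downstream service: returns None (a degraded answer) instead of crashing."""
    try:
        return reading_source()
    except ConnectionError:
        return None


def run_experiment(trials=1000, fault_rate=0.1, seed=0):
    """Inject faults at random and count how often the output was degraded."""
    rng = random.Random(seed)  # seeded so the experiment can be repeated
    injected = 0
    degraded = 0
    for _ in range(trials):
        fail = rng.random() < fault_rate  # decide at random whether to inject a fault
        if fail:
            injected += 1
        result = report(lambda: fetch_reading(fail=fail))
        if result is None:
            degraded += 1
    return injected, degraded


if __name__ == "__main__":
    injected, degraded = run_experiment()
    print(f"faults injected: {injected}, degraded outputs: {degraded}")
```

In this toy setup every injected fault shows up as a degraded (but not crashed) output, so the two counts match. The interesting cases in a real system are the ones where they do not: a fault that produces no visible symptom, or a failure that appears without any injected fault.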
Chaos engineering’s principles and techniques are not very different from those used by emergency managers: exercise emergency systems with simulated disasters, inject one or more unexpected conditions into the exercise, monitor the outcomes, fix any problems. What is different is how systematically they are applied in practice, and not just in emergency situations. The aim is to find problems and fix them before they’re encountered in practice.
We can systematically apply chaos engineering’s principles to our community systems, even – especially – those in routine use.
Know how we’re connected. I am continually amazed at how oblivious we often are to the “goes-inta’s” and the “comes outta’s” of our community systems. Whether it’s a social, economic or infrastructural system, there are inputs and outputs. With a little effort, most community leaders will recognize the connections within their community; seldom are the connections beyond the community recognized.
One of my favorite examples of this is poultry production in North Carolina. Each year, the state’s farmers produce almost a billion birds for meat. When hit with the double whammy of Hurricanes Dennis and Floyd in 1999, the state’s poultry producers awoke to the unpleasant reality that they were all heavily dependent on a single feed source whose facilities were damaged by the heavy flooding. This led to some diversification of suppliers as well as an insistence on business continuity plans as a contract requirement.
Require regular testing of every system, including its inputs and outputs. One system’s output at the edges of its acceptable range may cause bad things to happen when it is the input to another system. Any changes to one system (e.g., to make it more efficient) should be constrained by what follow-on systems can tolerate.
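The hand-off problem described above can be checked quite mechanically once each system's ranges are written down. The following is a minimal Python sketch under stated assumptions: the `Range` type, the `check_handoff` function and the water-pressure numbers are all hypothetical, invented only to show the idea of comparing one system's output range against what the next system can tolerate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    """A closed interval of acceptable values (units are whatever the two systems share)."""
    low: float
    high: float


def check_handoff(upstream_output, downstream_tolerance):
    """Return the parts of the upstream output range that the downstream system cannot accept."""
    gaps = []
    if upstream_output.low < downstream_tolerance.low:
        # Upstream can go lower than downstream tolerates.
        gaps.append(Range(upstream_output.low,
                          min(upstream_output.high, downstream_tolerance.low)))
    if upstream_output.high > downstream_tolerance.high:
        # Upstream can go higher than downstream tolerates.
        gaps.append(Range(max(upstream_output.low, downstream_tolerance.high),
                          upstream_output.high))
    return gaps


if __name__ == "__main__":
    # Hypothetical numbers: a pumping station may deliver 40-90 psi,
    # while the neighborhood mains are rated for 30-80 psi.
    pump_output = Range(40, 90)
    mains_rating = Range(30, 80)
    print(check_handoff(pump_output, mains_rating))
```

An empty list means every acceptable upstream output is also an acceptable downstream input. Anything else marks the edge cases worth testing first, and re-running the check after a change to either system is a cheap way to see whether the change stays within what follow-on systems can tolerate.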
Regular testing need not be a special event held annually. Communities often use collegiate athletic events as an opportunity to test how well emergency management, transportation and other systems work together.
Introduce a little chaos into the testing. Emergency managers know that injects – introducing unexpected oops’es – often yield the most illuminating test results. And the testing doesn’t have to be “live” to be valuable. For example, President Nunez (Eastern Connecticut State University) starts each of her staff meetings by looking at an incident that happened somewhere else and asking, “What if it happened here? What would we do? What, if anything, do we need to change?”
Change is the one constant for our communities. Yet we expect – often, demand – that the services we’ve come to rely on are available whenever we want them. Taking a chaos engineering-type approach seems to be a great way to find and fix potential problems before we encounter them in real life. Thus, chaos engineering can be a path to more resilient communities.
Editor’s note: This blog was originally published by the Community & Regional Resilience Institute (CARRI) and has been reposted with permission from the author. To learn more about CARRI, visit www.resilientus.org