103. Chaos Engineering

Code[ish]

Content provided by Salesforce Engineering. All podcast content including episodes, graphics, and podcast descriptions are uploaded and provided directly by Salesforce Engineering or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here https://player.fm/legal.

3+ y ago

MP3

Rick Newman interviews Mikolaj Pawlikowski, who recently wrote a book called "Chaos Engineering: Crash test your applications." The theory behind chaos engineering is to "break things on purpose" in your operational flow. You want to deliberately inject failures that might occur in production ahead of time, in order to anticipate them, and thus implement workarounds and corrections. Typically, this practice is often used for large, distributed systems, because of the many points of failure, but it can be useful in any architecture.

One of the obstacles to embracing chaos engineering is finding high level approval from other teammates, or even managers. Even after the feature is a complete and the unit tests are passing, it can be difficult to convince someone that some resiliency work needs to continue, because there's no visible or tangible benefit to preparing for a disaster. Mikolaj suggests that people clearly lay out to other colleagues the ways a system can fail, and the impact it can have on the application or business. Rather than try to fear monger, it can be useful to point to other companies' availability issues as words of caution for their teams to embrace. Mikolaj also says that chaos engineering doesn't need to focus solely on complicated problems like race conditions across distributed systems. Often, there's enough low hanging fruit, such as disk space running out or an API timing out, that can be useful to consider fixing.

The chaos engineering mindset can also extend beyond pure software. If you think about people working across different timezones as a distributed system, you can also optimize for failures in communication before they occur. Everyone works at a different pace, and communication issues can be analogous to a network loss. Rather than fix miscommunications after they occur, establishing shared practices (like writing down every meeting, or setting up playbooks) can go a long way to ensuring that everyone will be able to do their best under changing circumstances.

Links from this episode

Mikolaj's book is called Chaos Engineering: Crash test your applications -- get a 40% discount using the code podish19
powerfulseal is a testing tool for Kubernetes clusters
Mikolaj distributes the Chaos Engineering Newsletter
Conf42 is a conference focusing on high-level computer science
ChaosConf is the world’s largest Chaos Engineering event
Awesome Chaos Engineering is a curated list of Chaos Engineering resources

132 episodes

#Programming #Software Development #Tech #Chris Castles