You call it storm drills. Netflix calls it chaos engineering. Both are the same idea: deliberately inject controlled failures to prove your system can handle them. Utilities have been doing this for decades on physical infrastructure. They just haven't formalized it the way tech companies have.
Netflix shipped Chaos Monkey in 2010 — a tool that randomly killed virtual machines in production to force teams to design for resilience. The idea was simple and counterintuitive: if you want a system that survives failure, practice failing. The chaos-engineering tools market reached about $6 billion in 2024 (SkyQuest Technology) and is projected to hit $40 billion by 2033. A Forrester Consulting study found 245% ROI on resilience investments that included chaos engineering. Organizations running formal chaos programs commonly report 40-60% reductions in mean time to repair, with some reaching 70% when combined with automated remediation.
Those numbers come from software. The principles apply directly to physical infrastructure, and utilities already do more chaos engineering than they realize.
§ 01 Five principles, grid-translated
The discipline rests on five principles, originally codified by Netflix engineers and now widely adopted. Each one has a direct analog in grid operations.
Build a hypothesis around steady state
Define what "normal" looks like before you break anything. For a grid, that means baseline voltage profiles, frequency stability, feeder load balance, SCADA response times. If you don't know what normal looks like, you can't measure how your system responds to stress.
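A steady-state hypothesis is just measurable bounds plus a check. A minimal sketch in Python; the metric names and bands here are illustrative assumptions, not standards:

```python
from dataclasses import dataclass

@dataclass
class SteadyStateBound:
    """One measurable dimension of 'normal', with tolerances."""
    name: str
    low: float
    high: float

    def holds(self, value: float) -> bool:
        return self.low <= value <= self.high

# Illustrative baseline bands (assumed values, not a published standard):
HYPOTHESIS = [
    SteadyStateBound("feeder_voltage_pu", 0.95, 1.05),
    SteadyStateBound("frequency_hz", 59.95, 60.05),
    SteadyStateBound("scada_poll_latency_s", 0.0, 2.0),
]

def steady_state_ok(telemetry: dict) -> bool:
    """True only if every tracked metric sits inside its baseline band."""
    return all(b.holds(telemetry[b.name]) for b in HYPOTHESIS)

print(steady_state_ok({"feeder_voltage_pu": 1.01,
                       "frequency_hz": 60.00,
                       "scada_poll_latency_s": 1.2}))  # inside all bands: True
```

The point of writing it down as code is that the hypothesis becomes something an experiment can pass or fail, not an operator's gut feel.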
Vary real-world events
Test with realistic failures, not contrived ones. A transformer doesn't fail in a vacuum. It fails during a heat wave, when load is at peak and a nearby generating unit just tripped offline. Good chaos experiments layer conditions the way real incidents do.
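Layering conditions can be mechanical: take one base fault and cross it with the stressors that accompany real incidents. A toy enumeration, with invented condition names:

```python
import itertools

# A component failure rarely arrives alone. Cross one base fault with
# the ambient, load, and concurrent-event conditions real incidents bring.
ambient = ["normal", "heat_wave"]
load = ["off_peak", "peak"]
concurrent = ["none", "nearby_unit_trip"]

scenarios = [
    {"ambient": a, "load": l, "concurrent": c, "fault": "transformer_failure"}
    for a, l, c in itertools.product(ambient, load, concurrent)
]

# The worst-case layering from the text: heat wave, peak load,
# nearby unit trip, then the transformer fails.
print(len(scenarios))  # 8 layered variants of the same base fault
```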
Run experiments in production
Staging environments never perfectly replicate production. The most valuable learning comes from testing the actual system under actual conditions. For grids, that means moving beyond tabletop exercises toward controlled experiments on live infrastructure, with appropriate safeguards.
Automate experiments to run continuously
One-off tests find point-in-time issues. Continuous, automated testing finds regressions. Every firmware update, every new DER interconnection, every seasonal load change can introduce new failure modes. Continuous testing catches them before customers do.
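Continuous testing means the suite re-runs on every system change, not on a calendar. A minimal sketch of that trigger loop; the trigger names and experiment names are assumptions, and a real runner would inject the failure and check the steady-state hypothesis:

```python
# Changes from the text that can introduce new failure modes.
TRIGGERS = {"firmware_update", "new_der_interconnection", "seasonal_load_change"}

def run_experiment(name: str) -> bool:
    """Placeholder: a real runner would inject the named failure and
    verify the steady-state hypothesis still holds. Here it just passes."""
    return True

def on_change(event: str, suite: list) -> dict:
    """Re-run the whole chaos suite whenever the system changes,
    so regressions are caught before customers see them."""
    if event not in TRIGGERS:
        return {}
    return {exp: run_experiment(exp) for exp in suite}

results = on_change("firmware_update",
                    ["der_comms_loss", "scada_latency", "relay_misop"])
print(results)
```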
Minimize blast radius
Start small. Contain the damage. Every experiment needs an abort trigger and a rollback plan. In grid ops that means starting in digital twins, graduating to microgrids, and only then touching production feeders. Never test on infrastructure you can't isolate.
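The abort-and-rollback discipline can be expressed as a wrapper that every experiment runs inside. A sketch under assumed names; a real abort check would poll live telemetry rather than a callback:

```python
def run_with_abort(inject, abort_condition, rollback):
    """Run a chaos injection, watch an abort condition, and always
    roll back, whether the experiment completes, aborts, or crashes."""
    try:
        inject()
        if abort_condition():
            return "aborted"
        return "completed"
    finally:
        rollback()  # rollback runs no matter how the experiment ends

log = []
status = run_with_abort(
    inject=lambda: log.append("opened sectionalizer on test microgrid"),
    abort_condition=lambda: False,  # e.g. feeder voltage left its band
    rollback=lambda: log.append("restored switch state"),
)
print(status, log)
```

Putting rollback in the `finally` block is the code-level version of "never test on infrastructure you can't isolate": the exit path exists before the experiment starts.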
§ 02 What utilities already do
The argument that chaos engineering is foreign to the utility industry doesn't survive contact with reality. Utilities already perform controlled failure testing across multiple domains. They just call it something else.
| Utility practice | Chaos engineering equivalent |
|---|---|
| Planned maintenance outages | Controlled downtime injection |
| Storm drills | Large-scale failure simulation |
| Live line work | Extreme controlled risk |
| Protection relay testing | Fault injection |
| Black start exercises | Total system recovery testing |
| FLISR (fault location, isolation, and service restoration) during actual faults | Production chaos (unplanned) |
Each of these practices embodies one or more of the five principles. Protection relay testing is fault injection with a measured hypothesis. Storm drills are large-scale failure simulations that vary real-world events. Black start exercises are total system recovery testing, the grid equivalent of Netflix pulling the plug on an entire AWS region.
Live line work deserves special attention because it says something about the industry's actual risk tolerance. Lineworkers maintain energized equipment at voltages up to 1,150 kV using conductive suits that function as Faraday cages. They work within the electric field, not insulated from it but immersed in it. The practice relies on rigorous procedure, continuous monitoring, and well-understood physics. If the utility industry were truly risk-averse, live line work wouldn't exist. Utilities already accept controlled risk when it's managed properly. The question is whether they apply the same disciplined approach to their increasingly complex cyber-physical systems.
§ 03 What's missing
Existing utility testing practices are valuable but incomplete. Five specific gaps separate ad-hoc testing from formal chaos engineering.
| Gap | Current state | What formal chaos adds |
|---|---|---|
| Automation | Mostly manual tests | Automated, continuous testing |
| Production testing | Usually simulated | Safe production experiments |
| Metrics | Qualitative assessments | Quantified hypotheses and results |
| Cross-system | Siloed testing | End-to-end system experiments |
| Randomization | Predictable schedules | Randomized, unannounced tests |
The siloed-testing gap is particularly consequential. Protection engineers test relays. SCADA teams test communication links. DER groups test inverter behavior. Nobody tests what happens when a relay misoperates because SCADA sent corrupted data during a DER ramp event. Real failures chain across boundaries. Testing should too. NERC's own Lessons Learned reports document cases where solar inverters tripped offline during grid frequency events, causing cascading generation loss no single-domain test would have predicted. The 2021 Texas crisis showed the same pattern at system scale: gas supply, generator weatherization, and grid dispatch failed together in ways no siloed test had ever examined.
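A cross-system experiment forces the chain to be written down end to end. A deliberately toy Python sketch of the relay-plus-SCADA-plus-DER example above; every name, threshold, and the sign-flip corruption model are invented for illustration:

```python
# Cross-boundary failures chain: corrupted SCADA data arrives during a
# DER ramp and a relay acts on it. No single-domain test exercises this.

def scada_reading(true_mw: float, corrupted: bool) -> float:
    """Toy corruption model: a sign flip on the reported power flow."""
    return true_mw * (-1 if corrupted else 1)

def relay_trips(reported_mw: float, reverse_power_limit: float = -5.0) -> bool:
    """Toy reverse-power element: trip if reported flow is too negative."""
    return reported_mw < reverse_power_limit

def experiment(der_ramp_mw: float, corrupted: bool) -> bool:
    """Does the relay misoperate when SCADA corrupts data mid-ramp?"""
    return relay_trips(scada_reading(der_ramp_mw, corrupted))

print(experiment(der_ramp_mw=20.0, corrupted=False))  # False: healthy, no trip
print(experiment(der_ramp_mw=20.0, corrupted=True))   # True: misoperation
```

The interesting output is the second line: each component behaves "correctly" by its own spec, and the failure only appears when the boundary is tested.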
§ 04 The four-phase safety framework
No reasonable person suggests starting chaos engineering by tripping a 345 kV transmission line during peak load. The path from concept to production follows a deliberate progression that increases exposure as confidence builds.
Phase 1: Digital twin chaos (months 1-6)
All experiments begin in simulation. Using digital twins of grid infrastructure, teams inject failures with zero physical risk. Candidate experiments include DER communications failures (what happens when 500 rooftop solar inverters lose their control signal simultaneously), SCADA degradation (how operators respond when telemetry updates slow from 2-second to 30-second intervals), cascading fault propagation (where does a breaker failure at Substation A create overloads), and cyber-attack scenarios (what a compromised relay at a critical bus actually does to system stability). Phase 1 builds the tooling, trains the teams, generates baseline data. Nothing touches real equipment.
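The DER communications experiment from the list reduces to a few lines in a toy twin. A sketch with invented numbers; a real digital twin would run a power-flow model, not arithmetic:

```python
# Phase 1 sketch: 500 rooftop inverters lose their control signal and
# hold their last setpoint; measure the net deviation from dispatch.
N = 500
commanded_kw = [4.0] * N      # what dispatch wants each inverter doing now
last_setpoint_kw = [5.0] * N  # stale value each inverter holds on comms loss

def feeder_error_kw(signal_lost: bool) -> float:
    actual = last_setpoint_kw if signal_lost else commanded_kw
    return sum(actual) - sum(commanded_kw)

print(feeder_error_kw(False))  # 0.0 — comms healthy, output tracks dispatch
print(feeder_error_kw(True))   # 500.0 — every inverter stuck 1 kW high
```

Even at this fidelity the experiment yields a quantified hypothesis: "simultaneous comms loss produces at most X kW of uncontrolled injection," which a later hardware-in-the-loop phase can check against real inverter controllers.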
Phase 2: Hardware-in-the-loop (months 6-12)
Real controllers connect to simulated grid models. Physical relays, RTUs, and inverter controllers respond to simulated conditions, including three-phase faults, voltage sags and swells, and frequency deviations. This phase catches firmware bugs, configuration errors, and interoperability issues pure simulation misses. The grid model is virtual, but the equipment responses are real.
Phase 3: Limited field experiments (months 12-18)
Testing moves to physical infrastructure under tight constraints. Experiments run at microgrid scale only, on systems that can be islanded from the bulk grid. All scenarios are pre-approved with documented abort criteria. Automatic protection stays active throughout. Tests run during low-load periods to minimize customer impact. This phase validates that digital twin predictions match physical reality. Discrepancies between simulation and field results become inputs for improving the models.
Phase 4: Production monitoring (ongoing)
Every real disturbance becomes a learning event. Instead of treating unplanned outages as problems to fix and forget, Phase 4 treats them as unplanned experiments to analyze. Rapid post-event analysis compares actual system behavior against model predictions. Continuous model validation keeps digital twins calibrated. Over time, the line between planned chaos experiments and operational monitoring blurs. Every event, planned or unplanned, feeds the same reliability improvement loop.
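The "continuous model validation" step is a residual check: did the twin track the real event? A minimal sketch; the tolerance and the sample traces are assumptions:

```python
# Phase 4 sketch: compare actual disturbance telemetry against the
# digital twin's prediction and flag when the model has drifted.

def max_residual(predicted: list, actual: list) -> float:
    """Largest pointwise gap between predicted and observed traces."""
    return max(abs(p - a) for p, a in zip(predicted, actual))

def model_calibrated(predicted, actual, tol_pu=0.02) -> bool:
    """True if the twin tracked the real event within tolerance."""
    return max_residual(predicted, actual) <= tol_pu

pred = [1.00, 0.97, 0.94, 0.98, 1.00]  # per-unit voltage, twin prediction
act  = [1.00, 0.96, 0.95, 0.98, 1.00]  # per-unit voltage, observed
print(model_calibrated(pred, act))     # within 0.02 pu of reality
```

When this check fails, the disturbance has done useful work: it has located exactly where the twin diverges from the physical system.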
Start in a digital twin. Graduate to hardware-in-the-loop. Move to microgrid scale. Only then touch production. Every step has an abort trigger.
§ 05 Governance scales with risk
Not every experiment needs executive sign-off, and not every experiment belongs in production. A risk-to-approval matrix keeps the program moving without bypassing oversight.
| Risk level | Experiment type | Environment | Approval |
|---|---|---|---|
| Low | Single-component failure | Digital twin | Team lead |
| Medium | Subsystem test | Hardware-in-the-loop | Department head |
| High | Microgrid scenario | Microgrid | VP Operations |
| Critical | Cross-system cascade | Microgrid / production-adjacent | C-suite + board |
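The matrix above is small enough to encode directly, which keeps approval routing auditable. A sketch with assumed role names; the fail-closed default is a design choice, not part of the matrix:

```python
# Route each experiment to the approver the risk matrix requires.
APPROVAL = {
    "low": "team_lead",
    "medium": "department_head",
    "high": "vp_operations",
    "critical": "c_suite_and_board",
}

def required_approver(risk: str) -> str:
    """Unknown or unrated risk levels fail closed: escalate to the top."""
    return APPROVAL.get(risk, "c_suite_and_board")

print(required_approver("medium"))   # department_head
print(required_approver("unrated"))  # c_suite_and_board — fail closed
```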
§ 06 The regulatory angle
Utility executives reasonably worry about PUC reactions to "experiments" on regulated infrastructure. Framed correctly, chaos engineering reduces regulatory risk rather than increasing it. Every state PUC tracks reliability metrics. Every NERC standard requires demonstrated capability. Every rate case depends on showing prudent investment decisions. Chaos engineering directly supports all three: verified SAIDI/SAIFI improvement through tested automation, demonstrated NERC CIP-009 recovery plan compliance, and provable ROI on reliability investments.
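Those reliability metrics are arithmetic, which is what makes the ROI provable. A sketch of the IEEE 1366 definitions with an invented outage log:

```python
# SAIDI, SAIFI, CAIDI per IEEE 1366: customer-minutes interrupted and
# customer interruptions, each divided by total customers served.
# The outage records below are invented for illustration.
outages = [
    {"customers": 1200, "minutes": 45},   # feeder lockout
    {"customers": 300,  "minutes": 120},  # storm-damaged lateral
]
customers_served = 50_000

saidi = sum(o["customers"] * o["minutes"] for o in outages) / customers_served
saifi = sum(o["customers"] for o in outages) / customers_served
caidi = saidi / saifi  # average minutes per interrupted customer

print(round(saidi, 2), round(saifi, 4), round(caidi, 1))
```

Run the same arithmetic before and after a chaos-validated automation rollout and the SAIDI delta is the number that goes in the rate case.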
The framing matters. You aren't gambling with the grid. You're proving it works under conditions you can control, so it keeps working under conditions you can't.
§ 07 Next in the series
Part 05 walks through how chaos-validated reliability improvements show up in IEEE 1366 indices — the SAIDI, SAIFI, and CAIDI numbers regulators actually track. SRE doesn't replace 1366; it turns those indices from lagging indicators into leading ones.
— Adam · adam@sgridworks.com · Mar 25, 2026