
Chaos engineering for the grid: breaking things safely on purpose.


Adam Brown · Author
13 min reading time
Published Mar 25, 2026
Part 4 of 7 · SRE for the Grid

You call it storm drills. Netflix calls it chaos engineering. Both are the same idea: deliberately inject controlled failures to prove your system can handle them. Utilities have been doing this for decades on physical infrastructure. They just haven't formalized it the way tech companies have.

Netflix shipped Chaos Monkey in 2010 — a tool that randomly killed virtual machines in production to force teams to design for resilience. The idea was simple and counterintuitive: if you want a system that survives failure, practice failing. The chaos-engineering tools market reached about $6 billion in 2024 (SkyQuest Technology) and is projected to hit $40 billion by 2033. A Forrester Consulting study found 245% ROI on resilience investments that included chaos engineering. Organizations running formal chaos programs commonly report 40-60% reductions in mean time to repair, with some reaching 70% when combined with automated remediation.

Those numbers come from software. The principles apply directly to physical infrastructure, and utilities already do more chaos engineering than they realize.

§ 01Five principles, grid-translated

The discipline rests on five principles, originally codified by Netflix engineers and now widely adopted. Each one has a direct analog in grid operations.

Build a hypothesis around steady state

Define what "normal" looks like before you break anything. For a grid, that means baseline voltage profiles, frequency stability, feeder load balance, SCADA response times. If you don't know what normal looks like, you can't measure how your system responds to stress.
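A steady-state hypothesis can be as simple as numeric bounds checked against live telemetry before and during an experiment. The sketch below is a minimal illustration; the metric names and bounds are invented for the example, and real values would come from your own baseline data.

```python
from dataclasses import dataclass

@dataclass
class SteadyStateBound:
    """An acceptable range for one baseline metric."""
    name: str
    low: float
    high: float

    def holds(self, value: float) -> bool:
        return self.low <= value <= self.high

# Illustrative bounds -- real values come from your own baseline profiling.
BASELINE = [
    SteadyStateBound("frequency_hz", 59.95, 60.05),
    SteadyStateBound("feeder_voltage_pu", 0.95, 1.05),
    SteadyStateBound("scada_poll_latency_s", 0.0, 2.0),
]

def steady_state_violations(telemetry: dict) -> list:
    """Return the names of any metrics outside their baseline bounds."""
    return [b.name for b in BASELINE if not b.holds(telemetry[b.name])]
```

An empty result means the hypothesis holds; anything else names exactly which part of "normal" broke under stress.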

Vary real-world events

Test with realistic failures, not contrived ones. A transformer doesn't fail in a vacuum. It fails during a heat wave, when load is at peak and a nearby generating unit just tripped offline. Good chaos experiments layer conditions the way real incidents do.

Run experiments in production

Staging environments never perfectly replicate production. The most valuable learning comes from testing the actual system under actual conditions. For grids, that means moving beyond tabletop exercises toward controlled experiments on live infrastructure, with appropriate safeguards.

Automate experiments to run continuously

One-off tests find point-in-time issues. Continuous, automated testing finds regressions. Every firmware update, every new DER interconnection, every seasonal load change can introduce new failure modes. Continuous testing catches them before customers do.
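A continuous chaos run is just a suite of (inject, verify) pairs executed on every trigger: a schedule, a firmware update, a config change. The sketch below simulates one such pass against a toy system state; the relay scenario and all function names are invented for illustration, not real tooling.

```python
import random

# Illustrative simulated system state; real tooling would act on actual assets.
state = {"primary_relay": True, "backup_relay_armed": True}

def inject_primary_relay_loss():
    state["primary_relay"] = False

def check_backup_covers():
    # Hypothesis: with the primary relay out, the armed backup still protects.
    return state["backup_relay_armed"]

def reset():
    state.update(primary_relay=True, backup_relay_armed=True)

# name -> (inject, check); every run resets, injects, then verifies.
EXPERIMENTS = {
    "primary_relay_loss": (inject_primary_relay_loss, check_backup_covers),
}

def run_suite(rng: random.Random) -> dict:
    """One continuous-testing pass: run every experiment in random order.
    In production this would fire on a schedule or on every firmware,
    interconnection, or seasonal-config change, not by hand."""
    names = list(EXPERIMENTS)
    rng.shuffle(names)
    results = {}
    for name in names:
        reset()
        inject, check = EXPERIMENTS[name]
        inject()
        results[name] = check()
    return results
```

A failed check on a run that passed last month is a regression caught before a customer outage reveals it.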

Minimize blast radius

Start small. Contain the damage. Every experiment needs an abort trigger and a rollback plan. In grid ops that means starting in digital twins, graduating to microgrids, and only then touching production feeders. Never test on infrastructure you can't isolate.
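The abort-and-rollback discipline can be expressed as a small harness: run each experiment step, check the abort trigger after every one, and guarantee the rollback runs on any failure. The feeder scenario below is a simulated illustration with invented names, not a real control interface.

```python
class AbortExperiment(Exception):
    """Raised when an abort trigger fires; the rollback must then run."""

def run_with_abort(steps, abort_trigger, rollback):
    """Execute experiment steps, checking the abort trigger after each.
    On abort (or any error), run the rollback and re-raise."""
    try:
        for step in steps:
            step()
            if abort_trigger():
                raise AbortExperiment("abort trigger fired")
    except Exception:
        rollback()
        raise

# Illustrative usage: open one simulated feeder, abort if voltage sags too far.
grid = {"feeder_A": "closed", "voltage_pu": 1.0}

def open_feeder():
    grid["feeder_A"] = "open"
    grid["voltage_pu"] = 0.91          # simulated sag past the abort threshold

def voltage_too_low():
    return grid["voltage_pu"] < 0.95

def restore():
    grid["feeder_A"] = "closed"
    grid["voltage_pu"] = 1.0
```

The design choice that matters: the rollback is not optional cleanup, it is the contract that makes the experiment safe to attempt at all.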

§ 02What utilities already do

The argument that chaos engineering is foreign to the utility industry doesn't survive contact with reality. Utilities already perform controlled failure testing across multiple domains. They just call it something else.

Utility practice | Chaos engineering equivalent
Planned maintenance outages | Controlled downtime injection
Storm drills | Large-scale failure simulation
Live line work | Extreme controlled risk
Protection relay testing | Fault injection
Black start exercises | Total system recovery testing
FLISR during actual faults | Production chaos (unplanned)
Fig. 01 · Six things every utility already does. Chaos engineering is the formal name for most of them.

Each of these practices embodies one or more of the five principles. Protection relay testing is fault injection with a measured hypothesis. Storm drills are large-scale failure simulations that vary real-world events. Black start exercises are total system recovery testing, the grid equivalent of Netflix pulling the plug on an entire AWS region.

Live line work deserves special attention because it says something about the industry's actual risk tolerance. Lineworkers maintain energized equipment at voltages up to 1,150 kV using conductive suits that function as Faraday cages. They work within the electric field, not insulated from it but immersed in it. The practice relies on rigorous procedure, continuous monitoring, and well-understood physics. If the utility industry were truly risk-averse, live line work wouldn't exist. Utilities already accept controlled risk when it's managed properly. The question is whether they apply the same disciplined approach to their increasingly complex cyber-physical systems.

§ 03What's missing

Existing utility testing practices are valuable but incomplete. Five specific gaps separate ad-hoc testing from formal chaos engineering.

Gap | Current state | What formal chaos adds
Automation | Mostly manual tests | Automated, continuous testing
Production testing | Usually simulated | Safe production experiments
Metrics | Qualitative assessments | Quantified hypotheses and results
Cross-system | Siloed testing | End-to-end system experiments
Randomization | Predictable schedules | Randomized, unannounced tests
Fig. 02 · Five gaps between storm drills and Chaos Monkey. Close them one at a time.

The siloed-testing gap is particularly consequential. Protection engineers test relays. SCADA teams test communication links. DER groups test inverter behavior. Nobody tests what happens when a relay misoperates because SCADA sent corrupted data during a DER ramp event. Real failures chain across boundaries. Testing should too. NERC's own Lessons Learned reports document cases where solar inverters tripped offline during grid frequency events, causing cascading generation loss no single-domain test would have predicted. The 2021 Texas crisis showed the same pattern at system scale: gas supply, generator weatherization, and grid dispatch failed together in ways no siloed test had ever examined.

Why randomization matters

Predictable tests create predictable preparation. When crews know a drill is coming next Tuesday, they staff up and pre-stage equipment. That validates the response plan under ideal conditions. It tells you nothing about what happens at 2 AM on a Sunday with a skeleton crew. Unannounced tests reveal the gaps that scheduled drills hide. In regulated environments, fully unannounced production tests may require pre-coordination with state commissions or NERC. The randomization can still happen within an approved testing window or on isolated feeders. The point is that the teams under test shouldn't know the specific scenario or timing in advance.
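Randomizing within an approved window is straightforward to implement: the regulator sees the window, the teams under test see nothing. A minimal sketch, with invented scenario names:

```python
import random
from datetime import datetime, timedelta

def pick_drill(window_start: datetime, window_end: datetime,
               scenarios: list, rng: random.Random):
    """Pick a random start time inside a pre-approved testing window and a
    random scenario. The window is coordinated with regulators in advance;
    the crews under test see neither the time nor the scenario."""
    span_s = int((window_end - window_start).total_seconds())
    start = window_start + timedelta(seconds=rng.randrange(span_s))
    return start, rng.choice(scenarios)
```

Seeding the generator from a sealed value also lets you prove, after the fact, that the drill time was not chosen to flatter the results.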

§ 04The four-phase safety framework

No reasonable person suggests starting chaos engineering by tripping a 345 kV transmission line during peak load. The path from concept to production follows a deliberate progression that increases exposure as confidence builds.

Phase 1: Digital twin chaos (months 1-6)

All experiments begin in simulation. Using digital twins of grid infrastructure, teams inject failures with zero physical risk. Candidate experiments include DER communications failures (what happens when 500 rooftop solar inverters lose their control signal simultaneously), SCADA degradation (how operators respond when telemetry updates slow from 2-second to 30-second intervals), cascading fault propagation (where does a breaker failure at Substation A create overloads), and cyber-attack scenarios (what a compromised relay at a critical bus actually does to system stability). Phase 1 builds the tooling, trains the teams, generates baseline data. Nothing touches real equipment.
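The SCADA-degradation experiment is easy to sketch in simulation: slow the polling interval and measure how far the operator's display drifts from the field. The toy model below uses an invented voltage ramp and second-resolution samples purely for illustration.

```python
def operator_view(samples, interval_s):
    """What the operator display shows when telemetry refreshes every
    interval_s seconds: values between polls go stale."""
    shown, last = [], samples[0]
    for t, value in enumerate(samples):
        if t % interval_s == 0:
            last = value          # a poll arrives; display updates
        shown.append(last)        # otherwise the stale value persists
    return shown

def max_display_error(samples, interval_s):
    """Worst gap between the true field value and the operator display."""
    view = operator_view(samples, interval_s)
    return max(abs(a - b) for a, b in zip(samples, view))

# Simulated slow voltage sag over one minute: 2-second polling tracks it
# closely, 30-second polling hides most of the excursion until it's deep.
ramp = [1.00 - 0.002 * t for t in range(60)]
```

The quantified hypothesis writes itself: degraded polling must not let display error exceed whatever threshold your operating procedures assume.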

Phase 2: Hardware-in-the-loop (months 6-12)

Real controllers connect to simulated grid models. Physical relays, RTUs, and inverter controllers respond to simulated conditions, including three-phase faults, voltage sags and swells, and frequency deviations. This phase catches firmware bugs, configuration errors, and interoperability issues pure simulation misses. The grid model is virtual, but the equipment responses are real.

Phase 3: Limited field experiments (months 12-18)

Testing moves to physical infrastructure under tight constraints. Experiments run at microgrid scale only, on systems that can be islanded from the bulk grid. All scenarios are pre-approved with documented abort criteria. Automatic protection stays active throughout. Tests run during low-load periods to minimize customer impact. This phase validates that digital twin predictions match physical reality. Discrepancies between simulation and field results become inputs for improving the models.

Phase 4: Production monitoring (ongoing)

Every real disturbance becomes a learning event. Instead of treating unplanned outages as problems to fix and forget, Phase 4 treats them as unplanned experiments to analyze. Rapid post-event analysis compares actual system behavior against model predictions. Continuous model validation keeps digital twins calibrated. Over time, the line between planned chaos experiments and operational monitoring blurs. Every event, planned or unplanned, feeds the same reliability improvement loop.

Start in a digital twin. Graduate to hardware-in-the-loop. Move to microgrid scale. Only then touch production. Every step has an abort trigger.

§ 05Governance scales with risk

Not every experiment needs executive sign-off, and not every experiment belongs in production. A risk-to-approval matrix keeps the program moving without bypassing oversight.

Risk level | Experiment type | Environment | Approval
Low | Single-component failure | Digital twin | Team lead
Medium | Subsystem test | Hardware-in-the-loop | Department head
High | Microgrid scenario | Microgrid | VP Operations
Critical | Cross-system cascade | Microgrid / production-adjacent | C-suite + board
Fig. 03 · Approval tiers for chaos experiments. Higher risk, more eyes.
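Encoded in tooling, the matrix becomes a gate the experiment runner checks before anything executes. A minimal sketch of that lookup, with the tiers taken from the matrix above and a fail-closed default of my own choosing:

```python
# Risk tier -> (environment, required approver), mirroring the matrix above.
APPROVAL = {
    "low":      ("digital twin",                    "team lead"),
    "medium":   ("hardware-in-the-loop",            "department head"),
    "high":     ("microgrid",                       "VP Operations"),
    "critical": ("microgrid / production-adjacent", "C-suite + board"),
}

def required_approver(risk: str) -> str:
    """Who must sign off on an experiment at the given risk level.
    Unknown levels fail closed: escalate to the highest tier."""
    return APPROVAL.get(risk, APPROVAL["critical"])[1]
```

Failing closed on an unrecognized tier is the point: a typo in a risk label should escalate the experiment, never wave it through.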

§ 06The regulatory angle

Utility executives reasonably worry about PUC reactions to "experiments" on regulated infrastructure. Framed correctly, chaos engineering reduces regulatory risk rather than increasing it. Every state PUC tracks reliability metrics. Every NERC standard requires demonstrated capability. Every rate case depends on showing prudent investment decisions. Chaos engineering directly supports all three: verified SAIDI/SAIFI improvement through tested automation, demonstrated NERC CIP-009 recovery plan compliance, and provable ROI on reliability investments.

The framing matters. You aren't gambling with the grid. You're proving it works under conditions you can control, so it keeps working under conditions you can't.

§ 07Next in the series

Part 05 walks through how chaos-validated reliability improvements show up in IEEE 1366 indices — the SAIDI, SAIFI, and CAIDI numbers regulators actually track. SRE doesn't replace 1366. It makes it a leading indicator instead of a lagging one.

— Adam · adam@sgridworks.com · Mar 25, 2026
