
Why your grid is already running SRE.

You already have FLISR. You already run storm drills. You already rely on reclosers to self-heal most faults. What you do not have is a single framework that ties those capabilities together — and measures the result.

Adam Brown · Author
12 min · Reading time
Mar 3, 2026 · Published
Part 2 of 7 · SRE for the Grid

Your utility already practices most of what Google calls Site Reliability Engineering. FLISR is network failover. Reclosers are retry-with-backoff. Storm drills are chaos engineering. Protection relay testing is fault injection. Black start is disaster recovery. You just don't call them that, and you don't do them systematically.

That last part is the whole problem. The techniques are sound. They've evolved over a century of keeping the lights on. But each one sits in its own lane with its own team and its own scorecard, and the coordination between them is mostly nonexistent. What SRE actually adds is the operating framework: targets that span the silos, feedback loops that catch drift early, one playbook instead of six.

This is the second piece in the seven-part port. Part 01 argued that SAIDI is a currency, not a scorecard. Part 02 makes a narrower claim: you're already running most of the machinery. The connective tissue is what's missing.

§ 01 · The mapping table

Before going deeper, look at the parallels side by side. Every core SRE concept has a direct operational equivalent already deployed on most distribution systems.

| SRE concept | Grid equivalent | How it works |
| --- | --- | --- |
| Automated failover | FLISR | Detects a fault, isolates the failed segment, reroutes power through alternate paths. Same logic as shifting traffic to a healthy server. |
| Retry with backoff | Reclosers | After a fault trips a breaker, reclosers attempt to re-energize at increasing intervals. About 80% of faults (tree contact, animal strikes) self-clear. The line recovers without human intervention. |
| Chaos engineering | Storm drills & tabletops | Simulate major events to test response, crew coordination, and system recovery. Same intent as Netflix's Chaos Monkey: break things on purpose to find weaknesses before they find you. |
| Fault injection testing | Protection relay testing | Secondary injection sends synthetic fault signals to relays to verify they trip correctly and within tolerance. Validates the protection system before a real fault demands it. |
| Disaster recovery · cold start | Black start | Restoring from total blackout using designated generation that can start without external power. Analogous to rebuilding a data center from bare metal. |
Fig. 01 · Five SRE primitives. Five grid practices. Different vocabulary, same probability theory.

These aren't loose metaphors; they're production-grade capabilities delivering real results. Ameren Missouri prevented 160,000 customer outages in 2025 using one of them: automated smart switching tied to FLISR. Not a pilot program. A real distribution system, real customers, measurably fewer minutes of outage.

The operational logic is identical to software SRE. In software, the techniques are wired into one framework with shared metrics, monitoring, and explicit reliability targets. At most utilities, they sit in separate programs run by separate groups.

§ 02 · FLISR is automated failover

FLISR is the clearest example. When a fault hits a distribution feeder, FLISR does three things: locates the fault from sensor data, isolates the faulted segment by opening switches, restores service by closing tie switches to healthy feeders. The whole sequence can finish in under a minute. For customers on the healthy segments, the outage barely registers.

A load balancer does exactly this when a server goes down. Detect the failure, pull the bad node from the pool, redirect traffic. The grid version has tighter constraints (you can't spin up a new feeder the way you spin up a cloud instance), but the pattern is the same.
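To make the parallel concrete, here is a minimal sketch of the locate-isolate-restore sequence written as failover logic. It is illustrative only: the segment model, sensor flags, and switch names are assumptions for the sketch, not any vendor's ADMS interface.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    name: str
    saw_fault_current: bool   # set by line sensors between the source and the fault
    upstream_switch: str
    downstream_switch: str

def flisr(feeder: list[Segment], tie_switch: str) -> dict:
    """Locate, isolate, restore: FLISR as a failover routine.

    Locate: sensors between the substation and the fault see fault
    current; sensors beyond it don't. The faulted segment is the last
    one (walking from the source) whose sensor flag is set.
    Isolate: open the switches bounding that segment.
    Restore: close the tie switch so healthy downstream segments are
    back-fed from the adjacent feeder.
    """
    faulted = None
    for segment in feeder:                 # feeder is ordered source -> end of line
        if segment.saw_fault_current:
            faulted = segment              # keep the last flagged segment
    if faulted is None:
        return {"action": "none"}          # no fault indicated anywhere
    return {
        "action": "reconfigure",
        "open": [faulted.upstream_switch, faulted.downstream_switch],
        "close": [tie_switch],             # the "healthy server" path
        "out_of_service": [faulted.name],  # only this segment waits on a crew
    }

feeder = [
    Segment("seg-1", True,  "sw-0", "sw-1"),
    Segment("seg-2", True,  "sw-1", "sw-2"),   # fault is on this segment
    Segment("seg-3", False, "sw-2", "tie-1"),
]
print(flisr(feeder, tie_switch="tie-1"))
# {'action': 'reconfigure', 'open': ['sw-1', 'sw-2'], 'close': ['tie-1'],
#  'out_of_service': ['seg-2']}
```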

Here's what makes the SRE framing useful. Ameren can measure what FLISR prevented. Most utilities can't. They know how many outages they had; they rarely know how many they avoided. SRE's emphasis on quantified reliability, tracking near-misses alongside failures, turns FLISR from a "good project" into a capability with a known contribution to system reliability.

You can't manage what you can't count. Counting what FLISR avoided is the difference between owning a capability and merely deploying one.

§ 03 · Reclosers are retry-with-backoff

Reclosers are the most elegant piece of grid automation on the distribution system, and they predate software retry logic by decades. When a fault trips a circuit, the recloser waits a defined interval, then re-energizes the line. If the fault persists, it trips again, waits longer, tries again. After a set number of attempts (typically three or four), it locks out and waits for a crew.

That's retry-with-backoff. The increasing interval is the backoff. The lockout is a circuit breaker in the software sense, not just the electrical one. And it works. About 80% of distribution faults are transient: a tree branch contacts a line, wind blows it clear, the line is fine. Reclosers handle those without dispatching a truck or registering a sustained outage.
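Written as code, the same loop is recognizable to anyone who has implemented a retry policy. The intervals and attempt count below are illustrative assumptions; a real recloser takes its settings from a time-current coordination study.

```python
import time

RECLOSE_INTERVALS_S = [2.0, 15.0, 30.0]   # increasing waits: the backoff

def recloser_cycle(line_still_faulted) -> str:
    """A recloser sequence as retry-with-backoff.

    The breaker has already tripped on the initial fault.
    `line_still_faulted` is a callable that reports whether fault
    current reappears when the line is re-energized.
    """
    for attempt, wait_s in enumerate(RECLOSE_INTERVALS_S, start=1):
        time.sleep(wait_s)                 # wait before re-energizing
        if not line_still_faulted():
            return f"re-energized on attempt {attempt}"   # transient fault cleared
        # fault current again: trip, escalate the interval, retry
    return "lockout"   # permanent fault; the software-style circuit breaker

# The ~80% case: a branch blew clear while the line was de-energized.
# (Intervals are realistic, so this demo takes about 17 seconds to run.)
checks = iter([True, False])
print(recloser_cycle(lambda: next(checks)))   # -> re-energized on attempt 2
```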

In software terms, that's an 80% self-healing rate at the edge. Most distributed systems would celebrate those numbers. Most utilities don't mention them.

§ 04 · What is missing is the operating discipline

The individual practices are strong. Run them in isolation, though, and you forfeit the compounding value. FLISR data could inform predictive maintenance priorities. Recloser patterns could feed real-time reliability dashboards. Storm drill findings could reshape how protection testing is scheduled. None of it happens when each practice lives in its own silo with its own metrics. Capability isn't the bottleneck. Coordination is. Three places where that shows up:

Periodic instead of continuous

Protection relay testing runs on a cycle, often every few years. Storm drills run annually or seasonally. In SRE, fault injection and game days are continuous processes tied to deployment cycles and system changes. A relay that tested fine three years ago may not respond correctly to a fault profile that didn't exist three years ago, especially as DER penetration reshapes fault currents across the system.

Why this matters more every year

A distribution system with 30% behind-the-meter DER has different fault current magnitudes, different directionality, and different islanding risks than the same system did at 5% DER. The relay settings, the protection coordination study, and the test plan all need to move with it. A three-year test cycle can't keep up with a one-year change cycle.
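One concrete mechanism is protection "blinding": DER sited between a relay and a downstream fault feeds the fault directly, so the relay measures only part of the fault current. The sketch below uses toy numbers and ignores impedance effects entirely; it shows the direction of the problem, not a coordination study.

```python
FAULT_CURRENT_A = 7_000   # total current into a downstream fault (toy value)
RELAY_PICKUP_A = 6_000    # relay trips when it measures at least this much

def relay_measures(der_infeed_a: float) -> float:
    # Simplification: DER between the relay and the fault supplies part
    # of the fault current, and that share never passes through the relay.
    return FAULT_CURRENT_A - der_infeed_a

for der_infeed_a in (0, 600, 1_500):      # rising DER penetration over the years
    seen = relay_measures(der_infeed_a)
    verdict = "trips" if seen >= RELAY_PICKUP_A else "BLINDED: no trip"
    print(f"DER infeed {der_infeed_a:>5} A -> relay sees {seen} A -> {verdict}")

# DER infeed     0 A -> relay sees 7000 A -> trips
# DER infeed   600 A -> relay sees 6400 A -> trips
# DER infeed  1500 A -> relay sees 5500 A -> BLINDED: no trip
```

A relay tested against the first row three years ago passed. The same relay in the third-row system fails silently until a real fault finds it.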

Qualitative instead of measured

Utilities measure SAIDI, SAIFI, and CAIDI. These are trailing indicators reported annually. SRE operates on Service Level Objectives: quantified reliability targets with real-time monitoring. An SLO states: "this feeder will deliver 99.95% availability, measured as minutes of unplanned outage per month." That translates directly to an error budget.

error_budget(month) = (1 − SLO) × 43,200 min · EQ. 01

At 99.95% availability, the error budget is 21.6 minutes per month. If a feeder burns through its budget by mid-month, specific things get triggered: expedited maintenance, delayed construction work, tighter inspection. At 99.9%, the budget is 43.2 minutes per month. Either way, every minute of outage debits the budget. When the budget runs low, you slow down changes and focus on stability. When budget remains, you can invest in upgrades and accept the associated risk.
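EQ. 01 is small enough to run. Here is a minimal sketch, assuming the 30-day month from the text; the thresholds and the actions they trigger are made-up illustrations, not a standard.

```python
# Error budget per EQ. 01, over a 30-day (43,200-minute) month.
MINUTES_PER_MONTH = 30 * 24 * 60   # 43,200

def error_budget_min(slo: float) -> float:
    """Monthly unplanned-outage allowance implied by an availability SLO."""
    return (1 - slo) * MINUTES_PER_MONTH

def budget_state(slo: float, outage_min_so_far: float) -> str:
    # Illustrative thresholds: when to shift from "invest" to "stabilize".
    remaining = error_budget_min(slo) - outage_min_so_far
    if remaining <= 0:
        return "exhausted: freeze non-critical work, expedite maintenance"
    if remaining < 0.25 * error_budget_min(slo):
        return "low: slow down changes, tighten inspection"
    return "healthy: budget available for upgrades and planned risk"

print(round(error_budget_min(0.9995), 1))  # 21.6 min at 99.95%
print(round(error_budget_min(0.999), 1))   # 43.2 min at 99.9%
print(budget_state(0.9995, 18.0))          # low: slow down changes, ...
```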

The budget makes the tradeoff between reliability and progress explicit and measurable. Utilities that adopt error budgets manage reliability like a P&L line item.

Siloed instead of systematic

FLISR is a distribution automation project. Reclosers are a protection engineering function. Storm drills are an emergency management activity. Black start is a transmission planning procedure. Each one has its own metrics, its own budget, its own leadership chain. SRE consolidates them into a single reliability function with shared objectives. That consolidation is where the compounding returns come from.

§ 05 · The evidence: reliability is getting harder, not easier

This matters now because the grid's reliability trajectory is pointed in the wrong direction. U.S. customers averaged roughly 11 hours of power outages in 2024, nearly double the average over the prior decade. Major weather events drive the headline numbers, but the trend persists even after adjusting for extreme events. The underlying system is getting harder to keep reliable.

The reasons are structural. An aging asset base (average transmission line age over 40 years in many regions), increasing weather severity, load growth from electrification and data centers, and the integration of millions of distributed energy resources that change how power flows through the system. The grid was designed as a one-directional delivery network. It's becoming a bidirectional, dynamically reconfiguring mesh. That complexity increase is not incremental.

For context on what's at stake: the 2003 Northeast blackout. Triggered by a software bug in an alarm system that interacted with overgrown trees contacting a transmission line. 55 million people without power, $6 billion in estimated cost. The grid equivalent of a distributed-system outage that starts with one failed health check and ends with total service collapse.

99.97 to 99.999 isn't a rounding error. Those two numbers describe two different levels of engineering discipline.

Compare reliability numbers across industries. Google's production systems target 99.999% availability, about 5.3 minutes of downtime per year. In a typical non-major-event year, the U.S. electric grid hovers around 99.97%, roughly 2.6 hours of downtime, about thirty times the five-nines budget that runs the world's largest digital systems. 2024's 11-hour average shows how fast that margin disappears when conditions worsen.
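The gap is easier to feel as downtime than as nines. It's the same arithmetic as EQ. 01, stretched over a year:

```python
# Downtime implied by an availability level over a 365-day year.
MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600

for label, availability in [("five nines (Google target)", 0.99999),
                            ("typical non-major-event grid year", 0.9997)]:
    downtime_min = (1 - availability) * MINUTES_PER_YEAR
    print(f"{label}: {downtime_min:.1f} min/yr")

# five nines (Google target): 5.3 min/yr
# typical non-major-event grid year: 157.7 min/yr  (~2.6 hours, ~30x five nines)
```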

No one expects the grid to hit five nines tomorrow. But the direction of travel matters, and right now the numbers are moving away from that target. The techniques to close part of that gap already exist inside utility operations. They just aren't wired together.

§ 06 · From techniques to a system

SRE takes the reliability practices you already own (FLISR, reclosers, protection testing, storm drills, black start) and makes them systematic, measured, and continuous. The raw material is already there. These are production capabilities delivering production results.

What they lack is connective tissue. Shared reliability targets across the silos. Error budgets that make the tradeoffs explicit. Continuous validation instead of periodic testing. Automated monitoring that catches degradation before it becomes an outage.

§ 07 · Next in the series

Part 03 makes the next argument. The tool utilities currently use to prove reliability is N-1/N-2 contingency planning, a pass/fail test for a world that's probabilistic. The 2003 blackout happened in a system where every component passed N-1. SRE gives you a better instrument. If reliability is a budget, contingency planning is the capital expenditure review.

— Adam · adam@sgridworks.com · Mar 3, 2026
