Every utility distribution engineer has, at some point, been handed a SAIDI target. The number usually arrives from the state commission, or from rate-case testimony, or from a benchmarking report that pulled it off an IEEE 1366 survey chart. The number itself (128 minutes, 94 minutes, 212 minutes) gets treated like a grade on a report card. You hit it or you don't. If you don't, you write a paragraph explaining the weather.
That's not how reliability works in any other infrastructure industry. It's certainly not how reliability works at Google, where the phrase "site reliability engineering" was coined, or at Cloudflare and the other hyperscalers who adopted it. In those organizations the SAIDI-equivalent, the error budget, isn't a scorecard. It's a currency. You're given an amount. You spend it. If you underspend it, you release more features. If you overspend it, you freeze.
That's the foundational idea of SRE, and it ports cleanly onto the distribution grid. The rest of this essay is the port.
§ 01 What BGP got right in 1994
The internet's inter-domain routing protocol, BGP-4, was finalized in 1994. Nobody on the original spec committee believed BGP would survive the decade. They shipped it anyway because the alternative, waiting for a better protocol, meant shipping nothing. The bet was that a reliability regime built around acceptable failure rates would out-compete a regime built around no failures.
They were right. BGP still routes the internet. And the reason it works is that the organizations running it (Level 3, NTT, Cogent, the hyperscalers) don't try to prevent failures. They budget them.
The rate of change of the system is a function of how much failure budget the team has left to spend. — SRE Book, Ch. 3
That sentence, with two words changed, is how distribution reliability should work. Replace "team" with "operating area" and "spend" with "accumulate" and you have a coherent reliability methodology for a feeder. But to see why, we need to translate one more concept.
§ 02 FLISR is the physical-layer equivalent
In distribution, FLISR (Fault Location, Isolation, and Service Restoration) is the closest thing to an internet-grade self-healing protocol that the industry ships. It's what happens when a midline recloser senses a permanent fault, communicates with its upstream and downstream peers, opens the right switches, closes the right ties, and restores service to the un-faulted section in (if everything works) 60 to 90 seconds.
FLISR is to a distribution feeder what BGP route reconvergence is to a failed AS-to-AS peering. Both are automated isolation-and-restoration protocols triggered by a fault signal. Both assume failure is a normal operating mode. Both trade latency during the failure for durability after it.
Neither of them eliminates failure. They budget it.
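To make "budgeting the failure" concrete, here is a toy model of the FLISR tradeoff. A minimal sketch only: the feeder layout, customer counts, restoration time, and the four-hour truck-roll baseline are all invented for illustration, not taken from any real scheme.

```python
from dataclasses import dataclass

@dataclass
class Section:
    name: str
    customers: int

# Toy feeder with three switchable sections; counts and timings
# are illustrative assumptions.
sections = [Section("S1", 410), Section("S2", 620), Section("S3", 280)]

def flisr_savings(faulted: str, restore_seconds: float = 90.0,
                  truck_roll_min: float = 240.0) -> float:
    """Customer-minutes saved by isolating the faulted section and
    restoring the others in restore_seconds, versus everyone waiting
    out a truck roll of truck_roll_min minutes."""
    saved = 0.0
    for s in sections:
        if s.name == faulted:
            continue  # the faulted section waits for repair either way
        saved += s.customers * (truck_roll_min - restore_seconds / 60.0)
    return saved

print(f"{flisr_savings('S2'):,.0f} customer-minutes saved")  # 164,565
```

The point of the toy is the shape of the trade: the un-faulted sections eat 90 seconds of interruption instead of four hours, and the difference is the budget FLISR hands back to you.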
§ 03 Translating SAIDI into a budget
Here's the translation. Your commission gives you a SAIDI target of, say, 128 minutes per customer per year. That target isn't a threshold; it's an annual budget. You're allowed to spend up to 128 minutes of interruption, averaged across every customer you serve. Spend them wisely.
Because you have 365 days to spend 128 minutes, you can derive a daily burn rate: 128 / 365 ≈ 0.351 minutes/customer/day. And because you have a feeder-level distribution of outage severity, you can derive a per-event cost. On a system serving roughly 22,500 customers, a six-hour outage on a 620-customer lateral costs you 360 × 620 / 22,500 ≈ 9.9 customer-minutes of system-wide SAIDI. You just spent 7.7% of your annual budget in one afternoon.
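The arithmetic is short enough to keep in a script. A minimal sketch in Python; the 22,500-ish system size is the assumption implied by the 9.9-minute figure above, not data from any real utility.

```python
# Budget arithmetic from the paragraph above. SYSTEM_CUSTOMERS is an
# assumption (the system size implied by the 9.9-minute example).
ANNUAL_BUDGET_MIN = 128      # SAIDI target: minutes per customer per year
SYSTEM_CUSTOMERS = 22_550    # assumed system size

daily_allowance = ANNUAL_BUDGET_MIN / 365
print(f"daily allowance: {daily_allowance:.3f} min/customer/day")  # 0.351

def event_cost(duration_min: float, customers_out: int) -> float:
    """SAIDI contribution of a single outage, in minutes per customer."""
    return duration_min * customers_out / SYSTEM_CUSTOMERS

cost = event_cost(duration_min=360, customers_out=620)  # six-hour lateral outage
print(f"event cost: {cost:.1f} min, {cost / ANNUAL_BUDGET_MIN:.1%} of budget")
# event cost: 9.9 min, 7.7% of budget
```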
The burn rate, the SAIDI you've actually accumulated to date divided by the allowance you'd budgeted for the same period, is the single number every operations lead should have above their desk. Above 1.0, you're overspending your budget, and the second half of the year needs to cost less than the first. Below 1.0, you have slack, and slack is permission to take risk.
§ 04 The organizational implication
Once you have a budget, you have the thing every reliability organization has always wanted and never had: a principled answer to "should we do this risky thing?" If the burn rate is under 1.0, and the thing is within your remaining budget, and the thing pays back long-term, the answer is yes. If the burn rate is already at 1.2, the answer is no, and the answer is specifically "no, we're freezing new configurations until the burn rate comes back."
This is the SRE concept of a change freeze. A small, boring, organizational lever. The most valuable piece of machinery an operations team can install.
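The decision rule is mechanical enough to write down. A sketch of the gate, using the thresholds this essay uses (1.0, 1.1, 1.2); the function name, return strings, and the thresholds themselves are illustrative, not an industry standard.

```python
# Gate for risky changes, keyed to the normalized burn rate
# (actual SAIDI to date / budgeted SAIDI to date). Thresholds are
# the ones used in this essay; tune them to your own risk appetite.
def change_gate(burn_rate: float, event_cost_min: float,
                remaining_budget_min: float) -> str:
    if burn_rate >= 1.2:
        return "FREEZE: no new configurations until burn rate recovers"
    if burn_rate >= 1.1:
        return "HOLD: essential work only; flag last change for rollback"
    if burn_rate < 1.0 and event_cost_min <= remaining_budget_min:
        return "GO: within budget, and slack is permission to take risk"
    return "DEFER: re-evaluate at the next week-close"

print(change_gate(burn_rate=0.84, event_cost_min=9.9,
                  remaining_budget_min=41.0))
# GO: within budget, and slack is permission to take risk
```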
What a freeze actually looks like on a distribution system
- No new tie-switch automation commissions until burn rate returns below 1.1
- No non-essential SCADA point additions. Every new point is a new failure surface
- Rollback pending: the last configuration change is flagged for reversion if burn rate stays high at the next week-close
- Any pilot whose risk envelope overlaps the over-budget feeder is paused
None of this is punitive. It's the same logic a software SRE team uses when they freeze deploys during an incident with a wide blast radius. It works because everyone knows the rule before the rule matters. The rule is in the budget, and the budget isn't secret.
The point of a reliability budget isn't that you hit it. The point is that when you miss, everyone in the room already knew what happens next.
§ 05 What this is not
This methodology doesn't replace the state commission's reliability reporting. You'll still file your Form 3-M, still explain the weather in a paragraph, still be benchmarked against your peers. What it does is give your internal team a faster, finer-grained instrument than the commission's annual scorecard, one that lets you make tradeoffs in March instead of discovering them in October.
It also doesn't need new hardware, new vendors, or a transformation program. You can compute a burn rate in a DuckDB query against the outage-history table you already have. The hard part isn't the data. The hard part is the organizational agreement that the budget is the budget.
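For concreteness, here's what that query might look like through DuckDB's Python API. The file name and columns (outage_history.parquet, began_at, restored_at, customers_out) are placeholders for whatever your OMS actually exports; the system size is the same assumption as above.

```python
import duckdb

SYSTEM_CUSTOMERS = 22_550   # assumed system size, as above
ANNUAL_BUDGET_MIN = 128     # SAIDI target: minutes/customer/year

# Year-to-date SAIDI spend and normalized burn rate from an outage log.
row = duckdb.sql(f"""
    WITH spend AS (
        SELECT
            sum(date_diff('minute', began_at, restored_at) * customers_out)
                / {SYSTEM_CUSTOMERS}                         AS saidi_ytd,
            greatest(date_diff('day', date_trunc('year', current_date),
                               current_date), 1)             AS days_elapsed
        FROM 'outage_history.parquet'
    )
    SELECT saidi_ytd,
           saidi_ytd / (days_elapsed * {ANNUAL_BUDGET_MIN} / 365.0) AS burn_rate
    FROM spend
""").fetchone()

print(f"SAIDI YTD: {row[0]:.1f} min, burn rate: {row[1]:.2f}")
```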
§ 06 Next in the series
Part 02 looks at what your utility is already doing that counts as SRE without being called that: FLISR as network failover, reclosers as retry-with-backoff, storm drills as chaos engineering. The techniques are already in production. What's missing is the operating discipline that ties them together.
— Adam · adam@sgridworks.com · Mar 11, 2026