Methodology · Reliability Engineering

Circuit Error Budget. SRE for the grid.

Google runs search at 99.95% availability. Your utility runs a feeder at 99.97%. The tools to reason about that budget exist, and they weren't invented at utilities. We're porting them over.

CEB · SERVICE-AREA WEST · FEEDER 104-B · Q3 2025 · ROLLING 90D
Error budget · customer-minutes
SLO 99.97% · budget 131.4 cust-min / month
Planned work (scheduled maintenance) · 28% · 36.8 min
Equipment failure (transformer, cable) · 41% · 53.9 min
Weather (tree, wind, ice) · 19% · 25.0 min
Third party (vehicle, dig-in) · 8% · 10.5 min
Unknown (unresolved root cause) · 12% · 15.8 min
TOTAL BURN 108% · over budget by 10.6 min
↻ live · updated 2m ago · ceb v1.3 · burn-rate 1.4x
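The arithmetic behind the panel above can be checked in a few lines. This is a minimal sketch: the dictionary keys and the `summarize` function are illustrative names, not part of the ceb toolkit, and the numbers are the ones shown in the panel. Note that each percentage is a share of the budget, not of the total, which is why the shares add up to more than 100%.

```python
# Minutes per cause code, copied from the panel above.
BUDGET_MIN = 131.4
burn_by_cause = {
    "planned_work": 36.8,
    "equipment_failure": 53.9,
    "weather": 25.0,
    "third_party": 10.5,
    "unknown": 15.8,
}


def summarize(budget_min, by_cause):
    """Return total minutes, overrun, and each cause's share of the budget."""
    total = sum(by_cause.values())
    return {
        "total_min": round(total, 1),
        "total_burn_pct": round(100 * total / budget_min),
        "over_budget_min": round(total - budget_min, 1),
        # Share of the *budget* consumed by each cause, in whole percent.
        "share_pct": {
            cause: round(100 * minutes / budget_min)
            for cause, minutes in by_cause.items()
        },
    }


summary = summarize(BUDGET_MIN, burn_by_cause)
print(summary["total_min"], summary["total_burn_pct"], summary["over_budget_min"])
```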
§ 01 · The Premise

The software world spent fifteen years turning reliability into math. Utilities have the same math problem.

Translation table
SRE practice → Utility equivalent
Service Level Objective → SAIDI / SAIFI target
Error budget → Allowable customer-minutes
Burn rate alert → SAIDI pace-to-year forecast
Blameless postmortem → Outage root-cause review
Change freeze → Switching moratorium
Chaos engineering → N-1 contingency drill

The vocabulary differs; the underlying math is the same probability theory. What changes is that utilities historically treated the budget implicitly, averaged annually, and reviewed after the fact.

Every utility publishes SAIDI and SAIFI targets. Most measure against them once a year, in a PUC filing, long after any action could have changed the outcome. That's not how modern reliability engineering works.

A software SRE team running a 99.95% service doesn't wait until December to check the budget. They measure burn rate continuously, alert when a week consumes a month of budget, and treat every error as telemetry rather than blame. The grid version is identical in structure: same math, same cadence, same discipline.
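To make "a week consumes a month of budget" concrete: burn rate is the share of the budget consumed divided by the share of the budget period elapsed, so 1.0 means exactly on pace. The sketch below assumes a 30-day budget period; the function name is illustrative.

```python
def burn_rate(budget_fraction_used: float, period_fraction_elapsed: float) -> float:
    """Share of budget consumed divided by share of the budget period elapsed.

    1.0 means the budget runs out exactly at the end of the period;
    anything above 1.0 means it runs out early.
    """
    if period_fraction_elapsed <= 0:
        raise ValueError("period_fraction_elapsed must be positive")
    return budget_fraction_used / period_fraction_elapsed


# One week that consumes a full 30-day budget (assumed period length).
week_rate = burn_rate(1.0, 7 / 30)
print(f"{week_rate:.2f}x")
```

Under that assumption the week in question is burning at roughly 4.3 times the sustainable pace, which is the kind of signal a continuous check surfaces and an annual filing cannot.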

The work isn't inventing new math. It's porting five specific operational practices (budget accounting, burn-rate alerting, blameless postmortems, toil reduction, and change-freeze discipline) into utility reliability workflows. And convincing leadership that their line crews are, and always have been, SREs.

§ 02

The toolkit.

Six open-source modules. Each one ports a specific SRE practice into utility vocabulary, infrastructure, and data models. Use any subset.

M 01 · ceb-budget

Budget accounting.

Continuously rolls your SAIDI/SAIFI targets into remaining-minutes budgets, sliced by feeder, cause-code, and time window. Pulls from OMS, writes to Postgres.

$ ceb budget --feeder 104-B --window 30d
REMAINING: 4.2 min / 131.4
BURN RATE: 1.4x · 14d to zero
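One way such a ledger could be rolled up is sketched below. The `Outage` record shape, the `budget_status` function, and the sample outages are all hypothetical; they are not ceb-budget's actual schema or data, and a real OMS export will look different.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Outage:
    """Hypothetical record shape, for illustration only."""
    feeder: str
    cause: str
    restored_at: datetime
    saidi_minutes: float  # customer-minutes interrupted / customers served


def budget_status(outages, feeder, now, window_days, budget_min):
    """Sum minutes for one feeder inside a trailing window, sliced by cause."""
    start = now - timedelta(days=window_days)
    by_cause = defaultdict(float)
    for o in outages:
        if o.feeder == feeder and start < o.restored_at <= now:
            by_cause[o.cause] += o.saidi_minutes
    used = sum(by_cause.values())
    return {"used": used, "remaining": budget_min - used, "by_cause": dict(by_cause)}


# Made-up sample data.
outages = [
    Outage("104-B", "equipment", datetime(2025, 9, 5), 28.0),
    Outage("104-B", "weather", datetime(2025, 9, 14), 34.0),
    Outage("104-B", "equipment", datetime(2025, 9, 22), 22.0),
    Outage("104-B", "weather", datetime(2025, 7, 1), 40.0),   # outside the window
    Outage("207-A", "planned", datetime(2025, 9, 10), 15.0),  # different feeder
]
status = budget_status(outages, "104-B", datetime(2025, 9, 30), 30, 131.4)
print(status)
```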
M 02 · ceb-alert

Burn-rate alerting.

Multi-window multi-burn-rate alerts in the Google SRE tradition, ported to SCADA/OMS event streams. Catches fast-burn failures in hours, slow-burn drift in weeks.

ALERT · 104-B · 1h-burn 14x
trigger: 3 outages in 47min
paged: ops-shift · acked 2m
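The multi-window, multi-burn-rate pattern can be sketched as follows. The thresholds and window pairs are the example values from the Google SRE Workbook for a 30-day budget; the thresholds ceb-alert actually ships with may differ, and the `BurnRule` and `fired_rules` names are illustrative. A rule fires only when both its long and its short window exceed the threshold, so a spike that has already subsided does not page anyone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRule:
    long_min: int     # long window, minutes
    short_min: int    # short window, minutes (1/12 of the long window)
    threshold: float  # burn-rate multiple that must be exceeded
    severity: str


# SRE Workbook example values for a 30-day budget (an assumption here).
RULES = (
    BurnRule(60, 5, 14.4, "page"),        # 2% of budget in 1 hour
    BurnRule(360, 30, 6.0, "page"),       # 5% of budget in 6 hours
    BurnRule(4320, 360, 1.0, "ticket"),   # 10% of budget in 3 days
)


def fired_rules(burn_by_window, rules=RULES):
    """burn_by_window maps window length in minutes to measured burn rate."""
    fired = []
    for rule in rules:
        long_rate = burn_by_window.get(rule.long_min, 0.0)
        short_rate = burn_by_window.get(rule.short_min, 0.0)
        # Both windows must be hot: the long one shows the burn is material,
        # the short one shows it is still happening.
        if long_rate > rule.threshold and short_rate > rule.threshold:
            fired.append(rule)
    return fired


fast_burn = {5: 20.0, 30: 20.0, 60: 16.0, 360: 3.0, 4320: 0.5}
stale_spike = {5: 1.0, 60: 16.0}  # hot hour, but the last 5 minutes are calm
print([r.long_min for r in fired_rules(fast_burn)])
print(fired_rules(stale_spike))
```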
M 03 · ceb-pm

Blameless postmortem.

Structured postmortem template and review workflow. Built to satisfy NERC/FERC documentation requirements while producing actionable learning, not CYA paperwork.

postmortem · 2025-Q3-014
root: relay misop · vendor FW
actions: 4 open · 1 closed
M 04 · ceb-toil

Toil measurement.

Quantifies the repetitive, automatable work your line crews do. Identifies the top ten toil sources per district and recommends automation candidates, ranked by hours saved.

top toil · q3-2025
1. manual switching · 340h/mo
2. meter re-read · 220h/mo
M 05 · ceb-freeze

Change freeze discipline.

Enforces change-window rules when budget is exhausted. Integrates with your work-management system to block non-emergency switching when burn rate exceeds threshold.

FREEZE · active · 6d
reason: budget -15% · cat-3
exceptions: emergency only
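A freeze rule of this kind reduces to a small, auditable policy function. The sketch below is an assumption about how such a rule might look, not ceb-freeze's implementation: `WorkOrder`, `review`, and the default `burn_threshold` of 2.0 are hypothetical. It returns a recommendation rather than acting on anything.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkOrder:
    """Hypothetical work-order shape, for illustration only."""
    order_id: str
    emergency: bool


def freeze_active(remaining_fraction: float, burn_rate: float,
                  burn_threshold: float = 2.0) -> bool:
    """Freeze when the budget is exhausted or the burn rate exceeds the threshold."""
    return remaining_fraction <= 0.0 or burn_rate > burn_threshold


def review(order: WorkOrder, remaining_fraction: float, burn_rate: float) -> str:
    """Return an advisory 'allow' or 'hold' for a switching request."""
    if order.emergency:
        return "allow"  # emergency work is always exempt
    if freeze_active(remaining_fraction, burn_rate):
        return "hold"
    return "allow"


# Budget at -15%, as in the freeze notice above.
print(review(WorkOrder("WO-1", emergency=False), -0.15, 1.4))
print(review(WorkOrder("WO-2", emergency=True), -0.15, 1.4))
```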
M 06 · ceb-drill

Contingency drills.

Scheduled N-1 and N-1-1 drill runner. Executes controlled contingencies on a digital twin, measures actual response time, feeds results back into the budget.

drill · 2025-09-14
scenario: loss of sub-47
MTTR: 18min · target 30
§ 03

The operational loop.

Observe → decide → act → review. The same cycle any SRE team runs, adapted to the time constants and stakes of utility operations.

CIRCUIT ERROR BUDGET LOOP
01 · OBSERVE: SCADA · OMS · AMI · SLOs; continuous · 15min max lag (seconds → minutes)
02 · DECIDE: burn rate · freeze rules; policy-driven · auditable (minutes → hours)
03 · ACT: switching · dispatch · hold; crew · automation · SCADA (hours → days)
04 · REVIEW: blameless PM · drill; feeds back into targets (days → quarter)
§ 04

What a burn-rate chart looks like.

Actual SAIDI minutes consumed, plotted against the month's allowable budget, with short- and long-window burn rates. This is Q3 2025 on Feeder 104-B.

Feeder 104-B · 90-DAY ROLLING · SLO 99.97%

Burn rate · 1h: 1.4×
Budget consumed: 108%
Days to zero: 14d

[Chart: cumulative SAIDI minutes (y-axis 0 to 140) against ideal pace and the 131.4 min monthly budget, Sep 01 to Sep 30. Outage events marked: transformer · 28min, storm · 34min, relay · 22min.]
§ 05

What we hear, and what we answer.

The five questions that come up in every exploratory call with a reliability or operations executive.

Q 01 · ADOPTION
"Our crews have run this grid for 40 years. What does software vocabulary add?"

Your crews already run something very close to SRE practice; they just don't have the accounting to prove it to rate-setters, investors, or themselves. CEB gives the operational reality a name, a dashboard, and a defensible record. The crew practice doesn't change; the ability to show that practice working does.

Q 02 · REGULATORY
"How does this sit with NERC / FERC / our state PUC?"

The output is a more granular, more documented version of what you already file. We've structured the postmortem module to produce the exact artifacts most PUCs request in rate cases, and the budget ledger maps one-to-one onto annual SAIDI/SAIFI reporting. Regulators like it more, not less.

Q 03 · INTEGRATION
"We have SCADA, OMS, WMS, GIS, historian, CIS. Which ones does this talk to?"

The modules read from OMS event streams, SCADA historians, and WMS change records. They write nothing back to operational systems; everything is advisory and auditable. The adapters are written against the standard protocols and interfaces that common utility stacks expose; the integration layer is in the repo.

Q 04 · BLAMELESSNESS
"Blameless postmortem sounds good on a slide. Our union contract is more complicated."

Fair point. Bargaining-unit contracts are the real constraint, not the methodology. The template separates individual actions from systemic factors explicitly, and is structured so reports leaving the bargaining unit don't surface individual names. The practical work is a bargaining letter, not a methodology change. We can help draft the language.

Q 05 · ECONOMICS
"Every dollar of reliability spend has to clear a prudency test. Does the math hold?"

The burn-rate approach is more economically rigorous than calendar-averaged reporting, not less. It produces continuous marginal cost-of-reliability estimates, the exact quantity a prudency review is trying to reconstruct after the fact. The output is the kind of analysis regulators increasingly expect to see in reliability investment filings.

Next step

Read the essay, or run the diagnostic.

The 12,000-word methodology paper is the theoretical foundation. The one-week reliability diagnostic is the practical starting point. Most utilities do both, in that order.

Read the methodology essay →
Schedule a 30-min intro