Methodology · Reliability Engineering

Circuit Error Budget. SRE for the grid.

Google runs search at 99.95% availability. Your utility runs a feeder at 99.97%. The tools to reason about that budget exist, and they weren't invented at utilities. We're porting them over.


CEB · SERVICE-AREA WEST · FEEDER 104-B · Q3 2025 · ROLLING 90D

Error budget · customer-minutes

SLO 99.97% · budget 131.4 cust-min / month

Planned work (scheduled maintenance): 28% · 36.8 min

Equipment failure (transformer, cable): 41% · 53.9 min

Weather (tree, wind, ice): 19% · 25.0 min

Third party (vehicle, dig-in): 8% · 10.5 min

Unknown (unresolved root cause): 12% · 15.8 min

TOTAL BURN 108% · over budget by 10.6 min

↻ live · updated 2m ago · ceb v1.3 · burn-rate 1.4x
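The arithmetic behind the panel is plain bookkeeping: each cause code takes a share of the budget, and the shares are summed. A minimal sketch, assuming the percentages shown are fractions of the 131.4-minute budget; the function and variable names are illustrative and not part of the ceb codebase.

```python
# Figures copied from the dashboard panel above; names are illustrative only.
BUDGET_MIN = 131.4  # budget for the period, in minutes

CAUSE_SHARES = {
    "planned_work": 0.28,
    "equipment_failure": 0.41,
    "weather": 0.19,
    "third_party": 0.08,
    "unknown": 0.12,
}


def burn_summary(budget_min, shares):
    """Return per-cause minutes, total minutes, total percent, and overage."""
    # Assumption: each share is a fraction of the budget, rounded for display.
    minutes = {cause: round(share * budget_min, 1) for cause, share in shares.items()}
    total_min = round(sum(minutes.values()), 1)
    total_pct = round(100 * sum(shares.values()))
    over_min = round(total_min - budget_min, 1)
    return minutes, total_min, total_pct, over_min


minutes, total_min, total_pct, over_min = burn_summary(BUDGET_MIN, CAUSE_SHARES)
print(f"TOTAL BURN {total_pct}% · over budget by {over_min} min")
```

The shares sum to more than 100% because the feeder has spent more than its allowance; the overage is simply total minutes minus budget.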

§ 01 · The Premise

The software world spent fifteen years turning reliability into math. Utilities have the same math problem.

Translation table

Service Level Objective → SAIDI / SAIFI target
Error budget → Allowable customer-minutes
Burn rate alert → SAIDI pace-to-year forecast
Blameless postmortem → Outage root-cause review
Change freeze → Switching moratorium
Chaos engineering → N-1 contingency drill

The vocabulary differs; the underlying math is the same probability theory. What changes is that utilities have historically treated the budget implicitly, averaged it annually, and reviewed it after the fact.

Every utility publishes SAIDI and SAIFI targets. Most measure against them once a year, in a PUC filing, long after any action could have changed the outcome. That's not how modern reliability engineering works.

A software SRE team running a 99.95% service doesn't wait until December to check the budget. They measure burn rate continuously, alert when a week consumes a month of budget, and treat every error as telemetry rather than blame. The grid version is identical in structure: same math, same cadence, same discipline.

The work isn't inventing new math. It's porting five specific operational practices (budget accounting, burn-rate alerting, blameless postmortems, toil reduction, and change-freeze discipline) into utility reliability workflows. And convincing leadership that their line crews are, and always have been, SREs.

§ 02

The toolkit.

Six open-source modules. Each one ports a specific SRE practice into utility vocabulary, infrastructure, and data models. Use any subset.

M 01 · ceb-budget

Budget accounting.

Continuously rolls your SAIDI/SAIFI targets into remaining-minutes budgets, sliced by feeder, cause-code, and time window. Pulls from OMS, writes to Postgres.

$ ceb budget --feeder 104-B --window 30d
REMAINING: 4.2 min / 131.4
BURN RATE: 1.4x · 14d to zero
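For readers new to the terms: burn rate is the ratio of actual budget consumption to the pace that would spend the budget exactly over the window, and time-to-zero extrapolates the current pace forward. A minimal sketch under a plain linear-extrapolation assumption. The function names and the numbers in the usage lines are hypothetical, and how ceb itself derives these fields is not specified here.

```python
def burn_rate(consumed_min, budget_min, elapsed_days, window_days):
    """Consumption pace relative to an even spend of the budget over the window.

    1.0 means on pace to finish the window with exactly zero budget left.
    """
    if budget_min <= 0 or elapsed_days <= 0 or window_days <= 0:
        raise ValueError("budget and durations must be positive")
    return (consumed_min / budget_min) / (elapsed_days / window_days)


def days_to_zero(consumed_min, budget_min, elapsed_days):
    """Days until the budget is exhausted if the average pace so far continues."""
    if elapsed_days <= 0:
        raise ValueError("elapsed_days must be positive")
    remaining = budget_min - consumed_min
    if remaining <= 0:
        return 0.0  # already exhausted
    pace_per_day = consumed_min / elapsed_days
    if pace_per_day == 0:
        return float("inf")  # nothing consumed yet, no projected exhaustion
    return remaining / pace_per_day


# Hypothetical feeder: 61.32 of 131.4 minutes used, 10 days into a 30-day window.
rate = burn_rate(61.32, 131.4, elapsed_days=10, window_days=30)
left = days_to_zero(61.32, 131.4, elapsed_days=10)
print(f"burn rate {rate:.1f}x, about {left:.1f} days to zero")
```

A linear projection is the simplest choice; a production forecast would likely weight recent events or account for seasonality.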

M 02 · ceb-alert

Burn-rate alerting.

Multi-window, multi-burn-rate alerts in the Google SRE tradition, ported to SCADA/OMS event streams. Catches fast-burn failures in hours and slow-burn drift in weeks.

ALERT · 104-B · 1h-burn 14x
trigger: 3 outages in 47min
paged: ops-shift · acked 2m
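The multi-window idea can be sketched in a few lines: an alert fires only when both a long window and a short window exceed the same burn-rate threshold, so a sustained problem pages quickly and a resolved one stops paging. The sketch below assumes a 30-day budget period and uses the window/threshold pairs commonly cited from Google's SRE Workbook (1h/5m at 14.4x, 6h/30m at 6x, 3d/6h at 1x). Whether ceb-alert uses these exact values is an assumption, and the event format (timestamp, minutes consumed) is hypothetical.

```python
from datetime import datetime, timedelta

PERIOD = timedelta(days=30)  # assumed budget period

# (severity, long window, short window, burn-rate threshold) -- assumed values
RULES = [
    ("page", timedelta(hours=1), timedelta(minutes=5), 14.4),
    ("page", timedelta(hours=6), timedelta(minutes=30), 6.0),
    ("ticket", timedelta(days=3), timedelta(hours=6), 1.0),
]


def window_burn_rate(events, now, window, budget_min):
    """Budget consumed in the trailing window, relative to an even-spend pace."""
    consumed = sum(minutes for ts, minutes in events if now - window < ts <= now)
    allowed = budget_min * (window / PERIOD)  # timedelta / timedelta -> float
    return consumed / allowed


def evaluate(events, now, budget_min):
    """Return the rules for which both the long and short windows are over threshold."""
    fired = []
    for severity, long_w, short_w, threshold in RULES:
        long_rate = window_burn_rate(events, now, long_w, budget_min)
        short_rate = window_burn_rate(events, now, short_w, budget_min)
        if long_rate >= threshold and short_rate >= threshold:
            fired.append((severity, long_w, threshold))
    return fired


# Hypothetical stream: three outages inside 47 minutes, one minute of budget each.
now = datetime(2025, 9, 14, 12, 0)
events = [
    (now - timedelta(minutes=45), 1.0),
    (now - timedelta(minutes=20), 1.0),
    (now - timedelta(minutes=2), 1.0),
]
for severity, long_w, threshold in evaluate(events, now, budget_min=131.4):
    print(f"{severity}: burn over {long_w} exceeds {threshold}x")
```

Treating each outage as a lump of minutes at a single timestamp is a simplification; a real stream would accrue customer-minutes for as long as the outage lasts.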

M 03 · ceb-pm

Blameless postmortem.

A structured postmortem template and review workflow. Built to satisfy NERC/FERC documentation requirements while producing actionable learning, not CYA paperwork.

postmortem · 2025-Q3-014
root: relay misop · vendor FW
actions: 4 open · 1 closed

M 04 · ceb-toil

Toil measurement.

Quantifies the repetitive, automatable work your line crews do. Identifies the top ten toil sources per district and recommends automation candidates ranked by hours saved.

top toil · q3-2025
1. manual switching · 340h/mo
2. meter re-read · 220h/mo

M 05 · ceb-freeze

Change freeze discipline.

Enforces change-window rules when budget is exhausted. Integrates with your work-management system to block non-emergency switching when burn rate exceeds threshold.

FREEZE · active · 6d
reason: budget -15% · cat-3
exceptions: emergency only
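The freeze rule reduces to a small decision function: emergency work always proceeds, and everything else is held while the budget is exhausted or burning too fast. A minimal sketch; the `WorkOrder` type, the function names, and the default threshold of 1.0 are hypothetical and not taken from ceb-freeze.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkOrder:
    """Hypothetical stand-in for a work-management record."""
    order_id: str
    emergency: bool


def freeze_active(remaining_min, burn_rate, burn_threshold=1.0):
    """A freeze applies when the budget is spent or the burn rate is over threshold."""
    return remaining_min <= 0 or burn_rate > burn_threshold


def is_allowed(order, remaining_min, burn_rate, burn_threshold=1.0):
    """Emergency work is always allowed; other work only when no freeze applies."""
    if order.emergency:
        return True
    return not freeze_active(remaining_min, burn_rate, burn_threshold)


# Hypothetical state: budget overspent, burn rate 1.4x.
routine = WorkOrder("WO-1001", emergency=False)
urgent = WorkOrder("WO-1002", emergency=True)
print(is_allowed(routine, remaining_min=-19.7, burn_rate=1.4))  # False
print(is_allowed(urgent, remaining_min=-19.7, burn_rate=1.4))   # True
```

Keeping the rule as a pure function of budget state makes every hold or release easy to audit after the fact.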

M 06 · ceb-drill

Contingency drills.

Scheduled N-1 and N-1-1 drill runner. Executes controlled contingencies on a digital twin, measures actual response time, feeds results back into the budget.

drill · 2025-09-14
scenario: loss of sub-47
MTTR: 18min · target 30

§ 05

What we hear, and what we answer.

The five questions that come up in every exploratory call with a reliability or operations executive.

Q 01 · ADOPTION

"Our crews have run this grid for 40 years. What does software vocabulary add?"

Your crews already run something very close to SRE practice; they just don't have the accounting to prove it to rate-setters, investors, or themselves. CEB gives the operational reality a name, a dashboard, and a defensible record. The crew practice doesn't change; the ability to show that practice working does.

Q 02 · REGULATORY

"How does this sit with NERC / FERC / our state PUC?"

The output is a more granular, more documented version of what you already file. We've structured the postmortem module to produce the exact artifacts most PUCs request in rate cases, and the budget ledger maps one-to-one onto annual SAIDI/SAIFI reporting. Regulators like it more, not less.

Q 03 · INTEGRATION

"We have SCADA, OMS, WMS, GIS, historian, CIS. Which ones does this talk to?"

The modules read from OMS event streams, SCADA historians, and WMS change records. They write nothing back to operational systems; everything is advisory and auditable. The adapters are written against the standard protocols and interfaces that common utility stacks expose, and the integration layer is in the repo.

Q 04 · BLAMELESSNESS

"Blameless postmortem sounds good on a slide. Our union contract is more complicated."

Fair point. Bargaining-unit contracts are the real constraint, not the methodology. The template separates individual actions from systemic factors explicitly, and is structured so reports leaving the bargaining unit don't surface individual names. The practical work is a bargaining letter, not a methodology change. We can help draft the language.

Q 05 · ECONOMICS

"Every dollar of reliability spend has to clear a prudency test. Does the math hold?"

The burn-rate approach is more economically rigorous than calendar-averaged reporting, not less. It produces continuous marginal cost-of-reliability estimates, the exact quantity a prudency review is trying to reconstruct after the fact. The output is the kind of analysis regulators increasingly expect to see in reliability investment filings.