
SRE Doesn't Replace IEEE 1366. It Makes It Better.

Series: SRE for Power Grids — Part 5 of 7
Part 1: Why Your Grid Is Already Running SRE | Part 2: The Grid Is a Network (Coming Soon) | Part 3: Why N-1/N-2 Can't Keep Up (Coming Soon) | Part 4: Chaos Engineering for the Grid (Coming Soon) | Part 5: SRE + IEEE 1366 | Part 6: The $10 Million SAIDI Improvement (Coming Soon) | Part 7: Building SRE Culture (Coming Soon)

IEEE 1366 reliability indices -- SAIDI (System Average Interruption Duration Index), SAIFI (System Average Interruption Frequency Index), CAIDI (Customer Average Interruption Duration Index) -- are regulatory requirements that are not going away. Defined in IEEE Std 1366-2022 and required by 35+ state public utility commissions (per LBNL and EIA reporting surveys), they serve as annual report cards for regulators, benchmarking tools for the industry, and accountability mechanisms for utilities. Every reliability engineer in the country knows these numbers. They know the filing deadlines, the peer benchmarks, and the consequences of missing targets.

So when we talk about applying Site Reliability Engineering to grid operations, the first question from any experienced utility professional is fair: "How does this fit with what we already report?"

The answer is straightforward. SRE does not replace IEEE 1366. SRE is the operational engine. SAIDI and SAIFI are the report card. One drives daily decisions. The other summarizes annual outcomes. The problem most utilities face is that they have the report card without the engine. They know last year's SAIDI was 120 minutes (close to the national average of 118-126 minutes excluding major events, per recent EIA Electric Power Annual data). They do not know, with any confidence, what this year's SAIDI will be until December.

The Temporal Hierarchy

SRE and IEEE 1366 are not competing frameworks. They operate at entirely different timescales, and understanding this temporal hierarchy is what makes them complementary.

| Layer | Timeframe | Purpose | Metrics |
|---|---|---|---|
| SRE Operations | Real-time to 30 days | Operational guidance, early warning | SLIs, SLOs, burn rates |
| SRE Tactical | 30 to 90 days | Improvement prioritization | Error budget trends, toil analysis |
| SRE Strategic | 90 days to 1 year | Investment planning | Reliability forecasting |
| IEEE 1366 | Annual | Regulatory reporting, benchmarking | SAIDI, SAIFI, CAIDI |

The relationship flows in one direction. SRE practices run continuously and produce improved operations. Improved operations produce better IEEE 1366 metrics at the annual filing. Real-time SLIs (Service Level Indicators) predict SAIDI and SAIFI trends months in advance, giving operators the ability to intervene proactively rather than explain poor results after the fact.

As we discussed in Part 3, traditional contingency planning like N-1 and N-2 analysis gives you a snapshot of system capability at a single point in time. SRE's continuous measurement fills the gap between those snapshots. IEEE 1366 provides the annual summary at the end. Together, they form a complete picture: real-time awareness, continuous improvement, and standardized reporting.

How SRE Drives SAIDI, SAIFI, and CAIDI Down

The connection between SRE practices and IEEE 1366 outcomes is not abstract. Each SRE mechanism maps to specific improvements in specific indices. Here is how that works across all three.

SAIDI Improvement Mechanisms

SAIDI (System Average Interruption Duration Index) measures total outage duration per customer served. Every minute saved across every outage compounds into SAIDI improvement.
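The arithmetic behind the three indices is worth making concrete. A minimal sketch with invented outage records (not a production OMS query):

```python
# Compute IEEE 1366 indices from sustained outage events.
# Each event: (customers_interrupted, duration_minutes). Data is hypothetical.

def ieee1366_indices(events, customers_served):
    ci = sum(c for c, _ in events)          # CI: customer interruptions
    cmi = sum(c * d for c, d in events)     # CMI: customer-minutes of interruption
    saidi = cmi / customers_served          # minutes per customer served
    saifi = ci / customers_served           # interruptions per customer served
    caidi = cmi / ci if ci else 0.0         # minutes per interruption
    return saidi, saifi, caidi

# A 100,000-customer utility with three sustained outages in a year (illustrative)
events = [(5_000, 90), (12_000, 45), (800, 240)]
saidi, saifi, caidi = ieee1366_indices(events, customers_served=100_000)
print(f"SAIDI={saidi:.1f} min, SAIFI={saifi:.3f}, CAIDI={caidi:.1f} min")
# → SAIDI=11.8 min, SAIFI=0.178, CAIDI=66.4 min
```

Every minute shaved off any event reduces CMI directly, which is why duration improvements compound straight into SAIDI.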

| SRE Practice | Mechanism | SAIDI Impact |
|---|---|---|
| FLISR Response Time SLO | Faster fault location and restoration | Direct CAIDI reduction leading to SAIDI improvement |
| Automated Switching | Reduced manual dispatch time | Minutes saved per outage event |
| Predictive Maintenance | Prevent failures before occurrence | Fewer, shorter outages |
| Burn Rate Alerting | Early warning of degrading reliability | Intervene before trend becomes outage |

SAIFI Improvement Mechanisms

SAIFI (System Average Interruption Frequency Index) counts how often customers experience interruptions. Reducing SAIFI requires preventing outages entirely, not just restoring faster.

| SRE Practice | Mechanism | SAIFI Impact |
|---|---|---|
| Fault Detection SLI | Faster identification of incipient failures | Prevent interruptions before they occur |
| Condition Monitoring | Predictive asset replacement | Reduce equipment-driven failures |
| Chaos Engineering | Validate protection systems under stress | Prevent false trips and cascading failures |
| Toil Reduction | More time for preventive work | Shift from reactive to proactive maintenance |

CAIDI Improvement Mechanisms

CAIDI (Customer Average Interruption Duration Index) is the ratio of SAIDI to SAIFI: total customer-minutes of interruption divided by total customer interruptions. It is often treated as a proxy for average restoration time, and automation can dramatically improve it. But CAIDI deserves careful interpretation, especially at utilities deploying grid modernization.

| SRE Practice | Mechanism | CAIDI Impact |
|---|---|---|
| FLISR Automation | Sub-minute restoration vs. hours | Reduces CMI and CI; net CAIDI effect depends on outage mix (see caution below) |
| Dispatch Automation | Faster crew assignment and routing | Reduced response time for manual restoration |
| Mobile Outage Management | Real-time crew tracking and coordination | Optimized restoration sequences across multiple events |

The FLISR numbers are not theoretical. Utilities deploying automated FLISR report 30-70% CAIDI reductions depending on circuit topology and automation coverage (documented in EPRI distribution automation studies and individual utility rate case filings from FPL, ComEd, and others). None of these mechanisms require abandoning existing processes. They augment what reliability teams already do by adding measurement, targets, and feedback loops to practices that often exist only as institutional knowledge or tribal procedure.

A Caution on CAIDI in Grid Modernization

IEEE Std 1366-2022 includes Annex D, an informative appendix titled "Understanding CAIDI," dedicated entirely to cautioning against oversimplifying this index. The annex opens by noting that "CAIDI is a frequently misunderstood electric distribution reliability index" and warns against treating it at face value as a measure of average interruption duration.

The reason matters for any utility deploying automation. CAIDI equals CMI (customer minutes of interruption) divided by CI (customer interruptions). When FLISR automates restoration for a large number of short-duration faults, it reduces CI dramatically because many customers who would have experienced a sustained interruption now experience a momentary one (or none at all). But the remaining interruptions that require manual crew response tend to be longer and harder to fix. If CMI does not drop proportionally to CI, CAIDI goes up even though the utility is objectively performing better. SAIDI improves. SAIFI improves. But the ratio skews.
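The effect is easy to reproduce with invented numbers. In the sketch below, FLISR converts most short sustained interruptions into momentaries, which drop out of the sustained-interruption counts entirely; the figures are illustrative only:

```python
# Illustrative before/after FLISR deployment. Each entry:
# (customer interruptions, average duration in minutes). Made-up numbers.
customers = 100_000

# Before: many short faults plus some long manual-repair outages
before = [(20_000, 30), (5_000, 180)]
# After: FLISR turns most short faults into momentaries (excluded from
# sustained indices); the long manual-repair outages remain
after = [(2_000, 30), (5_000, 180)]

def indices(mix, n):
    ci = sum(c for c, _ in mix)
    cmi = sum(c * d for c, d in mix)
    return cmi / n, ci / n, cmi / ci    # SAIDI, SAIFI, CAIDI

saidi_b, saifi_b, caidi_b = indices(before, customers)
saidi_a, saifi_a, caidi_a = indices(after, customers)
print(f"before: SAIDI={saidi_b:.1f}  SAIFI={saifi_b:.2f}  CAIDI={caidi_b:.1f}")
print(f"after:  SAIDI={saidi_a:.1f}  SAIFI={saifi_a:.2f}  CAIDI={caidi_a:.1f}")
# SAIDI falls (15.0 → 9.6), SAIFI falls (0.25 → 0.07), yet CAIDI rises
# (60 → ~137) because the easy fixes left the denominator
```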

This is not a theoretical concern. Utilities in the middle of grid modernization programs have reported CAIDI increases in regulatory filings while SAIDI and SAIFI were both improving. Without the Annex D context, that looks like degrading restoration performance. With it, it reflects a healthier system where automation has removed the easy fixes from the denominator.

The SRE approach helps here. Because SRE tracks SLIs at the operational level, not just aggregate indices, you can decompose CAIDI into automated-restoration events and manual-restoration events. Report them separately. Show regulators that automated CAIDI is sub-minute while manual CAIDI is stable or improving. This is exactly the kind of contextual reporting that Annex D encourages and that a single CAIDI number obscures.
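A sketch of that decomposition, assuming each outage record carries a restoration-type tag (the field names here are illustrative, not an actual OMS schema):

```python
# Split outage events by restoration type and report CAIDI separately.
# Events: (customers, duration_minutes, restored_by), restored_by in
# {"flisr", "crew"}. Hypothetical records for illustration.

def caidi_by_type(events):
    out = {}
    for kind in ("flisr", "crew"):
        sub = [(c, d) for c, d, k in events if k == kind]
        ci = sum(c for c, _ in sub)
        cmi = sum(c * d for c, d in sub)
        out[kind] = cmi / ci if ci else None
    return out

events = [
    (3_000, 0.8, "flisr"),   # automated restoration, sub-minute
    (1_500, 0.6, "flisr"),
    (4_000, 150, "crew"),    # manual restoration, hours
    (900, 210, "crew"),
]
split = caidi_by_type(events)
for kind, val in split.items():
    print(f"{kind} CAIDI: {val:.1f} min")
# flisr comes out sub-minute (~0.7 min); crew is around 161 min
```

Reporting the two numbers side by side preserves the restoration-time story that the blended ratio obscures.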

Error Budgets as Outage Budgets

The SRE concept of an error budget translates directly into the utility world as an outage budget. The math is simple, and it connects real-time operations to annual regulatory outcomes.

Instead of starting from an abstract availability percentage, start from the number that matters: your SAIDI target. If your regulatory target is 120 minutes per year, your monthly outage budget is 10 minutes. That is the amount of cumulative customer-interruption-minutes you can consume each month and still hit your annual number.

Now define burn rate as the ratio of actual budget consumption to planned budget consumption. If you consume exactly 10 minutes of SAIDI in January, your burn rate is 1.0x. Project forward from there:

  • Burn rate 0.5x (5 min/month): Trending toward 60 minutes per year. Excellent performance. You have budget to take calculated risks or accelerate modernization work.
  • Burn rate 1.0x (10 min/month): Tracking to 120 minutes per year. On target, steady state.
  • Burn rate 2.0x (20 min/month): Tracking to 240 minutes per year. Double the target. Warrants immediate investigation and intervention.
  • Burn rate 4x+ (40+ min/month, monthly budget exhausted in week one): Annualized rate over 480 minutes. Emergency footing. All discretionary work stops.

The burn rate becomes the leading indicator that SAIDI and SAIFI have always lacked. Instead of waiting until year-end to discover you missed your target, you see the trend developing in real time. And because the math is anchored to your actual regulatory target, not a generic availability SLO, every operator in the control room immediately understands what the numbers mean.

| Burn Rate | Monthly SAIDI Consumption | Projected Year-End SAIDI | Operational Response |
|---|---|---|---|
| 0.5x | 5 min | 60 min (50% of target) | Healthy. Spend budget on modernization risk. |
| 1.0x | 10 min | 120 min (on target) | Steady state. Monitor for seasonal upticks. |
| 2.0x | 20 min | 240 min (2x target) | Investigate. Prioritize worst feeders. |
| 4x+ | 40+ min | 480+ min (breach certain) | Emergency. All discretionary work stops. |
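The monthly check behind this table is a few lines of arithmetic. A sketch using the article's 120-minute example target:

```python
# Project year-end SAIDI from year-to-date budget consumption (illustrative).

def outage_budget_status(saidi_target_min, ytd_consumed_min, months_elapsed):
    planned = saidi_target_min / 12 * months_elapsed   # budget spent if on plan
    burn_rate = ytd_consumed_min / planned             # actual vs. planned
    projected = ytd_consumed_min / months_elapsed * 12 # naive annualization
    return burn_rate, projected

# 20 minutes of SAIDI consumed by end of February, 120-minute annual target
burn, projected = outage_budget_status(120, ytd_consumed_min=20, months_elapsed=2)
print(f"burn rate {burn:.1f}x, projected year-end SAIDI {projected:.0f} min")
# → burn rate 1.0x, projected year-end SAIDI 120 min
```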

A 0.5x burn rate also signals something operationally useful: you have budget to take risks. That might mean scheduling maintenance during a period with moderate weather risk, or deploying a firmware update to field devices that could temporarily reduce automation coverage. In SRE terms, you spend error budget on innovation. In utility terms, you use reliability margin to accelerate grid modernization.

A 4x+ burn rate triggers a different response entirely. All discretionary maintenance stops. Automation systems get additional monitoring. Crews pre-stage in high-risk areas. The burn rate gives operations leadership a quantitative basis for those decisions instead of relying on gut feel or waiting for the next storm to confirm what they already suspected.

SRE burn rates also help with a persistent headache in IEEE 1366 reporting: major event days (MEDs). IEEE 1366 excludes days that exceed a statistical threshold (2.5 Beta method), but the line between "normal operations" and "major event" is a frequent source of regulatory dispute. By tracking baseline SLIs and burn rates excluding storm days, utilities create cleaner forecasts for normal operations and build a stronger evidentiary record when regulators review MED exclusions. The data trail shows exactly when performance shifted from normal degradation to event-driven, with timestamps, not judgment calls.
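For reference, the 2.5 Beta calculation itself is short: fit a log-normal to the daily SAIDI history, then flag any day exceeding T_MED = exp(alpha + 2.5·beta). The sketch below uses an invented ten-day history; real filings use five years of daily data, and the standard has its own rule for zero-SAIDI days (simply excluded here):

```python
import math

def t_med(daily_saidi):
    # alpha, beta: mean and standard deviation of ln(daily SAIDI)
    logs = [math.log(x) for x in daily_saidi if x > 0]  # zeros excluded here
    n = len(logs)
    alpha = sum(logs) / n
    beta = math.sqrt(sum((v - alpha) ** 2 for v in logs) / (n - 1))
    return math.exp(alpha + 2.5 * beta)

# Invented calm-weather history of daily SAIDI contributions (minutes/day)
history = [0.2, 0.5, 0.3, 0.8, 0.4, 0.6, 0.3, 0.7, 0.5, 0.4]
threshold = t_med(history)
print(f"T_MED ≈ {threshold:.2f} min/day")
for day_saidi in (0.9, 12.0):
    status = "major event day" if day_saidi > threshold else "normal day"
    print(f"daily SAIDI {day_saidi}: {status}")
```

Against this history the threshold lands around 1.3 minutes per day, so the 12-minute storm day is flagged and the 0.9-minute day is not.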

The Regulatory Conversation Shift

Utilities file IEEE 1366 metrics with state public utility commissions annually. These filings are backward-looking by design. They tell regulators what happened. They say nothing about what is happening now or what will happen next. SRE changes that conversation.

| Metric | Traditional Reporting | SRE-Enhanced Reporting |
|---|---|---|
| SAIDI Reported | Last year's actual | Current year projected with confidence interval |
| Trend Direction | Unknown until year-end | Real-time burn rate with monthly granularity |
| Intervention Status | Reactive, after target breach | Proactive measures documented and timestamped |
| Confidence Interval | Single point estimate | Probabilistic forecast with upper and lower bounds |
| Financial Impact | Discovered after PBR penalty assessed | Mid-year forecast ties SAIDI trajectory to penalty/incentive exposure |

Consider the difference in regulatory narrative.

Traditional: "Our SAIDI was 120 minutes last year. We plan to do better this year."

SRE-enhanced: "Our SAIDI is trending toward 90 minutes this year. We detected increased burn rate in Q2 due to storm activity, implemented enhanced FLISR automation on the three worst-performing feeders, and have 95% confidence of meeting our 100-minute target."

The second statement demonstrates operational control. It shows the utility knows where it stands, why it stands there, what it did about emerging problems, and where it expects to land. Regulators and intervenors respond differently to a utility that demonstrates this level of situational awareness versus one that delivers a single number and a vague improvement plan.

This matters even more in jurisdictions with performance-based ratemaking. When reliability targets carry financial penalties or incentives, the ability to forecast SAIDI mid-year and course-correct is not just operationally useful. It directly affects revenue. A utility that discovers in November it will miss its SAIDI target has no time to respond. A utility tracking burn rates in real time can intervene in June.

Three-Phase Implementation

Connecting SRE operations to IEEE 1366 outcomes is not a rip-and-replace effort. It layers onto existing reliability programs in three phases.

Phase 1: Correlation Analysis (Months 1 to 3)

Start with historical data. Correlate existing operational metrics with past SAIDI and SAIFI outcomes. Most utilities already collect SCADA data, outage management system records, and field crew response times. The work in this phase is analytical: identify which measurable indicators best predict IEEE 1366 results. Not every metric matters equally. In one Midwestern utility's historical regression, FLISR response time and vegetation-related fault frequency explained 68% of annual SAIDI variance (R² = 0.68). Two variables, two-thirds of the outcome explained. Find your high-leverage SLIs before building anything.

Phase 2: SLO Setting (Months 3 to 6)

With correlation analysis complete, set SLO (Service Level Objective) targets for the SLIs that matter most. These targets should guarantee IEEE 1366 compliance with a safety margin. If your SAIDI regulatory target is 120 minutes and your analysis shows that a FLISR response time SLO of under 5 minutes correlates with SAIDI below 100 minutes, set the SLO at 5 minutes. The 20-minute buffer between your projected outcome and your regulatory target is your safety margin. It accounts for major event days, model uncertainty, and the unexpected. Error budgets flow from these SLOs and become the operational currency for decision-making.

Phase 3: Operational Integration (Months 6 to 12)

Build real-time IEEE 1366 forecasting dashboards driven by SRE burn rates. Integrate these forecasts into monthly reliability review meetings and quarterly regulatory filings where applicable. Train operations staff to read burn rates the same way they read system load forecasts: as actionable predictions, not retrospective statistics. By the end of this phase, the utility should be able to answer the question "Where will our SAIDI land this year?" on any given day, with a confidence interval, and explain what levers are available to change the trajectory.
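A toy version of that daily forecast projects year-end SAIDI from year-to-date consumption with a rough normal-approximation interval. It is illustrative only: it ignores seasonality and storm clustering, which a real model must handle:

```python
# Phase 3 sketch: projected year-end SAIDI with a rough 95% interval.
import math
import random
from statistics import mean, stdev

random.seed(7)
# Hypothetical daily SAIDI consumption for the first 180 days (minutes/day)
daily = [max(0.0, random.gauss(0.30, 0.15)) for _ in range(180)]

days_remaining = 365 - len(daily)
projected = sum(daily) + days_remaining * mean(daily)
# Interval on the remaining-days projection from day-to-day variability,
# assuming independent days (a simplification)
half_width = 1.96 * stdev(daily) * math.sqrt(days_remaining)
print(f"projected year-end SAIDI: {projected:.0f} min "
      f"(95% CI {projected - half_width:.0f} to {projected + half_width:.0f})")
```

The point is the shape of the answer, not the model: on any given day the dashboard reports a projection plus bounds, which is what the monthly reliability review and the regulatory narrative both consume.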

The total investment is modest relative to most grid modernization programs. The primary cost is analytical and organizational, not capital. Most of the data already exists. The gap is in how it is used.

From Reporting to Operating

IEEE 1366 was designed for regulatory reporting and industry benchmarking. It does that job well. But it was never designed to guide daily operations. The annual timescale is too slow. The aggregated metrics are too coarse. No control room operator has ever made a real-time switching decision based on projected annual SAIDI.

SRE fills that operational gap. SLIs give operators real-time signals. SLOs give them targets. Error budgets give them decision-making authority. Burn rates give managers early warning. And at the end of the year, the IEEE 1366 numbers reflect the cumulative effect of thousands of better-informed operational decisions.

The framework is complementary by design. SRE does not ask utilities to stop reporting SAIDI. It asks them to stop being surprised by it.

In Part 6, we put dollars on this. A 30-minute SAIDI improvement for a 1,000 MW utility avoids $10 million per year in outage costs. The SRE program that delivers it costs $2 million. The economics are more favorable than most utilities expect.


About Sisyphean Gridworks

Sisyphean Gridworks helps utilities measure, manage, and improve grid reliability using the same operational discipline that keeps the internet running. We work with operations teams to implement SLOs, error budgets, and structured incident analysis so that reliability decisions are driven by data, not habit.