Prerequisite: Complete Guide 04: Predictive Asset Maintenance first. This guide extends the binary "will it fail?" classifier into a survival model that answers "when will it fail?" and estimates remaining useful life.
What You Will Learn
In Guide 04 you built an XGBoost classifier that labels each transformer as "likely to fail" or "not." That is useful for ranking, but it throws away a crucial dimension: time. Survival analysis models the full time-to-failure distribution, which lets you estimate when an asset will fail—not just whether it will. In this guide you will:
- Prepare survival data with time-to-event and censoring indicators
- Fit Kaplan–Meier survival curves and compare groups of transformers
- Run log-rank tests to determine whether survival differences are statistically significant
- Build a Cox Proportional Hazards regression model with asset and environmental covariates
- Interpret hazard ratios to understand which factors accelerate failure
- Predict individual survival curves and remaining useful life for every transformer
- Create a risk-weighted replacement priority schedule that accounts for consequence
Why survival analysis? Traditional classification discards information about transformers that are still running. Survival analysis treats these as "censored" observations—we know they survived at least this long, even if we don't know when they will eventually fail. This makes much better use of your data, especially when failures are rare.
SP&L Data You Will Use
- transformers.csv (load_transformers()): ~21,000 transformers with age_years, kva_rating, manufacturer, phase, and status
- outage_history.csv (load_outage_history()): outage events with feeder_id, cause_code, affected_customers, and equipment_involved (we derive a maintenance proxy from equipment-failure outages)
- weather_data.csv (load_weather_data()): 8,760 hourly records with temperature, humidity, wind, solar irradiance, and heatwave and storm flags
Additional Libraries
Having trouble? Check our Troubleshooting Guide for solutions to common setup and data loading issues.
Load Transformer and Event Data
We start by loading the same datasets from Guide 04, plus weather data that we will use later to build environmental covariates for the Cox model.
Outage events: 160
Weather records: 8,760
Survival Analysis vs Classification
Before we write more code, let's understand why survival analysis is different from the binary classifier you built in Guide 04.
In Guide 04 you asked: "Will this transformer fail?" The answer was 0 or 1. But consider two transformers that have never failed:
- Transformer A: Installed in 2020 (6 years old, no failure)
- Transformer B: Installed in 1986 (40 years old, no failure)
A binary classifier treats both the same: has_failed = 0. But Transformer B surviving 40 years is much more informative than Transformer A surviving 6 years. Survival analysis captures this through a concept called censoring.
What is censoring? A transformer is "right-censored" if it has not failed by the end of our observation period. We know it survived at least this long, but we don't know its true failure time. Instead of throwing away this data (as binary classification effectively does), survival analysis uses it to estimate the survival function more accurately.
Survival analysis gives us three outputs that classification cannot:
- Survival curve: The probability of surviving beyond any given age
- Hazard function: The instantaneous failure rate at a given age
- Remaining useful life: A per-asset estimate of expected time until failure
Prepare Survival Data
Survival analysis requires two columns for each subject: the duration (time from installation to failure or current date) and an event indicator (1 = failure observed, 0 = still running / censored).
Failed (event=1): 4,891
Censored (event=0): 16,224
Duration range: 0.5 – 45.0 years
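The duration and event columns can be derived along these lines. This is a minimal sketch rather than the guide's own code: the function name to_survival_record, the date inputs, and the observation cut-off date are assumptions made for illustration, and only the first failure per transformer is considered.

```python
from datetime import date

DAYS_PER_YEAR = 365.25


def to_survival_record(install_date, first_failure_date, observation_end):
    """Return (duration_years, event) for one transformer.

    event = 1 if a failure was observed on or before observation_end,
    otherwise 0 (right-censored at observation_end).
    """
    if first_failure_date is not None and first_failure_date <= observation_end:
        end, event = first_failure_date, 1
    else:
        end, event = observation_end, 0
    duration_years = (end - install_date).days / DAYS_PER_YEAR
    return round(duration_years, 2), event


observation_end = date(2025, 12, 31)  # assumed cut-off, not from the dataset
records = [
    # Failed after 28 years in service: event = 1
    to_survival_record(date(1990, 6, 1), date(2018, 6, 1), observation_end),
    # Still running at the cut-off: censored, event = 0
    to_survival_record(date(2020, 1, 1), None, observation_end),
]
for duration, event in records:
    print(f"duration={duration} years, event={event}")
```

A failure recorded after the cut-off is treated as censored, which keeps every duration measured against the same observation window.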
Important: Using only the first failure per transformer is a simplification. In reality, a transformer can be repaired and fail again. More advanced models (recurrent event models) handle this, but the single-event approach is the right starting point and is standard practice in asset management.
Sample size consideration: With ~21,000 transformers in the SP&L dataset, you have a substantial population for survival analysis, but statistical power depends on the number of observed failures rather than the number of assets. The standard rule of thumb for Cox PH is 10–15 events per covariate, so a model with 6 covariates needs roughly 60–90 failure events. That is well within reach given the thousands of transformers on feeders that have experienced equipment-failure outages. The penalizer=0.1 setting used when fitting the Cox model adds ridge regularization, which helps stabilize estimates when covariates are correlated. You can also stratify by manufacturer, voltage class, or service territory to identify subpopulation-specific risk factors.
Kaplan–Meier Survival Curves
The Kaplan–Meier estimator is the foundation of survival analysis. It produces a non-parametric estimate of the survival function: the probability of surviving beyond a given time. No assumptions about the shape of the curve are needed.
The curve starts at 1.0 (all transformers alive at year 0) and drops as failures occur. The steps down represent observed failures; the flat segments between steps include censored observations. The shaded band shows the 95% confidence interval.
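To make the estimator concrete, here is a dependency-free sketch of the product-limit calculation on a toy dataset (the six durations and event flags are invented). At each distinct failure time the running survival probability is multiplied by 1 - d/n, where d is the number of failures at that time and n is the number still at risk. In practice a library such as lifelines (KaplanMeierFitter) would do this and add the confidence band.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct failure time."""
    n_at_risk = len(durations)
    survival = 1.0
    curve = []
    # Walk through the distinct observed times in ascending order.
    for t in sorted(set(durations)):
        failures = sum(1 for d, e in zip(durations, events) if d == t and e == 1)
        leaving = sum(1 for d in durations if d == t)  # failures + censored at t
        if failures:
            # Censored units at time t still count as at risk for this step.
            survival *= 1.0 - failures / n_at_risk
            curve.append((t, survival))
        n_at_risk -= leaving
    return curve


durations = [5, 8, 8, 12, 15, 20]  # years in service (toy values)
events = [1, 0, 1, 1, 0, 1]        # 1 = failed, 0 = censored
curve = kaplan_meier(durations, events)
for t, s in curve:
    print(f"S({t}) = {s:.4f}")
```

The censored units at years 8 and 15 never cause a step down, but they do shrink the at-risk count for later steps, which is how censoring still informs the curve.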
Compare Survival by Manufacturer
Do transformers from different manufacturers fail at different rates? Plot separate Kaplan–Meier curves to find out.
Compare Survival by kVA Rating Group
Log-Rank Test: Are Differences Significant?
Eyeballing curves is useful, but the log-rank test gives a formal statistical answer to the question: "Do these two groups have significantly different survival experiences?"
Interpreting the log-rank test: The p-value is the probability of seeing a difference at least this large if the groups actually shared the same survival distribution. A p-value below 0.05 means such a difference would arise less than 5% of the time by chance alone, so it is conventionally treated as statistically significant. This helps you decide whether manufacturer selection (or feeder assignment, or kVA rating) genuinely affects transformer longevity, or whether apparent differences are just noise.
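The test statistic itself is simple to compute by hand. The sketch below is a two-group log-rank test written without dependencies, using invented toy data: at each failure time it compares the failures observed in group A with the number expected if both groups shared one hazard, then refers the result to a chi-square distribution with one degree of freedom. The function name and data are illustrative; lifelines provides lifelines.statistics.logrank_test for real use.

```python
import math


def logrank_test(durations_a, events_a, durations_b, events_b):
    """Two-group log-rank test. Returns (chi_square, p_value), 1 degree of freedom."""
    pooled = ([(d, e, 0) for d, e in zip(durations_a, events_a)]
              + [(d, e, 1) for d, e in zip(durations_b, events_b)])
    failure_times = sorted({d for d, e, _ in pooled if e == 1})
    observed_a = expected_a = variance = 0.0
    for t in failure_times:
        n_a = sum(1 for d, _, g in pooled if d >= t and g == 0)  # at risk in A
        n_b = sum(1 for d, _, g in pooled if d >= t and g == 1)  # at risk in B
        d_a = sum(1 for d, e, g in pooled if d == t and e == 1 and g == 0)
        d_b = sum(1 for d, e, g in pooled if d == t and e == 1 and g == 1)
        n, d_total = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d_total * n_a / n
        if n > 1:
            # Hypergeometric variance of the failure count in group A.
            variance += d_total * (n_a / n) * (n_b / n) * (n - d_total) / (n - 1)
    if variance == 0.0:
        return 0.0, 1.0
    chi_square = (observed_a - expected_a) ** 2 / variance
    # Upper tail of chi-square with 1 d.o.f.: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return chi_square, p_value


# Toy data: group A fails early, group B fails late.
a_durations, a_events = [5, 6, 7, 8, 9, 10], [1, 1, 1, 1, 1, 1]
b_durations, b_events = [20, 22, 25, 28, 30, 30], [1, 1, 1, 1, 0, 0]
chi_square, p_value = logrank_test(a_durations, a_events, b_durations, b_events)
print(f"chi-square = {chi_square:.2f}, p = {p_value:.4f}")
```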
Cox Proportional Hazards Model
Kaplan–Meier curves compare groups, but they don't handle multiple continuous covariates simultaneously. The Cox Proportional Hazards (PH) model is the regression equivalent for survival data. It estimates how each covariate affects the hazard (failure rate) while controlling for the others.
Concordance index: This is the survival analysis analogue of AUC. It measures how well the model ranks pairs of transformers by failure time: among comparable pairs, it is the fraction in which the transformer with the higher predicted risk fails first. A concordance of 0.5 means the model is no better than random guessing; 1.0 means perfect ranking. As a rough rule of thumb, values above about 0.65 are often considered useful in engineering applications.
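The pairwise ranking idea can be sketched in a few lines. This is a simplified version of Harrell's concordance index on invented data: a pair counts as comparable only when the earlier unit was actually observed to fail, and pairs with tied durations are ignored, which a full implementation would handle more carefully.

```python
def concordance_index(durations, events, risk_scores):
    """Fraction of comparable pairs where the higher-risk unit fails first."""
    concordant = tied = comparable = 0
    n = len(durations)
    for i in range(n):
        for j in range(n):
            # Comparable: unit i is observed to fail strictly before unit j's time.
            if events[i] == 1 and durations[i] < durations[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1
                elif risk_scores[i] == risk_scores[j]:
                    tied += 1  # tied predictions earn half credit
    return (concordant + 0.5 * tied) / comparable


durations = [2, 4, 6, 8]
events = [1, 1, 0, 1]
good_scores = [0.9, 0.7, 0.4, 0.1]  # higher risk assigned to earlier failures
print(concordance_index(durations, events, good_scores))
print(concordance_index(durations, events, good_scores[::-1]))
print(concordance_index(durations, events, [0.5, 0.5, 0.5, 0.5]))
```

The censored unit at year 6 can still be the later member of a pair, because it is known to have outlived the units that failed at years 2 and 4.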
Why a penalizer? Even with ~21,000 transformers, adding an L2 penalty (penalizer=0.1) reduces overfitting and stabilizes coefficient estimates. This is ridge regularization, and it is especially valuable when covariates such as age and weather exposure are correlated. A stronger penalty (0.1 instead of 0.01) also makes the singular-matrix errors that correlated covariates can trigger less likely.
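To show what the penalty does mechanically, the sketch below fits a Cox model with a single covariate by solving for the point where the derivative of the ridge-penalized partial log-likelihood is zero. Several things here are assumptions for illustration: the toy data, the penalty form (penalizer / 2) * beta**2, the search bracket of [-10, 10], and the use of bisection. A library fit handles many covariates and tied failure times, and may scale the penalty differently.

```python
import math


def penalized_score(beta, durations, events, x, penalizer):
    """Derivative of the ridge-penalized Cox partial log-likelihood in beta."""
    score = 0.0
    for t_i, e_i, x_i in zip(durations, events, x):
        if e_i != 1:
            continue  # censored rows contribute only through the risk sets
        # Risk set: every unit still under observation at time t_i.
        risk = [x_j for t_j, x_j in zip(durations, x) if t_j >= t_i]
        weights = [math.exp(beta * x_j) for x_j in risk]
        weighted_mean = sum(w * x_j for w, x_j in zip(weights, risk)) / sum(weights)
        score += x_i - weighted_mean
    return score - penalizer * beta  # derivative of -(penalizer / 2) * beta**2


def fit_cox_1d(durations, events, x, penalizer=0.1, lo=-10.0, hi=10.0):
    """Find the root of the score by bisection (the score decreases in beta)."""
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if penalized_score(mid, durations, events, x, penalizer) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Toy fleet: x is age in decades at the start of observation (invented values).
durations = [12, 10, 9, 3, 7, 5]
events = [0, 1, 1, 1, 0, 1]
x = [1.0, 2.0, 1.5, 4.0, 2.5, 3.5]

beta = fit_cox_1d(durations, events, x)
beta_weak = fit_cox_1d(durations, events, x, penalizer=0.001)
print(f"penalizer=0.1:   coef={beta:.3f}, hazard ratio={math.exp(beta):.3f}")
print(f"penalizer=0.001: coef={beta_weak:.3f}")
```

Because the penalty term pulls the score toward zero at smaller values of beta, the coefficient fitted with penalizer=0.1 sits closer to zero than the weakly penalized one.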
Interpret Hazard Ratios
The Cox model output includes hazard ratios for each covariate. A hazard ratio greater than 1 means the factor increases failure risk; less than 1 means it is protective.
Reading hazard ratios: If age_years has an HR of 1.04, each additional year of age multiplies the failure hazard by 1.04 (a 4% increase), holding the other covariates constant. Conversely, if kva_rating_scaled has an HR of 0.85, each 100 kVA increase reduces the hazard by 15% (larger units may be newer or better maintained). These are interpretable insights that asset managers can act on.
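Hazard ratios compound multiplicatively rather than additively, which is easy to get wrong when extrapolating over several units. The short sketch below works through the two example values from the text; the helper name hazard_ratio is hypothetical.

```python
import math


def hazard_ratio(coef, delta=1.0):
    """Multiplicative change in hazard for a covariate increase of `delta` units."""
    return math.exp(coef * delta)


coef_age = math.log(1.04)  # Cox coefficient corresponding to HR = 1.04 per year
coef_kva = math.log(0.85)  # coefficient corresponding to HR = 0.85 per 100 kVA

hr_ten_years = hazard_ratio(coef_age, 10)  # 1.04 ** 10, about 1.48
hr_200_kva = hazard_ratio(coef_kva, 2)     # 0.85 ** 2, about 0.72

print(f"10 extra years of age: hazard x {hr_ten_years:.2f}")
print(f"200 kVA larger rating: hazard x {hr_200_kva:.2f}")
```

Ten extra years at HR 1.04 therefore raise the hazard by about 48%, not 40%.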
Predict Individual Survival Curves
The Cox model can produce a personalized survival curve for each transformer, conditioned on its specific covariates. This gives you remaining useful life (RUL) estimates for every asset in the fleet.
What is remaining useful life? RUL estimates how many years a transformer has left before it reaches a 50% failure probability. A transformer with RUL = 2.3 years is expected to reach its median lifetime in about 2.3 years. Transformers with RUL near zero have already exceeded their predicted median and are operating on borrowed time.
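The RUL definition above can be read straight off a predicted survival curve. The sketch below assumes the curve is available as parallel lists of times and survival probabilities (invented values here) and implements that definition: median lifetime minus current age, floored at zero. It also shows a conditional variant, an alternative that rescales the curve by the probability of having survived to the current age; that variant is an addition for comparison, not the definition used in the text.

```python
def median_lifetime(times, survival, threshold=0.5):
    """First time at which the survival probability drops to the threshold."""
    for t, s in zip(times, survival):
        if s <= threshold:
            return t
    return float("inf")  # curve never reaches the threshold


def rul_unconditional(times, survival, current_age):
    """Median lifetime minus current age, floored at zero."""
    return max(0.0, median_lifetime(times, survival) - current_age)


def rul_conditional(times, survival, current_age, threshold=0.5):
    """Years until S(t) / S(current_age) first drops to the threshold."""
    s_now = 1.0
    for t, s in zip(times, survival):
        if t <= current_age:
            s_now = s  # last curve value at or before the current age
    if s_now <= 0.0:
        return 0.0
    for t, s in zip(times, survival):
        if t > current_age and s / s_now <= threshold:
            return t - current_age
    return float("inf")


times = [0, 10, 20, 30, 40, 50]              # years since installation
survival = [1.0, 0.95, 0.8, 0.6, 0.45, 0.2]  # toy predicted curve

for age in (35, 45):
    print(age, rul_unconditional(times, survival, age),
          rul_conditional(times, survival, age))
```

For the 45-year-old unit the unconditional RUL is already zero, while the conditional version still reports 5 years because it accounts for the unit having survived this long.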
Build a Replacement Priority Ranking
Survival analysis tells you when a transformer is likely to fail. But not all failures have equal consequences. A transformer serving 200 customers matters more than one serving 10. Combining failure risk with consequence produces an actionable replacement schedule.
Tuning the weights: The 60/40 split between failure likelihood and consequence is a common approach in utility asset management, but the specific weights are illustrative. In practice, these weights should be calibrated by the utility’s engineering and planning teams based on their risk tolerance, regulatory framework, and capital budget constraints. Some utilities weight consequence more heavily to protect large commercial customers or critical facilities. Others use a pure risk-based approach (probability × consequence). The right weights depend on your utility’s specific risk framework—there is no universal standard.
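A weighted score of this kind might be computed as follows. The sketch assumes min-max normalization of both inputs and uses customers served as the only consequence measure; the function names, the three toy transformers, and the normalization choice are illustrative assumptions, while the 60/40 weights come from the text.

```python
def min_max(values):
    """Rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def priority_scores(failure_prob, customers, w_risk=0.6, w_consequence=0.4):
    """Weighted sum of normalized failure likelihood and consequence."""
    risk = min_max(failure_prob)
    consequence = min_max(customers)
    return [w_risk * r + w_consequence * c for r, c in zip(risk, consequence)]


failure_prob = [0.10, 0.50, 0.90]  # toy near-term failure probabilities
customers = [200, 10, 105]         # toy customers served per transformer

scores = priority_scores(failure_prob, customers)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print("weighted scores:", [round(s, 2) for s in scores])
print("replacement order (indices):", ranking)

# Alternative mentioned in the text: pure risk = probability x consequence.
pure_risk = [p * c for p, c in zip(failure_prob, customers)]
print("pure risk:", pure_risk)
```

Passing different values for w_risk and w_consequence is enough to test how sensitive the ranking is to the weighting.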
Weather-Adjusted Seasonal Risk
Transformer failure risk is not constant throughout the year. Extreme heat accelerates oil degradation, and storm seasons increase mechanical stress. Let's create a seasonal risk adjustment using the weather data.
Practical application: If summer is the peak risk season (multiplier of 1.4x), schedule replacements of high-priority transformers in spring. That way aging assets are not operating through the most stressful period, and the replacement work itself does not require a construction outage during peak demand.
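One plausible way to derive such a multiplier, assumed here for illustration, is to divide each season's count of extreme-weather hours by the average across seasons. The hour counts below are invented so that summer comes out at 1.4x; the real values would come from the heatwave and storm flags in the weather data.

```python
def seasonal_multipliers(extreme_hours_by_season):
    """Each season's extreme-weather hours relative to the seasonal average."""
    mean_hours = sum(extreme_hours_by_season.values()) / len(extreme_hours_by_season)
    return {season: hours / mean_hours
            for season, hours in extreme_hours_by_season.items()}


def seasonally_adjusted(base_score, season, multipliers):
    """Scale a transformer's base risk score by the season's multiplier."""
    return base_score * multipliers[season]


# Toy counts of flagged hours per season (not taken from weather_data.csv).
extreme_hours = {"winter": 60, "spring": 80, "summer": 140, "autumn": 120}
multipliers = seasonal_multipliers(extreme_hours)

for season, m in multipliers.items():
    print(f"{season}: {m:.2f}x")
print("summer-adjusted score:", seasonally_adjusted(0.5, "summer", multipliers))
```

Because the multipliers average to 1.0, the adjustment shifts risk between seasons without inflating the annual total.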
Model Persistence and Feature Notes
Feature engineering rationale: We used 6 covariates with clear asset management interpretations. age_years captures cumulative wear. kva_rating proxies for equipment size and consequence of failure. feeder_failure_count and avg_outage_hours are derived from outage history and indicate feeders with historically problematic assets. avg_extreme_weather_hrs captures environmental stress exposure over the transformer’s lifetime.
What You Built and Next Steps
- Prepared survival data with time-to-event durations and censoring indicators
- Fitted Kaplan–Meier curves to visualize fleet-wide and group-level survival
- Used log-rank tests to determine statistically significant survival differences
- Built a Cox Proportional Hazards model with asset and environmental covariates
- Interpreted hazard ratios to understand which factors accelerate failure
- Predicted individual survival curves and remaining useful life estimates
- Created a risk-weighted replacement priority schedule with consequence scoring
- Adjusted priorities for seasonal weather risk to time capital projects
Ideas to Try Next
- Time-varying covariates: Use lifelines.CoxTimeVaryingFitter to model how changing conditions (degrading health index over time) affect hazard
- Accelerated Failure Time models: Try lifelines.WeibullAFTFitter or LogNormalAFTFitter for parametric alternatives to Cox PH
- Extend to other assets: Apply the same survival framework to feeders (load_feeders()) or network edges (load_network_edges()) with conductor age and type
- Budget optimization: Use the priority ranking with replacement cost estimates to maximize risk reduction per dollar of capital spending
- Concordance index tuning: Evaluate model discrimination with cph.concordance_index_ and tune the penalizer and feature set
Key Terms Glossary
- Survival analysis — a statistical framework for modeling time-to-event data, accounting for censored observations
- Censoring — when the event of interest (failure) has not yet occurred for a subject; the true failure time is unknown but at least as long as the observed time
- Kaplan–Meier estimator — a non-parametric method for estimating the survival function from time-to-event data
- Log-rank test — a statistical test comparing survival distributions between two or more groups
- Cox Proportional Hazards — a semi-parametric regression model that estimates how covariates affect the hazard (failure rate)
- Hazard ratio — the multiplicative effect of a one-unit change in a covariate on the failure rate; HR > 1 means increased risk
- Remaining useful life (RUL) — the estimated time remaining before an asset reaches its predicted median failure point
- Consequence scoring — weighting failure risk by the impact of that failure (affected_customers, kva_rating lost)