Prerequisite: Complete Guide 04: Predictive Asset Maintenance first. This guide extends the binary "will it fail?" classifier into a survival model that answers "when will it fail?" and estimates remaining useful life.
What You Will Learn
In Guide 04 you built an XGBoost classifier that labels each transformer as "likely to fail" or "not." That is useful for ranking, but it throws away a crucial dimension: time. Survival analysis models the full time-to-failure distribution, which lets you estimate when an asset will fail—not just whether it will. In this guide you will:
- Prepare survival data with time-to-event and censoring indicators
- Fit Kaplan–Meier survival curves and compare groups of transformers
- Run log-rank tests to determine whether survival differences are statistically significant
- Build a Cox Proportional Hazards regression model with asset and environmental covariates
- Interpret hazard ratios to understand which factors accelerate failure
- Predict individual survival curves and remaining useful life for every transformer
- Create a risk-weighted replacement priority schedule that accounts for consequence
Why survival analysis? Traditional classification discards information about transformers that are still running. Survival analysis treats these as "censored" observations—we know they survived at least this long, even if we don't know when they will eventually fail. This makes much better use of your data, especially when failures are rare.
SP&L Data You Will Use
- assets/transformers.csv — 86 transformers with kVA rating, installation year, manufacturer, type, and health index (1–5)
- assets/maintenance_log.csv — inspection dates, work orders, and replacement records
- outages/outage_events.csv — historical outage events linked to equipment failures
- weather/hourly_observations.csv — weather exposure history for environmental stress analysis
Additional Libraries
pip install lifelines
We start by loading the same datasets from Guide 04, plus weather data that we will use later to build environmental covariates for the Cox model.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from lifelines import KaplanMeierFitter, CoxPHFitter
from lifelines.statistics import logrank_test
from lifelines.utils import median_survival_times
DATA_DIR = "sisyphean-power-and-light/"
transformers = pd.read_csv(DATA_DIR + "assets/transformers.csv")
maintenance = pd.read_csv(DATA_DIR + "assets/maintenance_log.csv",
parse_dates=["inspection_date"])
outages = pd.read_csv(DATA_DIR + "outages/outage_events.csv",
parse_dates=["fault_detected"])
weather = pd.read_csv(DATA_DIR + "weather/hourly_observations.csv",
parse_dates=["timestamp"])
print(f"Transformers: {len(transformers)}")
print(f"Maintenance logs: {len(maintenance)}")
print(f"Outage events: {len(outages)}")
print(f"Weather records: {len(weather):,}")
Transformers: 86
Maintenance logs: 1,247
Outage events: 3,247
Weather records: 43,824
Before we write more code, let's understand why survival analysis is different from the binary classifier you built in Guide 04.
In Guide 04 you asked: "Will this transformer fail?" The answer was 0 or 1. But consider two transformers that have never failed:
- Transformer A: Installed in 2020 (5 years old, no failure)
- Transformer B: Installed in 1985 (40 years old, no failure)
A binary classifier treats both the same: has_failed = 0. But Transformer B surviving 40 years is much more informative than Transformer A surviving 5 years. Survival analysis captures this through a concept called censoring.
What is censoring? A transformer is "right-censored" if it has not failed by the end of our observation period. We know it survived at least this long, but we don't know its true failure time. Instead of throwing away this data (as binary classification effectively does), survival analysis uses it to estimate the survival function more accurately.
Survival analysis gives us three outputs that classification cannot:
- Survival curve: The probability of surviving beyond any given age
- Hazard function: The instantaneous failure rate at a given age
- Remaining useful life: A per-asset estimate of expected time until failure
Survival analysis requires two columns for each subject: the duration (time from installation to failure or current date) and an event indicator (1 = failure observed, 0 = still running / censored).
equip_failures = outages[outages["cause_code"] == "equipment_failure"]
first_failure = equip_failures.groupby("transformer_id").agg(
first_failure_date=("fault_detected", "min")
).reset_index()
df = transformers.merge(first_failure, on="transformer_id", how="left")
df["event_observed"] = df["first_failure_date"].notna().astype(int)
observation_end = pd.Timestamp("2025-01-01")
df["install_date"] = pd.to_datetime(df["install_year"], format="%Y")
df["duration_years"] = np.where(
df["event_observed"] == 1,
(df["first_failure_date"] - df["install_date"]).dt.days / 365.25,
(observation_end - df["install_date"]).dt.days / 365.25
)
df["duration_years"] = df["duration_years"].clip(lower=0.5)
print(f"Total transformers: {len(df)}")
print(f"Failed (event=1): {df['event_observed'].sum()}")
print(f"Censored (event=0): {(df['event_observed'] == 0).sum()}")
print(f"\nDuration range: {df['duration_years'].min():.1f} – {df['duration_years'].max():.1f} years")
Total transformers: 86
Failed (event=1): 23
Censored (event=0): 63
Duration range: 0.5 – 42.0 years
Important: Using only the first failure per transformer is a simplification. In reality, a transformer can be repaired and fail again. More advanced models (recurrent event models) handle this, but the single-event approach is the right starting point and is standard practice in asset management.
Sample size consideration: We have 86 transformers with only 23 observed failures. This is a small dataset for survival analysis. The standard rule of thumb for Cox PH is 10–15 events per covariate; with 6 covariates, you would ideally want 60–90 failure events. With 23 events, the hazard ratio estimates will have wide confidence intervals, and some effects may not reach statistical significance even if they are real. The small ridge penalty we add when fitting the Cox model below (penalizer=0.01) helps stabilize the estimates, but treat the specific numerical results with appropriate caution. To strengthen this analysis, you could expand the dataset by including transformers from neighboring utilities, extending the observation window, or incorporating failure records from additional asset types (e.g., reclosers, regulators) with similar failure mechanisms.
The Kaplan–Meier estimator is the foundation of survival analysis. It produces a non-parametric estimate of the survival function: the probability of surviving beyond a given time. No assumptions about the shape of the curve are needed.
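Formally, if $t_1 < t_2 < \dots$ are the distinct observed failure ages, $d_i$ is the number of failures at age $t_i$, and $n_i$ is the number of transformers still at risk just before $t_i$, the Kaplan–Meier estimate is

$$\hat{S}(t) = \prod_{t_i \le t} \left(1 - \frac{d_i}{n_i}\right)$$

Censored transformers never contribute to any $d_i$, but they remain in the risk sets $n_i$ up to their censoring age, which is exactly how the estimator extracts information from assets that have not yet failed.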
kmf = KaplanMeierFitter()
kmf.fit(
durations=df["duration_years"],
event_observed=df["event_observed"],
label="All Transformers"
)
fig, ax = plt.subplots(figsize=(10, 6))
kmf.plot_survival_function(ax=ax, color="#2D6A7A", linewidth=2)
ax.set_title("Kaplan–Meier Survival Curve: SP&L Transformer Fleet")
ax.set_xlabel("Years Since Installation")
ax.set_ylabel("Survival Probability")
ax.set_ylim(0, 1.05)
ax.axhline(y=0.5, color="gray", linestyle="--", alpha=0.5, label="50% survival")
ax.legend()
plt.tight_layout()
plt.show()
median = kmf.median_survival_time_
print(f"Median survival time: {median:.1f} years")
The curve starts at 1.0 (all transformers alive at year 0) and drops as failures occur. The steps down represent observed failures; the flat segments between steps include censored observations. The shaded band shows the 95% confidence interval.
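You can also read point estimates off the fitted curve programmatically. A quick check using lifelines' survival_function_at_times (the ages chosen here are illustrative):

# Probability that a transformer survives past selected ages
for age in [10, 20, 30, 40]:
    p = kmf.survival_function_at_times(age).iloc[0]
    print(f"P(survive past {age} years) = {p:.2f}")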
Compare Survival by Manufacturer
Do transformers from different manufacturers fail at different rates? Plot separate Kaplan–Meier curves to find out.
fig, ax = plt.subplots(figsize=(10, 6))
colors = ["#2D6A7A", "#5FCCDB", "#D69E2E", "#E53E3E"]
for i, (mfr, group) in enumerate(df.groupby("manufacturer")):
kmf_mfr = KaplanMeierFitter()
kmf_mfr.fit(
durations=group["duration_years"],
event_observed=group["event_observed"],
label=mfr
)
kmf_mfr.plot_survival_function(
ax=ax, color=colors[i % len(colors)], linewidth=2
)
ax.set_title("Survival Curves by Manufacturer")
ax.set_xlabel("Years Since Installation")
ax.set_ylabel("Survival Probability")
ax.set_ylim(0, 1.05)
ax.legend(title="Manufacturer")
plt.tight_layout()
plt.show()
Compare Survival by kVA Rating Group
df["kva_group"] = pd.cut(
df["kva_rating"],
bins=[0, 50, 100, 500, 5000],
labels=["Small (≤50)", "Medium (51-100)",
"Large (101-500)", "Very Large (500+)"]
)
fig, ax = plt.subplots(figsize=(10, 6))
for i, (group_name, group) in enumerate(df.groupby("kva_group")):
if len(group) < 3:
continue
kmf_kva = KaplanMeierFitter()
kmf_kva.fit(
durations=group["duration_years"],
event_observed=group["event_observed"],
label=group_name
)
kmf_kva.plot_survival_function(
ax=ax, color=colors[i % len(colors)], linewidth=2
)
ax.set_title("Survival Curves by kVA Rating Group")
ax.set_xlabel("Years Since Installation")
ax.set_ylabel("Survival Probability")
ax.set_ylim(0, 1.05)
ax.legend(title="kVA Group")
plt.tight_layout()
plt.show()
Eyeballing curves is useful, but the log-rank test gives a formal statistical answer to the question: "Do these two groups have significantly different survival experiences?"
manufacturers = df["manufacturer"].unique()
mfr_a = df[df["manufacturer"] == manufacturers[0]]
mfr_b = df[df["manufacturer"] == manufacturers[1]]
result = logrank_test(
mfr_a["duration_years"], mfr_b["duration_years"],
event_observed_A=mfr_a["event_observed"],
event_observed_B=mfr_b["event_observed"]
)
print(f"Log-rank test: {manufacturers[0]} vs {manufacturers[1]}")
print(f" Test statistic: {result.test_statistic:.3f}")
print(f" p-value: {result.p_value:.4f}")
if result.p_value < 0.05:
print(" → Statistically significant difference (p < 0.05)")
else:
print(" → No statistically significant difference (p ≥ 0.05)")
from itertools import combinations
print("Pairwise log-rank tests (all manufacturers):")
print(f"{'Group A':<20} {'Group B':<20} {'p-value':>10} {'Significant?':>14}")
print("-" * 66)
for a, b in combinations(manufacturers, 2):
grp_a = df[df["manufacturer"] == a]
grp_b = df[df["manufacturer"] == b]
res = logrank_test(
grp_a["duration_years"], grp_b["duration_years"],
event_observed_A=grp_a["event_observed"],
event_observed_B=grp_b["event_observed"]
)
sig = "Yes" if res.p_value < 0.05 else "No"
print(f"{a:<20} {b:<20} {res.p_value:>10.4f} {sig:>14}")
Interpreting the log-rank test: A p-value below 0.05 means that, if the two groups truly had identical survival distributions, a difference at least this extreme would arise by chance less than 5% of the time. This helps you decide whether manufacturer selection (or feeder assignment, or kVA rating) genuinely affects transformer longevity, or whether apparent differences are just noise in a small dataset. One caution: running every pairwise test inflates the chance of a false positive; with k manufacturers there are k(k−1)/2 comparisons, so consider a stricter threshold (Bonferroni would use 0.05 divided by the number of comparisons) or a single omnibus test, as shown below.
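The omnibus version tests all manufacturers at once with a single p-value. A minimal sketch using lifelines' multivariate_logrank_test:

from lifelines.statistics import multivariate_logrank_test

# One test: do ANY of the manufacturer groups differ in survival?
omnibus = multivariate_logrank_test(
    df["duration_years"], df["manufacturer"], df["event_observed"]
)
print(f"Omnibus log-rank p-value: {omnibus.p_value:.4f}")

If the omnibus test is not significant, be skeptical of any single pairwise result that is.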
Kaplan–Meier curves compare groups, but they don't handle multiple continuous covariates simultaneously. The Cox Proportional Hazards (PH) model is the regression equivalent for survival data. It estimates how each covariate affects the hazard (failure rate) while controlling for the others.
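Concretely, the model assumes each transformer's hazard is a shared baseline hazard scaled up or down by its covariates:

$$h(t \mid x) = h_0(t)\,\exp(\beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p)$$

Here $h_0(t)$ is the baseline hazard, which is left unspecified (that is what makes the model semi-parametric), and $e^{\beta_j}$ is the hazard ratio for covariate $j$. The "proportional" in the name is the key assumption: each covariate multiplies the hazard by a constant factor at every age.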
maint_feats = maintenance.groupby("transformer_id").agg(
work_orders=("work_order_id", "count"),
last_inspection=("inspection_date", "max")
).reset_index()
maint_feats["days_since_inspection"] = (
pd.Timestamp("2025-01-01") - maint_feats["last_inspection"]
).dt.days
df = df.merge(
maint_feats[["transformer_id", "work_orders", "days_since_inspection"]],
on="transformer_id", how="left"
)
df["work_orders"] = df["work_orders"].fillna(0)
df["days_since_inspection"] = df["days_since_inspection"].fillna(9999)
weather["extreme_day"] = (
(weather["wind_speed"] > 35) |
(weather["temperature"] > 100) |
(weather["temperature"] < 10)
).astype(int)
weather["year"] = weather["timestamp"].dt.year
annual_extreme = weather.groupby("year")["extreme_day"].sum()
def compute_lifetime_exposure(row):
"""Sum extreme weather days from install year to present."""
years_active = annual_extreme[
(annual_extreme.index >= row["install_year"]) &
(annual_extreme.index <= 2024)
]
return years_active.sum()
df["lifetime_extreme_days"] = df.apply(compute_lifetime_exposure, axis=1)
df["avg_extreme_weather_days"] = df["lifetime_extreme_days"] / df["duration_years"].clip(lower=1)
df["type_code"] = df["type"].map({"oil": 0, "dry": 1}).fillna(0)
print("Covariates prepared.")
print(df[["transformer_id", "health_index", "kva_rating",
"work_orders", "days_since_inspection",
"duration_years", "event_observed"]].head())
cox_cols = [
"duration_years", "event_observed",
"health_index", "kva_rating", "type_code",
"work_orders", "days_since_inspection",
"avg_extreme_weather_days"
]
cox_df = df[cox_cols].dropna()
cox_df["kva_rating_scaled"] = cox_df["kva_rating"] / 100
cox_df["days_since_insp_scaled"] = cox_df["days_since_inspection"] / 365
cox_df = cox_df.drop(columns=["kva_rating", "days_since_inspection"])
cph = CoxPHFitter(penalizer=0.01)
cph.fit(
cox_df,
duration_col="duration_years",
event_col="event_observed"
)
cph.print_summary()
print(f"\nConcordance index: {cph.concordance_index_:.3f}")
Concordance index: This is the survival analysis equivalent of AUC. It measures the model’s ability to correctly rank pairs of transformers by their failure time. A concordance of 0.5 means the model is no better than random guessing; 1.0 means perfect discrimination. Values above 0.65 are considered useful for clinical/engineering applications.
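Because that concordance is computed on the same 86 rows the model was fit on, it is optimistic. A quick out-of-sample check is worth running; this is a sketch, and the scores will vary with the random fold assignment:

from lifelines.utils import k_fold_cross_validation

# 5-fold cross-validated concordance; each fold refits the model
cv_scores = k_fold_cross_validation(
    CoxPHFitter(penalizer=0.01), cox_df,
    duration_col="duration_years", event_col="event_observed",
    k=5, scoring_method="concordance_index"
)
print(f"CV concordance: {np.mean(cv_scores):.3f} ± {np.std(cv_scores):.3f}")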
Why a penalizer? With only 86 transformers and several covariates, we add a small L2 penalty (penalizer=0.01) to prevent overfitting. This is equivalent to ridge regularization and stabilizes coefficient estimates when the sample size is modest relative to the number of features.
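The proportional-hazards assumption itself is testable. lifelines ships a diagnostic based on scaled Schoenfeld residuals; with only 23 events the test has limited power, so treat it as a sanity check rather than proof:

# Tests each covariate for non-proportional hazards and prints advice
# if any appears to violate the assumption
cph.check_assumptions(cox_df, p_value_threshold=0.05)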
The Cox model output includes hazard ratios for each covariate. A hazard ratio greater than 1 means the factor increases failure risk; less than 1 means it is protective.
summary = cph.summary
fig, ax = plt.subplots(figsize=(9, 5))
hr = summary["exp(coef)"]
hr_lower = summary["exp(coef) lower 95%"]
hr_upper = summary["exp(coef) upper 95%"]
y_pos = range(len(hr))
ax.barh(y_pos, hr.values, color="#5FCCDB", height=0.6, alpha=0.8)
ax.errorbar(hr.values, y_pos,
xerr=[hr.values - hr_lower.values, hr_upper.values - hr.values],
fmt="none", color="#2D6A7A", capsize=4)
ax.axvline(x=1.0, color="red", linestyle="--", alpha=0.7, label="HR = 1 (no effect)")
ax.set_yticks(y_pos)
ax.set_yticklabels(hr.index)
ax.set_xlabel("Hazard Ratio (95% CI)")
ax.set_title("Cox PH Hazard Ratios: What Accelerates Transformer Failure?")
ax.legend()
plt.tight_layout()
plt.show()
print("\nHazard ratio interpretation:")
for covariate in hr.index:
ratio = hr[covariate]
p_val = summary.loc[covariate, "p"]
if ratio > 1:
direction = "increases"
pct = (ratio - 1) * 100
else:
direction = "decreases"
pct = (1 - ratio) * 100
sig = "*" if p_val < 0.05 else ""
print(f" {covariate}: HR={ratio:.3f} → {direction} hazard by {pct:.1f}% {sig}")
Reading hazard ratios: If health_index has an HR of 0.65, it means each one-point increase in health index reduces the failure hazard by 35%. Conversely, if days_since_insp_scaled has an HR of 1.42, each additional year without inspection increases hazard by 42%. These are powerful, interpretable insights for asset managers.
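To see a hazard ratio as curves rather than a single number, you can plot the model's predicted survival at several values of one covariate while holding the others at their baseline. A sketch using lifelines' plot_partial_effects_on_outcome:

# Predicted survival for health index 1-5, other covariates at baseline
fig, ax = plt.subplots(figsize=(10, 6))
cph.plot_partial_effects_on_outcome(
    covariates="health_index", values=[1, 2, 3, 4, 5], ax=ax
)
ax.set_xlabel("Years Since Installation")
ax.set_ylabel("Survival Probability")
plt.tight_layout()
plt.show()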
The Cox model can produce a personalized survival curve for each transformer, conditioned on its specific covariates. This gives you remaining useful life (RUL) estimates for every asset in the fleet.
surv_funcs = cph.predict_survival_function(cox_df)
sample_ids = df["transformer_id"].iloc[:5].values
fig, ax = plt.subplots(figsize=(10, 6))
for i, tid in enumerate(sample_ids):
idx = df[df["transformer_id"] == tid].index[0]
if idx in surv_funcs.columns:
ax.plot(surv_funcs.index, surv_funcs[idx],
label=f"{tid} (age={df.loc[idx, 'duration_years']:.0f}yr, HI={df.loc[idx, 'health_index']})",
linewidth=2)
ax.set_title("Individual Predicted Survival Curves")
ax.set_xlabel("Years Since Installation")
ax.set_ylabel("Survival Probability")
ax.set_ylim(0, 1.05)
ax.legend(title="Transformer", fontsize=9)
plt.tight_layout()
plt.show()
# predict_median is indexed like cox_df (which may have dropped rows with
# missing covariates), so align on the index rather than assigning by position.
# It can also return inf when a predicted curve never drops below 50%.
median_survival = cph.predict_median(cox_df)
df["predicted_median_life"] = median_survival.reindex(df.index).replace(np.inf, np.nan)
df["remaining_useful_life"] = df["predicted_median_life"] - df["duration_years"]
df["remaining_useful_life"] = df["remaining_useful_life"].clip(lower=0)
print("Remaining Useful Life Estimates (top 15 most urgent):")
rul_table = df.sort_values("remaining_useful_life")[[
"transformer_id", "duration_years", "health_index",
"predicted_median_life", "remaining_useful_life"
]].head(15)
print(rul_table.to_string(index=False, float_format="%.1f"))
What is remaining useful life? RUL estimates how many years a transformer has left before it reaches a 50% failure probability. A transformer with RUL = 2.3 years is expected to reach its median lifetime in about 2.3 years. Transformers with RUL near zero have already exceeded their predicted median and are operating on borrowed time.
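One subtlety: the subtraction above compares current age against an unconditional median, which ignores that each transformer has already survived to its current age. lifelines can condition predictions on time already survived via the conditional_after argument; per the lifelines documentation, the returned timeline is then measured from "now", so the conditional median is an RUL estimate directly. A sketch (the rul_conditional column name is ours, not part of the SP&L schema):

# Median remaining life, conditioned on each asset's age so far.
# conditional_after resets the clock: t=0 in the output means "now".
rul_conditional = cph.predict_median(
    cox_df, conditional_after=cox_df["duration_years"]
)
df.loc[cox_df.index, "rul_conditional"] = rul_conditional.replace(np.inf, np.nan)

Conditioning typically lengthens the estimates for old survivors: a 40-year-old transformer that is still running has already proven it is not one of the early failures.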
Survival analysis tells you when a transformer is likely to fail. But not all failures have equal consequences. A transformer serving 200 customers matters more than one serving 10. Combining failure risk with consequence produces an actionable replacement schedule.
df["hazard_score"] = cph.predict_partial_hazard(cox_df).values
df["hazard_norm"] = (
(df["hazard_score"] - df["hazard_score"].min()) /
(df["hazard_score"].max() - df["hazard_score"].min())
)
df["consequence_norm"] = (
(df["kva_rating"] - df["kva_rating"].min()) /
(df["kva_rating"].max() - df["kva_rating"].min())
)
df["priority_score"] = df["hazard_norm"] * 0.6 + df["consequence_norm"] * 0.4
priority_list = df.sort_values("priority_score", ascending=False)
print("Top 10 Replacement Priorities:")
print(priority_list[[
"transformer_id", "duration_years", "health_index",
"kva_rating", "remaining_useful_life", "priority_score"
]].head(10).to_string(index=False, float_format="%.2f"))
fig, ax = plt.subplots(figsize=(10, 7))
scatter = ax.scatter(
df["hazard_norm"], df["consequence_norm"],
c=df["priority_score"], cmap="YlOrRd",
s=80, edgecolors="white", linewidth=0.5, alpha=0.9
)
plt.colorbar(scatter, label="Priority Score")
top5 = priority_list.head(5)
for _, row in top5.iterrows():
ax.annotate(row["transformer_id"],
(row["hazard_norm"], row["consequence_norm"]),
fontsize=8, fontweight="bold",
xytext=(5, 5), textcoords="offset points")
ax.set_xlabel("Failure Likelihood (normalized hazard)")
ax.set_ylabel("Consequence (normalized kVA)")
ax.set_title("Replacement Priority Matrix: Risk × Consequence")
ax.axvline(x=0.5, color="gray", linestyle="--", alpha=0.3)
ax.axhline(y=0.5, color="gray", linestyle="--", alpha=0.3)
ax.text(0.75, 0.75, "REPLACE NOW", ha="center", fontsize=11,
fontweight="bold", color="#E53E3E", alpha=0.7)
ax.text(0.25, 0.75, "MONITOR", ha="center", fontsize=11,
fontweight="bold", color="#D69E2E", alpha=0.7)
ax.text(0.75, 0.25, "SCHEDULE", ha="center", fontsize=11,
fontweight="bold", color="#D69E2E", alpha=0.7)
ax.text(0.25, 0.25, "LOW PRIORITY", ha="center", fontsize=11,
fontweight="bold", color="#718096", alpha=0.7)
plt.tight_layout()
plt.show()
Tuning the weights: The 60/40 split between failure likelihood and consequence is a common approach in utility asset management, but the specific weights are illustrative. In practice, these weights should be calibrated by the utility’s engineering and planning teams based on their risk tolerance, regulatory framework, and capital budget constraints. Some utilities weight consequence more heavily to protect large commercial customers or critical facilities. Others use a pure risk-based approach (probability × consequence). The right weights depend on your utility’s specific risk framework—there is no universal standard.
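For reference, the pure risk-based alternative mentioned above (probability × consequence) could look like the sketch below, which uses each asset's one-year conditional failure probability. The one-year horizon and the risk_score column name are illustrative choices, not SP&L standards:

# P(failure within 1 year | survived to current age) × normalized consequence
horizon = 1.0
surv_1yr = cph.predict_survival_function(
    cox_df, times=[horizon], conditional_after=cox_df["duration_years"]
)
pof_1yr = 1.0 - surv_1yr.iloc[0].values
df.loc[cox_df.index, "risk_score"] = (
    pof_1yr * df.loc[cox_df.index, "consequence_norm"]
)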
Transformer failure risk is not constant throughout the year. Extreme heat accelerates oil degradation, and storm seasons increase mechanical stress. Let's create a seasonal risk adjustment using the weather data.
weather["month"] = weather["timestamp"].dt.month
monthly_extreme = weather.groupby("month")["extreme_day"].mean()
seasonal_multiplier = monthly_extreme / monthly_extreme.mean()
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
bar_colors = ["#E53E3E" if v > 1.2 else "#D69E2E" if v > 1.0
else "#5FCCDB" for v in seasonal_multiplier.values]
axes[0].bar(months, seasonal_multiplier.values, color=bar_colors)
axes[0].axhline(y=1.0, color="gray", linestyle="--", alpha=0.5)
axes[0].set_title("Seasonal Risk Multiplier")
axes[0].set_ylabel("Risk Multiplier (1.0 = average)")
axes[0].tick_params(axis="x", rotation=45)
peak_month = seasonal_multiplier.idxmax()
peak_mult = seasonal_multiplier.max()
top10 = priority_list.head(10).copy()
top10["adjusted_score"] = top10["priority_score"] * peak_mult
y_pos = range(len(top10))
axes[1].barh(y_pos, top10["priority_score"], height=0.4,
color="#5FCCDB", label="Base Priority")
axes[1].barh([y + 0.4 for y in y_pos], top10["adjusted_score"],
height=0.4, color="#E53E3E", label=f"Peak Month (×{peak_mult:.2f})")
axes[1].set_yticks([y + 0.2 for y in y_pos])
axes[1].set_yticklabels(top10["transformer_id"].values)
axes[1].set_xlabel("Priority Score")
axes[1].set_title("Top 10: Base vs Peak-Season Priority")
axes[1].legend()
plt.tight_layout()
plt.show()
print(f"\nPeak risk month: {months[peak_month - 1]} (multiplier: {peak_mult:.2f}×)")
print("Use seasonal multipliers to time capital replacement projects")
print("before high-risk seasons, not after.")
Practical application: If summer is the peak risk season (multiplier of 1.4x), schedule replacements of high-priority transformers in spring. This avoids the double risk of operating aging assets during the most stressful period while also managing the risk of a construction outage during peak demand.
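If you want more than a single peak-month comparison, the same multipliers can be crossed with the priority list to sketch a month-by-month replacement calendar (an illustrative construction, not an SP&L planning standard):

# Seasonally adjusted priority for each top asset in each month
calendar = pd.DataFrame(
    np.outer(top10["priority_score"], seasonal_multiplier.values),
    index=top10["transformer_id"], columns=months
)
print(calendar.round(2).to_string())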
Model Persistence and Feature Notes
import pickle

# Persist the fitted Cox model so downstream scoring jobs can reuse it
# without refitting
with open("cox_model.pkl", "wb") as f:
    pickle.dump(cph, f)

# Reload later, e.g., in a scheduled scoring pipeline
with open("cox_model.pkl", "rb") as f:
    cph = pickle.load(f)
Feature engineering rationale: We used 6 covariates with clear asset management interpretations. The health_index (1–5) is a composite condition score from inspections. kva_rating proxies for both equipment size and consequence of failure. days_since_inspection captures maintenance recency—transformers that have not been inspected recently may have undetected degradation.
What You Accomplished
- Prepared survival data with time-to-event durations and censoring indicators
- Fitted Kaplan–Meier curves to visualize fleet-wide and group-level survival
- Used log-rank tests to determine statistically significant survival differences
- Built a Cox Proportional Hazards model with asset and environmental covariates
- Interpreted hazard ratios to understand which factors accelerate failure
- Predicted individual survival curves and remaining useful life estimates
- Created a risk-weighted replacement priority schedule with consequence scoring
- Adjusted priorities for seasonal weather risk to time capital projects
Ideas to Try Next
- Time-varying covariates: Use lifelines.CoxTimeVaryingFitter to model how changing conditions (degrading health index over time) affect hazard
- Accelerated Failure Time models: Try lifelines.WeibullAFTFitter or LogNormalAFTFitter for parametric alternatives to Cox PH
- Extend to other assets: Apply the same survival framework to poles (assets/poles.csv) and conductors (assets/conductors.csv)
- Budget optimization: Use the priority ranking with replacement cost estimates to maximize risk reduction per dollar of capital spending
- Concordance index tuning: Evaluate model discrimination with cph.concordance_index_ and tune the penalizer and feature set
Key Terms Glossary
- Survival analysis — a statistical framework for modeling time-to-event data, accounting for censored observations
- Censoring — when the event of interest (failure) has not yet occurred for a subject; the true failure time is unknown but at least as long as the observed time
- Kaplan–Meier estimator — a non-parametric method for estimating the survival function from time-to-event data
- Log-rank test — a statistical test comparing survival distributions between two or more groups
- Cox Proportional Hazards — a semi-parametric regression model that estimates how covariates affect the hazard (failure rate)
- Hazard ratio — the multiplicative effect of a one-unit change in a covariate on the failure rate; HR > 1 means increased risk
- Remaining useful life (RUL) — the estimated time remaining before an asset reaches its predicted median failure point
- Consequence scoring — weighting failure risk by the impact of that failure (customers affected, kVA lost)