Guide 04

Predictive Asset Maintenance with XGBoost

Prefer not to install anything? Click the badge above to open this guide as a runnable notebook in Google Colab. Sign in with any Google account, then use Runtime → Run all to execute every cell, or step through them one at a time.

What You Will Learn

Utilities spend billions of dollars maintaining and replacing aging infrastructure. Instead of replacing equipment on a fixed schedule (time-based maintenance), predictive maintenance uses data to identify which assets are most likely to fail soon—so crews can prioritize the right work. In this guide you will:

  • Load transformer and outage data from the SP&L dataset
  • Engineer features from asset age, kVA rating, and outage history
  • Train an XGBoost classifier to predict transformer failure risk
  • Evaluate your model and generate a risk-ranked asset list
  • Visualize which factors contribute most to failure risk

What is XGBoost? XGBoost (eXtreme Gradient Boosting) is an optimized version of Gradient Boosting that trains faster and often produces more accurate results. It is one of the most widely used ML algorithms in industry and dominates tabular data competitions on Kaggle.
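To get a feel for the fit/predict workflow before the full guide below, here is a minimal gradient-boosting sketch on made-up synthetic data. It uses scikit-learn's GradientBoostingClassifier as a stand-in so it runs even before you install xgboost; XGBoost's XGBClassifier exposes a nearly identical interface. The feature names and coefficients are invented for illustration.

```python
# Minimal gradient-boosting sketch on synthetic data.
# XGBClassifier follows the same fit/predict pattern, just faster
# and with extra regularization options.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 500
age = rng.uniform(0, 50, n)        # hypothetical asset ages (years)
load = rng.uniform(0.2, 1.2, n)    # hypothetical loading ratio
# Failure is more likely for old, heavily loaded assets
prob = 1 / (1 + np.exp(-(0.08 * age + 2.0 * load - 5.0)))
y = (rng.uniform(size=n) < prob).astype(int)
X = np.column_stack([age, load])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=0)
clf.fit(X_tr, y_tr)
print(f"Test accuracy: {clf.score(X_te, y_te):.2f}")
```

Swapping `GradientBoostingClassifier` for `XGBClassifier` in a sketch like this is usually a one-line change.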

SP&L Data You Will Use

  • transformers.csv (load_transformers()) — ~21,000 transformers with kVA rating, age_years, manufacturer, phase, and status
  • outage_history.csv (load_outage_history()) — outage events linked to equipment failures with cause, duration, and customers affected
  • weather_data.csv (load_weather_data()) — weather exposure data for environmental stress analysis

Additional Libraries

pip install xgboost

Which terminal should I use? On Windows, open Anaconda Prompt from the Start Menu (or PowerShell / Command Prompt if Python is already in your PATH). On macOS, open Terminal from Applications → Utilities. On Linux, open your default terminal. All pip install commands work the same across platforms.

Step 0: Verify Your Setup

Before starting, verify that your environment is configured correctly. Run this cell first to confirm all dependencies are installed and data files are accessible.

# Step 0: Verify your setup
try:
    import pandas as pd
    import numpy as np
    from xgboost import XGBClassifier
    from demo_data.load_demo_data import load_transformers

    xfmrs = load_transformers()
    print(f"Setup OK! Loaded {len(xfmrs):,} transformers.")
except ModuleNotFoundError as e:
    print(f"Missing library: {e}")
    print("Run: pip install -r requirements.txt")
except FileNotFoundError:
    print("Data files not found. Run from the repo root:")
    print("  cd Dynamic-Network-Model && jupyter lab")
Setup OK! Loaded N transformers.

Working directory: All guides assume your working directory is the repository root (Dynamic-Network-Model/). Start Jupyter Lab from there: cd Dynamic-Network-Model && jupyter lab

Extra dependency: pip install xgboost

Having trouble? Check our Troubleshooting Guide for solutions to common setup and data loading issues.

Step 1: Load the Data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score, roc_curve
from demo_data.load_demo_data import (
    load_transformers, load_outage_history, load_weather_data
)

transformers = load_transformers()
outages = load_outage_history()

print(f"Transformers: {len(transformers)}")
print(f"Outage events: {len(outages)}")
Step 2: Explore the Transformer Data

# What columns do we have?
print(transformers.columns.tolist())
print(transformers.head())

# Distribution of transformer age
plt.figure(figsize=(8, 4))
plt.hist(transformers["age_years"], bins=20, color="#5FCCDB", edgecolor="white")
plt.title("Transformer Age Distribution")
plt.xlabel("Age (years)")
plt.ylabel("Count")
plt.tight_layout()
plt.show()
# Rated kVA distribution
plt.figure(figsize=(8, 4))
plt.hist(transformers["kva_rating"], bins=20, color="#2D6A7A", edgecolor="white")
plt.title("Transformer Rated kVA Distribution")
plt.xlabel("Rated kVA")
plt.ylabel("Count")
plt.tight_layout()
plt.show()
Step 3: Create the Failure Target

We need to label each transformer: has it experienced an equipment-failure outage? We'll use the outage history to identify feeders linked to "equipment failure" causes and flag the transformers on those feeders.

# Filter outages to equipment failures only
equip_failures = outages[outages["cause_code"] == "equipment failure"]

# Count equipment-failure outages per feeder
failure_counts = equip_failures.groupby("feeder_id").size().reset_index(
    name="failure_count"
)

# Merge with transformer table on feeder_id
df = transformers.merge(failure_counts, on="feeder_id", how="left")
df["failure_count"] = df["failure_count"].fillna(0).astype(int)

# Binary target: has this transformer's feeder had equipment failures?
df["has_failed"] = (df["failure_count"] > 0).astype(int)

print(f"Transformers with failures: {df['has_failed'].sum()}")
print(f"Transformers without failures: {(df['has_failed'] == 0).sum()}")
Step 4: Engineer Maintenance Features

Outage history can serve as a proxy for maintenance exposure. Feeders with frequent or long-duration outages suggest areas where equipment is under greater stress.

# Count all outage events per feeder
outage_stats = outages.groupby("feeder_id").agg(
    total_outages=("fault_detected", "count"),
    avg_outage_duration=("duration_hours", "mean")
).reset_index()

# Merge into main table
df = df.merge(outage_stats, on="feeder_id", how="left")

# Fill feeders with no outage records
df["total_outages"] = df["total_outages"].fillna(0)
df["avg_outage_duration"] = df["avg_outage_duration"].fillna(0)

print(df[["feeder_id", "age_years", "kva_rating", "total_outages", "has_failed"]].head(10))

Why use outage statistics as features? Even without a dedicated maintenance log, the number and average duration of outages on a feeder capture real-world stress. Feeders with many long outages are likely serving equipment under greater strain.

Step 5: Prepare Features and Split

# Define features
feature_cols = [
    "age_years", "kva_rating", "total_outages", "avg_outage_duration"
]
X = df[feature_cols]
y = df["has_failed"]

# Split 70/30 (smaller dataset so we keep more for testing)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
Step 6: Train the XGBoost Model

# Calculate class imbalance ratio for XGBoost
neg = (y_train == 0).sum()
pos = (y_train == 1).sum()

model = XGBClassifier(
    n_estimators=100,
    max_depth=4,
    learning_rate=0.1,
    scale_pos_weight=neg / pos,  # handle class imbalance
    random_state=42,
    eval_metric="logloss"
)
model.fit(X_train, y_train)
print("XGBoost training complete.")
Step 7: Test and Evaluate

# Predict on the test set
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]  # probability of failure

# Classification report
print(classification_report(y_test, y_pred, target_names=["No Failure", "Failure"]))

# AUC score
auc = roc_auc_score(y_test, y_prob)
print(f"AUC-ROC Score: {auc:.3f}")

What is AUC-ROC? AUC (Area Under the ROC Curve) measures how well the model distinguishes between positive and negative classes across all probability thresholds. A score of 1.0 is perfect, 0.5 is random guessing. For maintenance prioritization, anything above 0.7 is useful because you don't need perfect accuracy—you just need to rank assets by risk.
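The ranking interpretation can be checked by hand: AUC equals the fraction of (positive, negative) pairs in which the positive gets the higher score. Here is a tiny worked example with made-up labels and scores (not output from the model above):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted failure probabilities
y_true = np.array([0, 0, 0, 1, 1])
y_score = np.array([0.1, 0.75, 0.35, 0.8, 0.7])

# AUC via sklearn
auc = roc_auc_score(y_true, y_score)

# Same number computed directly: fraction of (positive, negative)
# pairs where the positive is scored higher (ties count half)
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
pairs = [(p > n) + 0.5 * (p == n) for p in pos for n in neg]
print(auc, sum(pairs) / len(pairs))  # both 5/6 ≈ 0.833
```

One negative (0.75) outranks one positive (0.7), so 5 of the 6 pairs are ordered correctly and the AUC is 5/6.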

Step 8: Plot the ROC Curve

# ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_prob)

fig, ax = plt.subplots(figsize=(6, 6))
ax.plot(fpr, tpr, color="#5FCCDB", linewidth=2, label=f"XGBoost (AUC = {auc:.3f})")
ax.plot([0, 1], [0, 1], color="gray", linestyle="--", label="Random Guess")
ax.set_xlabel("False Positive Rate")
ax.set_ylabel("True Positive Rate")
ax.set_title("ROC Curve: Transformer Failure Prediction")
ax.legend()
plt.tight_layout()
plt.show()
Step 9: Generate a Risk-Ranked Asset List

The real value of this model is not just accuracy—it's the ability to produce a prioritized list of assets that maintenance crews can act on.

# Score every transformer (not just the test set)
df["failure_risk_score"] = model.predict_proba(df[feature_cols])[:, 1]

# Sort by risk (highest first)
risk_list = df.sort_values("failure_risk_score", ascending=False)

print("Top 10 Highest-Risk Transformers:")
print(risk_list[["feeder_id", "age_years", "kva_rating", "failure_risk_score"]].head(10).to_string(index=False))
Step 10: Feature Importance

# Which factors contribute most to failure risk?
importances = pd.Series(model.feature_importances_, index=feature_cols)
importances = importances.sort_values(ascending=True)

fig, ax = plt.subplots(figsize=(8, 5))
importances.plot(kind="barh", color="#5FCCDB", ax=ax)
ax.set_title("Feature Importance: What Drives Transformer Failure?")
ax.set_xlabel("Importance Score")
plt.tight_layout()
plt.show()

You will typically see age_years and total_outages at the top. Older transformers on feeders with frequent outages are the highest-risk assets—which aligns with engineering intuition and validates the model.
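Built-in tree importances can overweight features the trees happen to split on often, so it is worth cross-checking with scikit-learn's model-agnostic permutation_importance. The sketch below runs on synthetic stand-in data (column names mirror this guide, but the values and coefficients are invented); with the real model you would pass `model`, `X_test`, and `y_test` from the steps above instead.

```python
# Sketch: cross-check importances with permutation importance.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "age_years": rng.uniform(0, 50, n),
    "kva_rating": rng.choice([25, 50, 100], n),
    "total_outages": rng.poisson(2, n),
    "avg_outage_duration": rng.exponential(3, n),
})
# In this toy setup, failures are driven mainly by age and outage count
logit = 0.08 * X["age_years"] + 0.5 * X["total_outages"] - 3.5
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)

clf = GradientBoostingClassifier(random_state=0).fit(X, y)
result = permutation_importance(clf, X, y, n_repeats=10, random_state=0)
ranked = pd.Series(result.importances_mean, index=X.columns).sort_values(ascending=False)
print(ranked)
```

Permutation importance measures how much the score drops when one column is shuffled, so irrelevant features like the noise columns here land near zero.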

What You Built and Next Steps

  1. Loaded transformer and outage data from the SP&L data loader API
  2. Created a binary failure target from equipment-failure outage records
  3. Engineered features from asset age, kVA rating, and outage history
  4. Trained an XGBoost classifier with class-imbalance handling
  5. Evaluated performance with classification report and ROC curve
  6. Generated a risk-ranked asset list for maintenance prioritization

Ideas to Try Next

  • Add weather exposure: Calculate cumulative storm exposure per transformer from the weather data
  • Survival analysis: Use the lifelines library to model time-to-failure instead of binary failure
  • Include loading history: Use peak loading percentages from feeder load data to measure stress over time
  • Extend to network edges: Apply the same approach using load_network_edges() data for conductor failure analysis
  • Cost-benefit analysis: Combine failure probability with replacement cost and outage impact to optimize capital spending
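The cost-benefit idea in the last bullet fits in a few lines of pandas. Everything below is invented for illustration (transformer IDs, costs, the per-customer outage penalty, and the risk scores, which in practice would come from the Step 9 model):

```python
import pandas as pd

# Hypothetical scored assets (failure_risk_score as produced in Step 9)
assets = pd.DataFrame({
    "transformer_id": ["T1", "T2", "T3", "T4"],
    "failure_risk_score": [0.10, 0.62, 0.35, 0.80],
    "replacement_cost": [8000, 12000, 9000, 7000],  # $ per unit (invented)
    "customers_affected": [40, 15, 120, 25],        # outage impact proxy
})

# Expected loss = failure probability x (replacement cost + outage penalty)
OUTAGE_COST_PER_CUSTOMER = 150  # assumed $ per interrupted customer
assets["expected_loss"] = assets["failure_risk_score"] * (
    assets["replacement_cost"]
    + OUTAGE_COST_PER_CUSTOMER * assets["customers_affected"]
)

# Rank by expected loss rather than raw probability
plan = assets.sort_values("expected_loss", ascending=False)
print(plan[["transformer_id", "failure_risk_score", "expected_loss"]])
```

Note how T3 tops the list even though its failure probability is well below T4's: it serves far more customers, so its expected loss is higher. That is the point of ranking by cost-weighted risk instead of probability alone.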

Key Terms Glossary

  • XGBoost — an optimized gradient boosting library for high-performance ML
  • Predictive maintenance — using data to predict failures before they occur, replacing time-based schedules
  • AUC-ROC — measures how well the model distinguishes between classes; 1.0 = perfect, 0.5 = random
  • Class imbalance — when one category (e.g., "no failure") is much more common than the other
  • scale_pos_weight — XGBoost parameter that compensates for class imbalance
  • Risk score — the model's predicted probability of failure, used to rank assets

Ready to Level Up?

In the advanced guide, you'll use survival analysis to predict when transformers will fail and build risk-prioritized replacement schedules.

Go to Advanced Predictive Maintenance →