Guide 04

Predictive Asset Maintenance with XGBoost

Prefer not to install anything? Click the badge above to open this guide as a runnable notebook in Google Colab. Sign in with any Google account, then use Runtime → Run all to execute every cell, or step through them one at a time.

What You Will Learn

Utilities spend billions of dollars maintaining and replacing aging infrastructure. Instead of replacing equipment on a fixed schedule (time-based maintenance), predictive maintenance uses data to identify which assets are most likely to fail soon—so crews can prioritize the right work. In this guide you will:

  • Load transformer and outage data from the SP&L dataset
  • Engineer features from asset age, kVA rating, and outage history
  • Train an XGBoost classifier to predict transformer failure risk
  • Evaluate your model and generate a risk-ranked asset list
  • Visualize which factors contribute most to failure risk

What is XGBoost? XGBoost (eXtreme Gradient Boosting) is an optimized version of Gradient Boosting that trains faster and often produces more accurate results. It is one of the most widely used ML algorithms in industry and dominates tabular data competitions on Kaggle.
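To get a feel for the fit/predict workflow before the full guide below, here is a minimal gradient-boosting sketch on made-up synthetic data. It uses scikit-learn's GradientBoostingClassifier as a stand-in so it runs even before you install xgboost; XGBoost's XGBClassifier exposes a nearly identical interface. The feature names and coefficients are invented for illustration.

```python
# Minimal gradient-boosting sketch on synthetic data.
# XGBClassifier follows the same fit/predict pattern, just faster
# and with extra regularization options.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 500
age = rng.uniform(0, 50, n)        # hypothetical asset ages (years)
load = rng.uniform(0.2, 1.2, n)    # hypothetical loading ratio
# Failure is more likely for old, heavily loaded assets
prob = 1 / (1 + np.exp(-(0.08 * age + 2.0 * load - 5.0)))
y = (rng.uniform(size=n) < prob).astype(int)
X = np.column_stack([age, load])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=0)
clf.fit(X_tr, y_tr)
print(f"Test accuracy: {clf.score(X_te, y_te):.2f}")
```

Swapping `GradientBoostingClassifier` for `XGBClassifier` in a sketch like this is usually a one-line change.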

SP&L Data You Will Use

  • transformers.csv (load_transformers()) — ~21,000 transformers with kVA rating, age_years, manufacturer, phase, and status
  • outage_history.csv (load_outage_history()) — outage events linked to equipment failures with cause, duration, and customers affected
  • weather_data.csv (load_weather_data()) — weather exposure data for environmental stress analysis

Additional Libraries

pip install xgboost

Which terminal should I use? On Windows, open Anaconda Prompt from the Start Menu (or PowerShell / Command Prompt if Python is already in your PATH). On macOS, open Terminal from Applications → Utilities. On Linux, open your default terminal. All pip install commands work the same across platforms.

Step 0: Verify Your Setup

Before starting, verify that your environment is configured correctly. Run this cell first to confirm all dependencies are installed and data files are accessible.

# Step 0: Verify your setup
try:
    import pandas as pd
    import numpy as np
    from xgboost import XGBClassifier
    from demo_data.load_demo_data import load_transformers

    xfmrs = load_transformers()
    print(f"Setup OK! Loaded {len(xfmrs):,} transformers.")
except ModuleNotFoundError as e:
    print(f"Missing library: {e}")
    print("Run: pip install -r requirements.txt")
except FileNotFoundError:
    print("Data files not found. Run from the repo root:")
    print("  cd Dynamic-Network-Model && jupyter lab")
Setup OK! Loaded N transformers.

Working directory: All guides assume your working directory is the repository root (Dynamic-Network-Model/). Start Jupyter Lab from there: cd Dynamic-Network-Model && jupyter lab

Extra dependency: pip install xgboost

Having trouble? Check our Troubleshooting Guide for solutions to common setup and data loading issues.

Step 1: Load the Data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score, roc_curve
from demo_data.load_demo_data import (
    load_transformers, load_outage_history, load_weather_data
)

transformers = load_transformers()
outages = load_outage_history()

print(f"Transformers: {len(transformers)}")
print(f"Outage events: {len(outages)}")
Step 2: Explore the Transformer Data

# What columns do we have?
print(transformers.columns.tolist())
print(transformers.head())

# Distribution of transformer age
plt.figure(figsize=(8, 4))
plt.hist(transformers["age_years"], bins=20, color="#5FCCDB", edgecolor="white")
plt.title("Transformer Age Distribution")
plt.xlabel("Age (years)")
plt.ylabel("Count")
plt.tight_layout()
plt.show()
# Rated kVA distribution
plt.figure(figsize=(8, 4))
plt.hist(transformers["kva_rating"], bins=20, color="#2D6A7A", edgecolor="white")
plt.title("Transformer Rated kVA Distribution")
plt.xlabel("Rated kVA")
plt.ylabel("Count")
plt.tight_layout()
plt.show()
Step 3: Create the Failure Target

We need to label each transformer: has it experienced an equipment-failure outage? We'll use the outage history to identify feeders linked to "equipment failure" causes and flag the transformers on those feeders.

# Filter outages to equipment failures only
equip_failures = outages[outages["cause_code"] == "equipment failure"]

# Count equipment-failure outages per feeder
failure_counts = equip_failures.groupby("feeder_id").size().reset_index(
    name="failure_count"
)

# Merge with transformer table on feeder_id
df = transformers.merge(failure_counts, on="feeder_id", how="left")
df["failure_count"] = df["failure_count"].fillna(0).astype(int)

# Binary target: has this transformer's feeder had equipment failures?
df["has_failed"] = (df["failure_count"] > 0).astype(int)

print(f"Transformers with failures: {df['has_failed'].sum()}")
print(f"Transformers without failures: {(df['has_failed'] == 0).sum()}")
Step 4: Engineer Maintenance Features

Outage history can serve as a proxy for maintenance exposure. Feeders with frequent or long-duration outages suggest areas where equipment is under greater stress.

# Count all outage events per feeder
outage_stats = outages.groupby("feeder_id").agg(
    total_outages=("fault_detected", "count"),
    avg_outage_duration=("duration_hours", "mean")
).reset_index()

# Merge into main table
df = df.merge(outage_stats, on="feeder_id", how="left")

# Fill feeders with no outage records
df["total_outages"] = df["total_outages"].fillna(0)
df["avg_outage_duration"] = df["avg_outage_duration"].fillna(0)

print(df[["feeder_id", "age_years", "kva_rating", "total_outages", "has_failed"]].head(10))

Why use outage statistics as features? Even without a dedicated maintenance log, the number and average duration of outages on a feeder capture real-world stress. Feeders with many long outages are likely serving equipment under greater strain.

Step 5: Prepare Features and Split

# Define features
feature_cols = [
    "age_years", "kva_rating", "total_outages", "avg_outage_duration"
]
X = df[feature_cols]
y = df["has_failed"]

# Split 70/30 (smaller dataset so we keep more for testing)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
Step 6: Train the XGBoost Model

# Calculate class imbalance ratio for XGBoost
neg = (y_train == 0).sum()
pos = (y_train == 1).sum()

model = XGBClassifier(
    n_estimators=100,
    max_depth=4,
    learning_rate=0.1,
    scale_pos_weight=neg / pos,  # handle class imbalance
    random_state=42,
    eval_metric="logloss"
)
model.fit(X_train, y_train)
print("XGBoost training complete.")
Step 7: Test and Evaluate

# Predict on the test set
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]  # probability of failure

# Classification report
print(classification_report(y_test, y_pred, target_names=["No Failure", "Failure"]))

# AUC score
auc = roc_auc_score(y_test, y_prob)
print(f"AUC-ROC Score: {auc:.3f}")

What is AUC-ROC? AUC (Area Under the ROC Curve) measures how well the model distinguishes between positive and negative classes across all probability thresholds. A score of 1.0 is perfect, 0.5 is random guessing. For maintenance prioritization, anything above 0.7 is useful because you don't need perfect accuracy—you just need to rank assets by risk.
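The ranking interpretation can be checked by hand: AUC equals the fraction of (positive, negative) pairs in which the positive gets the higher score. Here is a tiny worked example with made-up labels and scores (not output from the model above):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted failure probabilities
y_true = np.array([0, 0, 0, 1, 1])
y_score = np.array([0.1, 0.75, 0.35, 0.8, 0.7])

# AUC via sklearn
auc = roc_auc_score(y_true, y_score)

# Same number computed directly: fraction of (positive, negative)
# pairs where the positive is scored higher (ties count half)
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
pairs = [(p > n) + 0.5 * (p == n) for p in pos for n in neg]
print(auc, sum(pairs) / len(pairs))  # both 5/6 ≈ 0.833
```

One negative (0.75) outranks one positive (0.7), so 5 of the 6 pairs are ordered correctly and the AUC is 5/6.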

Step 8: Plot the ROC Curve

# ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_prob)

fig, ax = plt.subplots(figsize=(6, 6))
ax.plot(fpr, tpr, color="#5FCCDB", linewidth=2, label=f"XGBoost (AUC = {auc:.3f})")
ax.plot([0, 1], [0, 1], color="gray", linestyle="--", label="Random Guess")
ax.set_xlabel("False Positive Rate")
ax.set_ylabel("True Positive Rate")
ax.set_title("ROC Curve: Transformer Failure Prediction")
ax.legend()
plt.tight_layout()
plt.show()
Step 9: Generate a Risk-Ranked Asset List

The real value of this model is not just accuracy—it's the ability to produce a prioritized list of assets that maintenance crews can act on.

# Score every transformer (not just the test set)
df["failure_risk_score"] = model.predict_proba(df[feature_cols])[:, 1]

# Sort by risk (highest first)
risk_list = df.sort_values("failure_risk_score", ascending=False)

print("Top 10 Highest-Risk Transformers:")
print(risk_list[["feeder_id", "age_years", "kva_rating", "failure_risk_score"]].head(10).to_string(index=False))
Step 10: Feature Importance

# Which factors contribute most to failure risk?
importances = pd.Series(model.feature_importances_, index=feature_cols)
importances = importances.sort_values(ascending=True)

fig, ax = plt.subplots(figsize=(8, 5))
importances.plot(kind="barh", color="#5FCCDB", ax=ax)
ax.set_title("Feature Importance: What Drives Transformer Failure?")
ax.set_xlabel("Importance Score")
plt.tight_layout()
plt.show()

You will typically see age_years and total_outages at the top. Older transformers on feeders with frequent outages are the highest-risk assets—which aligns with engineering intuition and validates the model.
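Built-in tree importances can overweight features the trees happen to split on often, so it is worth cross-checking with scikit-learn's model-agnostic permutation_importance. The sketch below runs on synthetic stand-in data (column names mirror this guide, but the values and coefficients are invented); with the real model you would pass `model`, `X_test`, and `y_test` from the steps above instead.

```python
# Sketch: cross-check importances with permutation importance.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "age_years": rng.uniform(0, 50, n),
    "kva_rating": rng.choice([25, 50, 100], n),
    "total_outages": rng.poisson(2, n),
    "avg_outage_duration": rng.exponential(3, n),
})
# In this toy setup, failures are driven mainly by age and outage count
logit = 0.08 * X["age_years"] + 0.5 * X["total_outages"] - 3.5
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)

clf = GradientBoostingClassifier(random_state=0).fit(X, y)
result = permutation_importance(clf, X, y, n_repeats=10, random_state=0)
ranked = pd.Series(result.importances_mean, index=X.columns).sort_values(ascending=False)
print(ranked)
```

Permutation importance measures how much the score drops when one column is shuffled, so irrelevant features like the noise columns here land near zero.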

What You Built and Next Steps

  1. Loaded transformer and outage data from the SP&L data loader API
  2. Created a binary failure target from equipment-failure outage records
  3. Engineered features from asset age, kVA rating, and outage history
  4. Trained an XGBoost classifier with class-imbalance handling
  5. Evaluated performance with classification report and ROC curve
  6. Generated a risk-ranked asset list for maintenance prioritization

Ideas to Try Next

  • Add weather exposure: Calculate cumulative storm exposure per transformer from the weather data
  • Survival analysis: Use the lifelines library to model time-to-failure instead of binary failure
  • Include loading history: Use peak loading percentages from feeder load data to measure stress over time
  • Extend to network edges: Apply the same approach using load_network_edges() data for conductor failure analysis
  • Cost-benefit analysis: Combine failure probability with replacement cost and outage impact to optimize capital spending
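The cost-benefit idea in the last bullet fits in a few lines of pandas. Everything below is invented for illustration (transformer IDs, costs, the per-customer outage penalty, and the risk scores, which in practice would come from the Step 9 model):

```python
import pandas as pd

# Hypothetical scored assets (failure_risk_score as produced in Step 9)
assets = pd.DataFrame({
    "transformer_id": ["T1", "T2", "T3", "T4"],
    "failure_risk_score": [0.10, 0.62, 0.35, 0.80],
    "replacement_cost": [8000, 12000, 9000, 7000],  # $ per unit (invented)
    "customers_affected": [40, 15, 120, 25],        # outage impact proxy
})

# Expected loss = failure probability x (replacement cost + outage penalty)
OUTAGE_COST_PER_CUSTOMER = 150  # assumed $ per interrupted customer
assets["expected_loss"] = assets["failure_risk_score"] * (
    assets["replacement_cost"]
    + OUTAGE_COST_PER_CUSTOMER * assets["customers_affected"]
)

# Rank by expected loss rather than raw probability
plan = assets.sort_values("expected_loss", ascending=False)
print(plan[["transformer_id", "failure_risk_score", "expected_loss"]])
```

Note how T3 tops the list even though its failure probability is well below T4's: it serves far more customers, so its expected loss is higher. That is the point of ranking by cost-weighted risk instead of probability alone.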

Key Terms Glossary

  • XGBoost — an optimized gradient boosting library for high-performance ML
  • Predictive maintenance — using data to predict failures before they occur, replacing time-based schedules
  • AUC-ROC — measures how well the model distinguishes between classes; 1.0 = perfect, 0.5 = random
  • Class imbalance — when one category (e.g., "no failure") is much more common than the other
  • scale_pos_weight — XGBoost parameter that compensates for class imbalance
  • Risk score — the model's predicted probability of failure, used to rank assets

Ready to Level Up?

In the advanced guide, you'll use survival analysis to predict when transformers will fail and build risk-prioritized replacement schedules.

Go to Advanced Predictive Maintenance →