Guide 04

Predictive Asset Maintenance with XGBoost

What You Will Learn

Utilities spend billions of dollars maintaining and replacing aging infrastructure. Instead of replacing equipment on a fixed schedule (time-based maintenance), predictive maintenance uses data to identify which assets are most likely to fail soon—so crews can prioritize the right work. In this guide you will:

  • Load transformer, maintenance, and outage data from the SP&L dataset
  • Engineer features from asset age, condition scores, loading history, and failure records
  • Train an XGBoost classifier to predict transformer failure risk
  • Evaluate your model and generate a risk-ranked asset list
  • Visualize which factors contribute most to failure risk

What is XGBoost? XGBoost (eXtreme Gradient Boosting) is an optimized version of Gradient Boosting that trains faster and often produces more accurate results. It is one of the most widely used ML algorithms in industry and dominates tabular data competitions on Kaggle.

SP&L Data You Will Use

  • assets/transformers.csv — 86 transformers with kVA rating, installation year, manufacturer, type, and health index (1–5)
  • assets/maintenance_log.csv — inspection dates, work orders, and replacement records
  • outages/outage_events.csv — historical outage events linked to equipment failures
  • weather/hourly_observations.csv — weather exposure data for environmental stress analysis

Additional Libraries

pip install xgboost

Which terminal should I use? On Windows, open Anaconda Prompt from the Start Menu (or PowerShell / Command Prompt if Python is already in your PATH). On macOS, open Terminal from Applications → Utilities. On Linux, open your default terminal. All pip install commands work the same across platforms.
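
To confirm the installation worked, you can print the library version from the same environment (a quick sanity check; adjust the command if you launch Python differently):

python -c "import xgboost; print(xgboost.__version__)"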

1. Load the Data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score, roc_curve

# Point this to your local clone of the SP&L repo
# Windows example: "C:/Users/YourName/Documents/sisyphean-power-and-light/"
# macOS example: "/Users/YourName/Documents/sisyphean-power-and-light/"
# Tip: Python on Windows accepts forward slashes — no backslashes needed
DATA_DIR = "sisyphean-power-and-light/"

# Load asset data
transformers = pd.read_csv(DATA_DIR + "assets/transformers.csv")
maintenance = pd.read_csv(DATA_DIR + "assets/maintenance_log.csv",
                          parse_dates=["inspection_date"])
outages = pd.read_csv(DATA_DIR + "outages/outage_events.csv",
                      parse_dates=["fault_detected"])

print(f"Transformers: {len(transformers)}")
print(f"Maintenance logs: {len(maintenance)}")
print(f"Outage events: {len(outages)}")

2. Explore the Transformer Data

# What columns do we have?
print(transformers.columns.tolist())
print(transformers.head())

# Distribution of health index scores (1 = worst, 5 = best)
transformers["health_index"].value_counts().sort_index().plot(
    kind="bar", color="#5FCCDB", title="Transformer Health Index Distribution"
)
plt.xlabel("Health Index (1=Poor, 5=Excellent)")
plt.ylabel("Count")
plt.tight_layout()
plt.show()

# Age distribution
transformers["age_years"] = 2025 - transformers["install_year"]

plt.figure(figsize=(8, 4))
plt.hist(transformers["age_years"], bins=20, color="#2D6A7A", edgecolor="white")
plt.title("Transformer Age Distribution")
plt.xlabel("Age (years)")
plt.ylabel("Count")
plt.tight_layout()
plt.show()

3. Create the Failure Target

We need to label each transformer: has it experienced an equipment-failure outage? We'll use the outage event log to identify transformers linked to "equipment_failure" cause codes.

# Filter outages to equipment failures only
equip_failures = outages[outages["cause_code"] == "equipment_failure"]

# Count equipment-failure outages per transformer
failure_counts = equip_failures.groupby("transformer_id").size().reset_index(
    name="failure_count"
)

# Merge with transformer table
df = transformers.merge(failure_counts, on="transformer_id", how="left")
df["failure_count"] = df["failure_count"].fillna(0).astype(int)

# Binary target: has this transformer ever failed?
df["has_failed"] = (df["failure_count"] > 0).astype(int)

print(f"Transformers with failures: {df['has_failed'].sum()}")
print(f"Transformers without failures: {(df['has_failed'] == 0).sum()}")

4. Engineer Maintenance Features

Maintenance history tells us a lot about asset health. Transformers with many work orders, or long gaps since the last inspection, may be at higher risk.

# Count maintenance events per transformer
maint_counts = maintenance.groupby("transformer_id").agg(
    total_work_orders=("work_order_id", "count"),
    last_inspection=("inspection_date", "max")
).reset_index()

# Days since last inspection
maint_counts["days_since_inspection"] = (
    pd.Timestamp("2025-01-01") - maint_counts["last_inspection"]
).dt.days

# Merge into main table
df = df.merge(
    maint_counts[["transformer_id", "total_work_orders", "days_since_inspection"]],
    on="transformer_id", how="left"
)

# Fill transformers with no maintenance records
df["total_work_orders"] = df["total_work_orders"].fillna(0)
df["days_since_inspection"] = df["days_since_inspection"].fillna(9999)

print(df[["transformer_id", "age_years", "health_index",
          "total_work_orders", "has_failed"]].head(10))

Why fill missing values with 9999? If a transformer has no inspection record, it means it hasn't been inspected recently (or ever). Using a large number for days_since_inspection encodes this "never inspected" state as high risk, which makes intuitive sense.
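
If you would rather not lean on a sentinel value, one variation (a sketch only; the rest of this guide assumes the 9999 approach, and the never_inspected column name is just an example) is to run the following instead of the fillna(9999) line above:

# Variation (sketch): run this INSTEAD of the fillna(9999) line above.
# An explicit flag keeps "never inspected" visible as its own feature,
# and the remaining gaps get a neutral fill (here, the median).
df["never_inspected"] = df["days_since_inspection"].isna().astype(int)
df["days_since_inspection"] = df["days_since_inspection"].fillna(
    df["days_since_inspection"].median()
)
# If you use this, add "never_inspected" to feature_cols in the next step.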

5. Prepare Features and Split

# Encode the transformer type as a number
df["type_code"] = df["type"].map({"oil": 0, "dry": 1}).fillna(0)

# Define features
feature_cols = [
    "age_years", "kva_rating", "health_index", "type_code",
    "total_work_orders", "days_since_inspection"
]
X = df[feature_cols]
y = df["has_failed"]

# Split 70/30 (smaller dataset so we keep more for testing)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")

6. Train the XGBoost Model

# Calculate class imbalance ratio for XGBoost
neg = (y_train == 0).sum()
pos = (y_train == 1).sum()

model = XGBClassifier(
    n_estimators=100,
    max_depth=4,
    learning_rate=0.1,
    scale_pos_weight=neg / pos,  # handle class imbalance
    random_state=42,
    eval_metric="logloss"
)
model.fit(X_train, y_train)
print("XGBoost training complete.")

7. Test and Evaluate

# Predict on the test set
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]  # probability of failure

# Classification report
print(classification_report(y_test, y_pred, target_names=["No Failure", "Failure"]))

# AUC score
auc = roc_auc_score(y_test, y_prob)
print(f"AUC-ROC Score: {auc:.3f}")

What is AUC-ROC? AUC (Area Under the ROC Curve) measures how well the model distinguishes between positive and negative classes across all probability thresholds. A score of 1.0 is perfect, 0.5 is random guessing. For maintenance prioritization, anything above 0.7 is useful because you don't need perfect accuracy—you just need to rank assets by risk.
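
A tiny illustration of how AUC rewards ranking rather than exact probabilities (toy values, not from the SP&L data):

# Toy example: AUC only cares about whether failures are ranked above non-failures.
# Here 3 of the 4 (positive, negative) pairs are ordered correctly, so AUC = 0.75.
from sklearn.metrics import roc_auc_score
toy_truth = [0, 0, 1, 1]
toy_scores = [0.10, 0.40, 0.35, 0.80]
print(roc_auc_score(toy_truth, toy_scores))  # 0.75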

8. Plot the ROC Curve

# ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_prob)

fig, ax = plt.subplots(figsize=(6, 6))
ax.plot(fpr, tpr, color="#5FCCDB", linewidth=2, label=f"XGBoost (AUC = {auc:.3f})")
ax.plot([0, 1], [0, 1], color="gray", linestyle="--", label="Random Guess")
ax.set_xlabel("False Positive Rate")
ax.set_ylabel("True Positive Rate")
ax.set_title("ROC Curve: Transformer Failure Prediction")
ax.legend()
plt.tight_layout()
plt.show()

9. Generate a Risk-Ranked Asset List

The real value of this model is not just accuracy—it's the ability to produce a prioritized list of assets that maintenance crews can act on.

# Score every transformer (not just the test set)
df["failure_risk_score"] = model.predict_proba(df[feature_cols])[:, 1]

# Sort by risk (highest first)
risk_list = df.sort_values("failure_risk_score", ascending=False)

print("Top 10 Highest-Risk Transformers:")
print(risk_list[["transformer_id", "age_years", "health_index",
                 "failure_risk_score"]].head(10).to_string(index=False))
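
If you want to hand this list to planners outside the notebook, a simple follow-on is to write it to CSV (the file name below is just an example):

# Save the ranked list so it can be shared outside the notebook
risk_list[["transformer_id", "age_years", "health_index", "failure_risk_score"]].to_csv(
    "transformer_risk_ranking.csv", index=False
)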

10. Feature Importance

# Which factors contribute most to failure risk?
importances = pd.Series(model.feature_importances_, index=feature_cols)
importances = importances.sort_values(ascending=True)

fig, ax = plt.subplots(figsize=(8, 5))
importances.plot(kind="barh", color="#5FCCDB", ax=ax)
ax.set_title("Feature Importance: What Drives Transformer Failure?")
ax.set_xlabel("Importance Score")
plt.tight_layout()
plt.show()

You will typically see age_years and health_index at the top. Older transformers with poor health scores are the highest-risk assets—which aligns with engineering intuition and validates the model.
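
XGBoost's built-in feature_importances_ reflects how features were used in tree splits, which can be sensitive to correlated inputs. As a cross-check, you can compute permutation importance on the test set; this is a sketch using scikit-learn's permutation_importance with the model and split from earlier steps.

# Cross-check (sketch): permutation importance measures how much the AUC drops
# when each feature's values are shuffled on the held-out test set.
from sklearn.inspection import permutation_importance

perm = permutation_importance(
    model, X_test, y_test, scoring="roc_auc", n_repeats=20, random_state=42
)
perm_scores = pd.Series(perm.importances_mean, index=feature_cols).sort_values()
print(perm_scores)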

What You Built and Next Steps

  1. Loaded transformer, maintenance, and outage data from the SP&L repository
  2. Created a binary failure target from equipment-failure outage records
  3. Engineered features from asset age, condition, and maintenance history
  4. Trained an XGBoost classifier with class-imbalance handling
  5. Evaluated performance with classification report and ROC curve
  6. Generated a risk-ranked asset list for maintenance prioritization

Ideas to Try Next

  • Add weather exposure: Calculate cumulative storm exposure per transformer from the weather data
  • Survival analysis: Use the lifelines library to model time-to-failure instead of binary failure
  • Include loading history: Use peak loading percentages from feeder load data to measure stress over time
  • Extend to poles and conductors: Apply the same approach to assets/poles.csv and assets/conductors.csv
  • Cost-benefit analysis: Combine failure probability with replacement cost and outage impact to optimize capital spending (see the sketch after this list)
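
As an example of the last idea, here is a minimal sketch of a cost-weighted ranking. The outage and replacement costs below are made-up placeholders, not SP&L figures; in practice you would pull them from asset accounting and reliability data.

# Sketch of a cost-benefit ranking (placeholder costs, not SP&L data):
# expected benefit = P(failure) * cost of an in-service failure - planned replacement cost
placeholder_outage_cost = 25_000      # hypothetical cost of an in-service failure
placeholder_replacement_cost = 8_000  # hypothetical cost of a planned replacement

df["expected_benefit"] = (
    df["failure_risk_score"] * placeholder_outage_cost - placeholder_replacement_cost
)
print(df.sort_values("expected_benefit", ascending=False)
        [["transformer_id", "failure_risk_score", "expected_benefit"]].head(10))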

Key Terms Glossary

  • XGBoost — an optimized gradient boosting library for high-performance ML
  • Predictive maintenance — using data to predict failures before they occur, replacing time-based schedules
  • Health index — a composite score (typically 1–5) representing overall asset condition
  • AUC-ROC — measures how well the model distinguishes between classes; 1.0 = perfect, 0.5 = random
  • Class imbalance — when one category (e.g., "no failure") is much more common than the other
  • scale_pos_weight — XGBoost parameter that compensates for class imbalance
  • Risk score — the model's predicted probability of failure, used to rank assets

Ready to Level Up?

In the advanced guide, you'll use survival analysis to predict when transformers will fail and build risk-prioritized replacement schedules.

Go to Advanced Predictive Maintenance →