Guide 04

Predictive Asset Maintenance with XGBoost

What You Will Learn

Utilities spend billions of dollars maintaining and replacing aging infrastructure. Instead of replacing equipment on a fixed schedule (time-based maintenance), predictive maintenance uses data to identify which assets are most likely to fail soon—so crews can prioritize the right work. In this guide you will:

  • Load transformer, maintenance, and outage data from the SP&L dataset
  • Engineer features from asset age, condition scores, loading history, and failure records
  • Train an XGBoost classifier to predict transformer failure risk
  • Evaluate your model and generate a risk-ranked asset list
  • Visualize which factors contribute most to failure risk

What is XGBoost? XGBoost (eXtreme Gradient Boosting) is an optimized version of Gradient Boosting that trains faster and often produces more accurate results. It is one of the most widely used ML algorithms in industry and dominates tabular data competitions on Kaggle.

SP&L Data You Will Use

  • assets/transformers.csv — 86 transformers with kVA rating, installation year, manufacturer, type, and health index (1–5)
  • assets/maintenance_log.csv — inspection dates, work orders, and replacement records
  • outages/outage_events.csv — historical outage events linked to equipment failures
  • weather/hourly_observations.csv — weather exposure data for environmental stress analysis

Additional Libraries

pip install xgboost

Which terminal should I use? On Windows, open Anaconda Prompt from the Start Menu (or PowerShell / Command Prompt if Python is already in your PATH). On macOS, open Terminal from Applications → Utilities. On Linux, open your default terminal. All pip install commands work the same across platforms.
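
To confirm the installation worked, you can print the library version from the same environment (a quick sanity check; adjust the command if you launch Python differently):

python -c "import xgboost; print(xgboost.__version__)"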

1. Load the Data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score, roc_curve

# Point this to your local clone of the SP&L repo
# Windows example: "C:/Users/YourName/Documents/sisyphean-power-and-light/"
# macOS example: "/Users/YourName/Documents/sisyphean-power-and-light/"
# Tip: Python on Windows accepts forward slashes — no backslashes needed
DATA_DIR = "sisyphean-power-and-light/"

# Load asset data
transformers = pd.read_csv(DATA_DIR + "assets/transformers.csv")
maintenance = pd.read_csv(DATA_DIR + "assets/maintenance_log.csv",
                          parse_dates=["inspection_date"])
outages = pd.read_csv(DATA_DIR + "outages/outage_events.csv",
                      parse_dates=["fault_detected"])

print(f"Transformers: {len(transformers)}")
print(f"Maintenance logs: {len(maintenance)}")
print(f"Outage events: {len(outages)}")

2. Explore the Transformer Data

# What columns do we have?
print(transformers.columns.tolist())
print(transformers.head())

# Distribution of health index scores (1 = worst, 5 = best)
transformers["health_index"].value_counts().sort_index().plot(
    kind="bar", color="#5FCCDB", title="Transformer Health Index Distribution"
)
plt.xlabel("Health Index (1=Poor, 5=Excellent)")
plt.ylabel("Count")
plt.tight_layout()
plt.show()

# Age distribution
transformers["age_years"] = 2025 - transformers["install_year"]

plt.figure(figsize=(8, 4))
plt.hist(transformers["age_years"], bins=20, color="#2D6A7A", edgecolor="white")
plt.title("Transformer Age Distribution")
plt.xlabel("Age (years)")
plt.ylabel("Count")
plt.tight_layout()
plt.show()

3. Create the Failure Target

We need to label each transformer: has it experienced an equipment-failure outage? We'll use the outage event log to identify transformers linked to "equipment_failure" cause codes.

# Filter outages to equipment failures only
equip_failures = outages[outages["cause_code"] == "equipment_failure"]

# Count equipment-failure outages per transformer
failure_counts = equip_failures.groupby("transformer_id").size().reset_index(
    name="failure_count"
)

# Merge with transformer table
df = transformers.merge(failure_counts, on="transformer_id", how="left")
df["failure_count"] = df["failure_count"].fillna(0).astype(int)

# Binary target: has this transformer ever failed?
df["has_failed"] = (df["failure_count"] > 0).astype(int)

print(f"Transformers with failures: {df['has_failed'].sum()}")
print(f"Transformers without failures: {(df['has_failed'] == 0).sum()}")

4. Engineer Maintenance Features

Maintenance history tells us a lot about asset health. Transformers with many work orders, or long gaps since the last inspection, may be at higher risk.

# Count maintenance events per transformer
maint_counts = maintenance.groupby("transformer_id").agg(
    total_work_orders=("work_order_id", "count"),
    last_inspection=("inspection_date", "max")
).reset_index()

# Days since last inspection
maint_counts["days_since_inspection"] = (
    pd.Timestamp("2025-01-01") - maint_counts["last_inspection"]
).dt.days

# Merge into main table
df = df.merge(
    maint_counts[["transformer_id", "total_work_orders", "days_since_inspection"]],
    on="transformer_id", how="left"
)

# Fill transformers with no maintenance records
df["total_work_orders"] = df["total_work_orders"].fillna(0)
df["days_since_inspection"] = df["days_since_inspection"].fillna(9999)

print(df[["transformer_id", "age_years", "health_index",
          "total_work_orders", "has_failed"]].head(10))

Why fill missing values with 9999? If a transformer has no inspection record, it means it hasn't been inspected recently (or ever). Using a large number for days_since_inspection encodes this "never inspected" state as high risk, which makes intuitive sense.
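
If you would rather not lean on a sentinel value, one variation (a sketch only; the rest of this guide assumes the 9999 approach, and the never_inspected column name is just an example) is to run the following instead of the fillna(9999) line above:

# Variation (sketch): run this INSTEAD of the fillna(9999) line above.
# An explicit flag keeps "never inspected" visible as its own feature,
# and the remaining gaps get a neutral fill (here, the median).
df["never_inspected"] = df["days_since_inspection"].isna().astype(int)
df["days_since_inspection"] = df["days_since_inspection"].fillna(
    df["days_since_inspection"].median()
)
# If you use this, add "never_inspected" to feature_cols in the next step.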

5. Prepare Features and Split

# Encode the transformer type as a number
df["type_code"] = df["type"].map({"oil": 0, "dry": 1}).fillna(0)

# Define features
feature_cols = [
    "age_years", "kva_rating", "health_index", "type_code",
    "total_work_orders", "days_since_inspection"
]
X = df[feature_cols]
y = df["has_failed"]

# Split 70/30 (smaller dataset so we keep more for testing)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")

6. Train the XGBoost Model

# Calculate class imbalance ratio for XGBoost
neg = (y_train == 0).sum()
pos = (y_train == 1).sum()

model = XGBClassifier(
    n_estimators=100,
    max_depth=4,
    learning_rate=0.1,
    scale_pos_weight=neg / pos,  # handle class imbalance
    random_state=42,
    eval_metric="logloss"
)
model.fit(X_train, y_train)
print("XGBoost training complete.")

7. Test and Evaluate

# Predict on the test set
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]  # probability of failure

# Classification report
print(classification_report(y_test, y_pred, target_names=["No Failure", "Failure"]))

# AUC score
auc = roc_auc_score(y_test, y_prob)
print(f"AUC-ROC Score: {auc:.3f}")

What is AUC-ROC? AUC (Area Under the ROC Curve) measures how well the model distinguishes between positive and negative classes across all probability thresholds. A score of 1.0 is perfect, 0.5 is random guessing. For maintenance prioritization, anything above 0.7 is useful because you don't need perfect accuracy—you just need to rank assets by risk.
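
A tiny illustration of how AUC rewards ranking rather than exact probabilities (toy values, not from the SP&L data):

# Toy example: AUC only cares about whether failures are ranked above non-failures.
# Here 3 of the 4 (positive, negative) pairs are ordered correctly, so AUC = 0.75.
from sklearn.metrics import roc_auc_score
toy_truth = [0, 0, 1, 1]
toy_scores = [0.10, 0.40, 0.35, 0.80]
print(roc_auc_score(toy_truth, toy_scores))  # 0.75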

8. Plot the ROC Curve

# ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_prob)

fig, ax = plt.subplots(figsize=(6, 6))
ax.plot(fpr, tpr, color="#5FCCDB", linewidth=2, label=f"XGBoost (AUC = {auc:.3f})")
ax.plot([0, 1], [0, 1], color="gray", linestyle="--", label="Random Guess")
ax.set_xlabel("False Positive Rate")
ax.set_ylabel("True Positive Rate")
ax.set_title("ROC Curve: Transformer Failure Prediction")
ax.legend()
plt.tight_layout()
plt.show()

9. Generate a Risk-Ranked Asset List

The real value of this model is not just accuracy—it's the ability to produce a prioritized list of assets that maintenance crews can act on.

# Score every transformer (not just the test set)
df["failure_risk_score"] = model.predict_proba(df[feature_cols])[:, 1]

# Sort by risk (highest first)
risk_list = df.sort_values("failure_risk_score", ascending=False)

print("Top 10 Highest-Risk Transformers:")
print(risk_list[["transformer_id", "age_years", "health_index",
                 "failure_risk_score"]].head(10).to_string(index=False))
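
If you want to hand this list to planners outside the notebook, a simple follow-on is to write it to CSV (the file name below is just an example):

# Save the ranked list so it can be shared outside the notebook
risk_list[["transformer_id", "age_years", "health_index", "failure_risk_score"]].to_csv(
    "transformer_risk_ranking.csv", index=False
)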

10. Feature Importance

# Which factors contribute most to failure risk?
importances = pd.Series(model.feature_importances_, index=feature_cols)
importances = importances.sort_values(ascending=True)

fig, ax = plt.subplots(figsize=(8, 5))
importances.plot(kind="barh", color="#5FCCDB", ax=ax)
ax.set_title("Feature Importance: What Drives Transformer Failure?")
ax.set_xlabel("Importance Score")
plt.tight_layout()
plt.show()

You will typically see age_years and health_index at the top. Older transformers with poor health scores are the highest-risk assets—which aligns with engineering intuition and validates the model.
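
XGBoost's built-in feature_importances_ reflects how features were used in tree splits, which can be sensitive to correlated inputs. As a cross-check, you can compute permutation importance on the test set; this is a sketch using scikit-learn's permutation_importance with the model and split from earlier steps.

# Cross-check (sketch): permutation importance measures how much the AUC drops
# when each feature's values are shuffled on the held-out test set.
from sklearn.inspection import permutation_importance

perm = permutation_importance(
    model, X_test, y_test, scoring="roc_auc", n_repeats=20, random_state=42
)
perm_scores = pd.Series(perm.importances_mean, index=feature_cols).sort_values()
print(perm_scores)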

What You Built and Next Steps

  1. Loaded transformer, maintenance, and outage data from the SP&L repository
  2. Created a binary failure target from equipment-failure outage records
  3. Engineered features from asset age, condition, and maintenance history
  4. Trained an XGBoost classifier with class-imbalance handling
  5. Evaluated performance with classification report and ROC curve
  6. Generated a risk-ranked asset list for maintenance prioritization

Ideas to Try Next

  • Add weather exposure: Calculate cumulative storm exposure per transformer from the weather data
  • Survival analysis: Use the lifelines library to model time-to-failure instead of binary failure
  • Include loading history: Use peak loading percentages from feeder load data to measure stress over time
  • Extend to poles and conductors: Apply the same approach to assets/poles.csv and assets/conductors.csv
  • Cost-benefit analysis: Combine failure probability with replacement cost and outage impact to optimize capital spending (see the sketch after this list)
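
As an example of the last idea, here is a minimal sketch of a cost-weighted ranking. The outage and replacement costs below are made-up placeholders, not SP&L figures; in practice you would pull them from asset accounting and reliability data.

# Sketch of a cost-benefit ranking (placeholder costs, not SP&L data):
# expected benefit = P(failure) * cost of an in-service failure - planned replacement cost
placeholder_outage_cost = 25_000      # hypothetical cost of an in-service failure
placeholder_replacement_cost = 8_000  # hypothetical cost of a planned replacement

df["expected_benefit"] = (
    df["failure_risk_score"] * placeholder_outage_cost - placeholder_replacement_cost
)
print(df.sort_values("expected_benefit", ascending=False)
        [["transformer_id", "failure_risk_score", "expected_benefit"]].head(10))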

Key Terms Glossary

  • XGBoost — an optimized gradient boosting library for high-performance ML
  • Predictive maintenance — using data to predict failures before they occur, replacing time-based schedules
  • Health index — a composite score (typically 1–5) representing overall asset condition
  • AUC-ROC — measures how well the model distinguishes between classes; 1.0 = perfect, 0.5 = random
  • Class imbalance — when one category (e.g., "no failure") is much more common than the other
  • scale_pos_weight — XGBoost parameter that compensates for class imbalance
  • Risk score — the model's predicted probability of failure, used to rank assets

Ready to Level Up?

In the advanced guide, you'll use survival analysis to predict when transformers will fail and build risk-prioritized replacement schedules.

Go to Advanced Predictive Maintenance →